Class: Ferret::Analysis::Analyzer
In: ext/r_analysis.c
Parent: Object
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer.
The default Analyzer just creates a LowerCaseTokenizer which converts all text to lowercase tokens. See LowerCaseTokenizer for more details.
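To make the lowercasing behaviour concrete, here is a minimal plain-Ruby sketch of what a lowercase tokenizer does conceptually. This is an illustration only, not Ferret's C-backed implementation: it splits the input into runs of letters and emits each run as a downcased token.

```ruby
# Hypothetical illustration of lowercase tokenization -- not Ferret's
# LowerCaseTokenizer itself. Split the text into letter runs, then
# downcase each run to produce the tokens.
def lowercase_tokens(text)
  text.scan(/[[:alpha:]]+/).map(&:downcase)
end

puts lowercase_tokens("One TWO three").inspect
# => ["one", "two", "three"]
```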
To create your own custom Analyzer you simply need to implement a token_stream method which takes the field name and the data to be tokenized as parameters and returns a TokenStream. Most analyzers ignore the field name.
Here we'll create a StemmingAnalyzer:

  class MyAnalyzer < Analyzer
    def token_stream(field, str)
      return StemFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)))
    end
  end
Create a new LetterAnalyzer which downcases tokens by default but can optionally leave case as is. Lowercasing will be done based on the current locale.
lower: | set to false if you don't want the field's tokens to be downcased |
Create a new TokenStream to tokenize input. The TokenStream created may also depend on the field_name, although this parameter is typically ignored.
field_name: | name of the field to be tokenized |
input: | data from the field to be tokenized |
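The token_stream contract described above can be sketched in plain Ruby. The class below is a hypothetical stand-in, not one of Ferret's C-backed classes: it accepts the field name (and ignores it, as most analyzers do) and returns an enumerable of tokens in place of a real TokenStream.

```ruby
# Toy sketch of the Analyzer contract: token_stream(field_name, input)
# returns a sequence of tokens derived from input. SimpleAnalyzer is a
# hypothetical illustration, not part of Ferret.
class SimpleAnalyzer
  def token_stream(field_name, input)
    # field_name is accepted but ignored, as most analyzers do
    input.scan(/\S+/).map(&:downcase)
  end
end

analyzer = SimpleAnalyzer.new
puts analyzer.token_stream(:content, "Hello World").inspect
# => ["hello", "world"]
```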