Class Ferret::Analysis::Analyzer
In: ext/r_analysis.c
Parent: Object

Summary

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer.

The default Analyzer just creates a LowerCaseTokenizer which converts all text to lowercase tokens. See LowerCaseTokenizer for more details.
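As a rough illustration of that default policy (plain Ruby, not Ferret's C implementation): split the text on non-letter characters and downcase each token.

```ruby
# Illustration only: approximates what the default Analyzer's
# LowerCaseTokenizer does -- extract runs of letters, downcase each.
def default_analysis(text)
  text.scan(/[[:alpha:]]+/).map(&:downcase)
end

default_analysis("The QUICK Brown-Fox!")  # => ["the", "quick", "brown", "fox"]
```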

Example

To create your own custom Analyzer you simply need to implement a token_stream method which takes the field name and the data to be tokenized as parameters and returns a TokenStream. Most analyzers typically ignore the field name.

Here we'll create a StemmingAnalyzer:

  class StemmingAnalyzer < Analyzer
    def token_stream(field, str)
      StemFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)))
    end
  end
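The wrapping pattern in that example can be sketched with hypothetical plain-Ruby stand-ins for the TokenStream interface (Ferret's real tokenizers and filters are implemented in C and behave differently in detail): each filter holds an inner stream and transforms the tokens it yields, and a consumer calls next until it returns nil.

```ruby
# Hypothetical stand-ins, NOT Ferret classes: they only mimic the
# next-returns-token-or-nil contract to show how filters chain.
class WhitespaceTokenizer
  def initialize(str)
    @tokens = str.split
  end

  def next
    @tokens.shift  # nil once the tokens are exhausted
  end
end

class DowncaseFilter
  def initialize(stream)
    @stream = stream
  end

  def next
    token = @stream.next
    token && token.downcase
  end
end

class SuffixStripFilter  # crude stand-in for a stemming filter
  def initialize(stream)
    @stream = stream
  end

  def next
    token = @stream.next
    token && token.sub(/(ing|s)\z/, "")
  end
end

stream = SuffixStripFilter.new(
  DowncaseFilter.new(WhitespaceTokenizer.new("Running DOGS run")))
tokens = []
while (token = stream.next)
  tokens << token
end
tokens  # => ["runn", "dog", "run"]
```

The outermost filter is the one the consumer sees; each call to next pulls a token up through the whole chain, which is why the example wraps the tokenizer innermost.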

Methods

new   token_stream  

Public Class methods

Create a new Analyzer, which downcases tokens by default but can optionally leave case as is. Lowercasing is done based on the current locale.

lower: set to false if you don't want the field's tokens to be downcased

Public Instance methods

Create a new TokenStream to tokenize input. The TokenStream created may also depend on the field_name, although this parameter is typically ignored.

field_name: name of the field to be tokenized
input: data from the field to be tokenized
