Module Ferret::Analysis
In: ext/r_analysis.c

Summary

The Analysis module contains all the classes used to analyze and tokenize the data to be indexed. There are three main classes you need to know about when dealing with analysis: Analyzer, TokenStream, and Token.

Classes

Analyzer

Analyzers handle all of your tokenizing needs. You pass an Analyzer to the indexing class when you create it, and it will create the TokenStreams necessary to tokenize the fields in the documents. Most of the time you won't need to worry about TokenStreams and Tokens; one of the Analyzers distributed with Ferret will do exactly what you need. Otherwise you'll need to implement a custom analyzer.

TokenStream

A TokenStream is an enumeration of Tokens. There are two standard types of TokenStream: Tokenizer and TokenFilter. A Tokenizer takes a String and turns it into a list of Tokens. A TokenFilter takes another TokenStream and post-processes its Tokens. You can chain as many TokenFilters together as you like, but the innermost stream in the chain must always be a Tokenizer, since it is the only stream that reads the source String.
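The chaining described above can be sketched in plain Ruby. This is only an illustration of the pattern, not Ferret's implementation: SketchToken, SketchWhiteSpaceTokenizer, SketchLowerCaseFilter, SketchStopFilter and drain are hypothetical names invented for this example, and the only assumption made about a stream is that it answers next with a token, or nil when it is exhausted.

```ruby
# Hypothetical stand-ins for illustration only; these are NOT Ferret classes.
SketchToken = Struct.new(:text, :start, :end)

# Tokenizer: reads the source String and emits one token per run of
# non-whitespace characters, recording where each token was found.
class SketchWhiteSpaceTokenizer
  def initialize(text)
    @text = text
    @pos = 0
  end

  # Returns the next token, or nil once the input is exhausted.
  def next
    match = /\S+/.match(@text, @pos)
    return nil unless match
    @pos = match.end(0)
    SketchToken.new(match[0], match.begin(0), match.end(0))
  end
end

# TokenFilter: wraps another stream and lowercases each token's text.
# The offsets are passed through unchanged.
class SketchLowerCaseFilter
  def initialize(stream)
    @stream = stream
  end

  def next
    token = @stream.next
    return nil unless token
    SketchToken.new(token.text.downcase, token.start, token.end)
  end
end

# TokenFilter: wraps another stream and drops tokens found in stop_words.
class SketchStopFilter
  def initialize(stream, stop_words)
    @stream = stream
    @stop_words = stop_words
  end

  def next
    while (token = @stream.next)
      return token unless @stop_words.include?(token.text)
    end
    nil
  end
end

# Helper that pulls every remaining token out of a stream.
def drain(stream)
  tokens = []
  while (token = stream.next)
    tokens << token
  end
  tokens
end

# Filters wrap filters, and the innermost stream is the tokenizer.
tokenizer = SketchWhiteSpaceTokenizer.new("The Quick brown Fox")
stream = SketchStopFilter.new(SketchLowerCaseFilter.new(tokenizer), ["the"])
texts = drain(stream).map(&:text)
```

Because each filter only calls next on the stream it wraps, filters can be stacked in any order and to any depth. Here the stop filter sits outside the lowercase filter so that "The" is lowercased before it is compared with the stop word list.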

Token

A Token is a single term from a document field. A token contains the text representing the term as well as the start and end offsets of the token. The start and end offsets represent the token as it appears in the source field. Some TokenFilters may change the text in the Token, but the start and end offsets should stay the same, so (end - start) won't necessarily be equal to the length of the text in the token. For example, with a stemming TokenFilter, the term "Beginning" might have start and end offsets of 10 and 19 respectively ("Beginning".length == 9), but Token#text might be "begin" (after stemming).
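The relationship between offsets and text can be checked with a small plain-Ruby sketch. OffsetToken is a hypothetical struct invented for this illustration, not Ferret's Token class, and the stemmed text "begin" is hard-coded to match the example above rather than produced by a real stemmer.

```ruby
# Hypothetical stand-in for illustration only; this is NOT Ferret's Token.
OffsetToken = Struct.new(:text, :start, :end)

source = "Chapter 1 Beginning of the index"

# Offsets always describe the term as it appears in the source field.
term_start = source.index("Beginning")
term_end = term_start + "Beginning".length

# The text is what a stemming filter might leave behind (hard-coded here).
token = OffsetToken.new("begin", term_start, term_end)

span_length = token.end - token.start            # length of the original term
text_length = token.text.length                  # length of the stemmed text
original_term = source[token.start...token.end]  # recovered from the offsets
```

Keeping the original offsets is what lets the source term be recovered from the field by slicing, even though the stored text has changed.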

Classes and Modules

Class Ferret::Analysis::Analyzer
Class Ferret::Analysis::AsciiLetterAnalyzer
Class Ferret::Analysis::AsciiLetterTokenizer
Class Ferret::Analysis::AsciiLowerCaseFilter
Class Ferret::Analysis::AsciiStandardAnalyzer
Class Ferret::Analysis::AsciiStandardTokenizer
Class Ferret::Analysis::AsciiWhiteSpaceAnalyzer
Class Ferret::Analysis::AsciiWhiteSpaceTokenizer
Class Ferret::Analysis::HyphenFilter
Class Ferret::Analysis::LetterAnalyzer
Class Ferret::Analysis::LetterTokenizer
Class Ferret::Analysis::LowerCaseFilter
Class Ferret::Analysis::PerFieldAnalyzer
Class Ferret::Analysis::RegExpAnalyzer
Class Ferret::Analysis::RegExpTokenizer
Class Ferret::Analysis::StandardAnalyzer
Class Ferret::Analysis::StandardTokenizer
Class Ferret::Analysis::StemFilter
Class Ferret::Analysis::StopFilter
Class Ferret::Analysis::Token
Class Ferret::Analysis::TokenStream
Class Ferret::Analysis::WhiteSpaceAnalyzer
Class Ferret::Analysis::WhiteSpaceTokenizer

Constants

ENGLISH_STOP_WORDS = get_rstopwords(ENGLISH_STOP_WORDS)
FULL_ENGLISH_STOP_WORDS = get_rstopwords(FULL_ENGLISH_STOP_WORDS)
EXTENDED_ENGLISH_STOP_WORDS = get_rstopwords(EXTENDED_ENGLISH_STOP_WORDS)
FULL_FRENCH_STOP_WORDS = get_rstopwords(FULL_FRENCH_STOP_WORDS)
FULL_SPANISH_STOP_WORDS = get_rstopwords(FULL_SPANISH_STOP_WORDS)
FULL_PORTUGUESE_STOP_WORDS = get_rstopwords(FULL_PORTUGUESE_STOP_WORDS)
FULL_ITALIAN_STOP_WORDS = get_rstopwords(FULL_ITALIAN_STOP_WORDS)
FULL_GERMAN_STOP_WORDS = get_rstopwords(FULL_GERMAN_STOP_WORDS)
FULL_DUTCH_STOP_WORDS = get_rstopwords(FULL_DUTCH_STOP_WORDS)
FULL_SWEDISH_STOP_WORDS = get_rstopwords(FULL_SWEDISH_STOP_WORDS)
FULL_NORWEGIAN_STOP_WORDS = get_rstopwords(FULL_NORWEGIAN_STOP_WORDS)
FULL_DANISH_STOP_WORDS = get_rstopwords(FULL_DANISH_STOP_WORDS)
FULL_RUSSIAN_STOP_WORDS = get_rstopwords(FULL_RUSSIAN_STOP_WORDS)
FULL_FINNISH_STOP_WORDS = get_rstopwords(FULL_FINNISH_STOP_WORDS)
