Class Ferret::Analysis::Token
In: ext/r_analysis.c
Parent: Object

Summary

A Token is an occurrence of a term from the text of a field. It consists of the term's text and the start and end offsets of the term in the text of the field.

The start and end offsets permit applications to re-associate a token with its source text, e.g. to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display.

Attributes

text:    the term's text, which may have been modified by a TokenFilter or Tokenizer from the text originally found in the document
start:   the position of the first character corresponding to this token in the source text
end:     one greater than the position of the last character corresponding to this token. Note that the difference between @end_offset and @start_offset may not be equal to @text.length(), as the term text may have been altered by a stemmer or some other filter.

Methods

<=>   end   end=   new   pos_inc   pos_inc=   start   start=   text   text=   to_s  

Included Modules

Comparable

Public Class methods

Creates a new Token, setting its text, its start and end offsets, and its position increment.

The position increment is usually set to 1 but you can set it to other values as needed. For example, if you have a stop word filter you will be skipping tokens. Let's say you have the stop words "the" and "and" and you parse the title "The Old Man and the Sea". The terms "Old", "Man" and "Sea" will have the position increments 2, 1 and 3 respectively.

Another reason you might want to vary the position increment is if you are adding synonyms to the index. For example, let's say you have the synonym group "quick", "fast" and "speedy". When tokenizing the phrase "Next day speedy delivery", you'll add "speedy" first with a position increment of 1 and then "fast" and "quick" with position increments of 0, since they occupy the same position.

The offset values start and end should be byte offsets, not character offsets. This makes it easy to use them to quickly access the token in the input string and to insert highlighting tags when necessary.

text:    the main text for the token
start:   the start offset of the token in bytes
end:     the end offset of the token in bytes
pos_inc: the position increment of the token (see above)
return:  a newly created and assigned Token object
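The stop-word increments described above can be sketched in plain Ruby. This is an illustrative stand-alone sketch, not Ferret's own filter code; the STOP_WORDS list and the tokens_with_increments helper are assumptions for the example:

```ruby
# Hypothetical sketch: compute [term, pos_inc] pairs the way a stop word
# filter would, skipping stop words and accumulating their increments.
STOP_WORDS = %w[the and].freeze

def tokens_with_increments(title)
  inc = 1
  result = []
  title.split.each do |word|
    if STOP_WORDS.include?(word.downcase)
      inc += 1 # a skipped stop word widens the gap to the next term
    else
      result << [word, inc]
      inc = 1
    end
  end
  result
end

tokens_with_increments("The Old Man and the Sea")
# => [["Old", 2], ["Man", 1], ["Sea", 3]]
```

These pairs match the increments 2, 1 and 3 given in the example above.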

Public Instance methods

Used to compare two tokens. Token includes Comparable, so you can also use +<+, +>+, +<=+, +>=+, etc. to compare tokens.

Tokens are sorted by the position in the text at which they occur, i.e. the start offset. If two tokens have the same start offset (see pos_inc=), they are sorted by end offset and then lexically by the token text.
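That ordering can be sketched with a minimal plain-Ruby class; SortableToken below is a hypothetical stand-in for Ferret::Analysis::Token, not the real class:

```ruby
# Minimal stand-in for a token, sketching the documented sort order:
# start offset first, then end offset, then token text.
class SortableToken
  include Comparable
  attr_reader :text, :start, :end_off

  def initialize(text, start, end_off)
    @text, @start, @end_off = text, start, end_off
  end

  def <=>(other)
    [start, end_off, text] <=> [other.start, other.end_off, other.text]
  end
end

a = SortableToken.new("man", 4, 7)
b = SortableToken.new("men", 4, 7)
c = SortableToken.new("old", 0, 3)
[a, b, c].sort.map(&:text)  # => ["old", "man", "men"]
```

Because Comparable is included, +<+, +>+ and friends come for free once +<=>+ is defined, which mirrors how Token itself gets those operators.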

End byte-position of this token

Set end byte-position of this token

Position Increment for this token

Set the position increment. This determines the position of this token relative to the previous Token in a TokenStream, used in phrase searching.

The default value is 1.

Some common uses for this are:

  • Set it to zero to put multiple terms in the same position. This is useful if, e.g., a word has multiple stems. Searches for phrases including either stem will match. In this case, all but the first stem's increment should be set to zero: the increment of the first instance should be one. Repeating a token with an increment of zero can also be used to boost the scores of matches on that token.
  • Set it to values greater than one to inhibit exact phrase matches. If, for example, one does not want phrases to match across removed stop words, then one could build a stop word filter that removes stop words and also sets the increment to the number of stop words removed before each non-stop word. Then exact phrase queries will only match when the terms occur with no intervening stop words.
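The zero-increment case can be sketched with the synonym example from above. Again this is plain illustrative Ruby, not Ferret's API; SYNONYMS and inject_synonyms are assumed names for the sketch:

```ruby
# Hypothetical sketch: emit [text, pos_inc] pairs, placing each synonym at
# the same position as its source term via a zero position increment.
SYNONYMS = { "speedy" => %w[fast quick] }.freeze

def inject_synonyms(terms)
  terms.flat_map do |term|
    [[term, 1]] + (SYNONYMS[term] || []).map { |syn| [syn, 0] }
  end
end

inject_synonyms(%w[Next day speedy delivery])
# => [["Next", 1], ["day", 1], ["speedy", 1], ["fast", 0], ["quick", 0],
#     ["delivery", 1]]
```

With "fast" and "quick" at the same position as "speedy", a phrase query for any of the three will match at that position.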

Start byte-position of this token

Set start byte-position of this token

Returns the text that this token represents

Set the text for this token.

Return a string representation of the token
