org.exist.storage
Class NativeTextEngine

java.lang.Object
  extended by java.util.Observable
      extended by org.exist.storage.TextSearchEngine
          extended by org.exist.storage.NativeTextEngine
All Implemented Interfaces:
ContentLoadingObserver

public class NativeTextEngine
extends TextSearchEngine
implements ContentLoadingObserver

This class is responsible for fulltext indexing. Text nodes are handed over to this class to be indexed. Method storeText() is called by RelationalBroker whenever it finds a TextNode. Method getNodeIDsContaining() is used by the XPath engine to process queries that involve a fulltext operator. The class keeps two database tables: table dbTokens stores each word found together with its unique id; table invertedIndex stores, per document, the occurrences of each word id.

TODO: store node type (attribute or text) with each entry
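
The two-table layout described above can be illustrated with a small in-memory sketch. This is not the NativeTextEngine implementation: the class name InvertedIndexSketch, its methods, and the use of plain integer and long ids are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not taken from the eXist sources.
class InvertedIndexSketch {
    // word -> unique word id (plays the role of the dbTokens table)
    private final Map<String, Integer> tokenIds = new HashMap<>();
    // word id -> (document id -> ids of nodes containing the word),
    // playing the role of the invertedIndex table
    private final Map<Integer, Map<Integer, List<Long>>> occurrences = new HashMap<>();

    /** Returns the id of a word, assigning the next free id on first sight. */
    int idFor(String word) {
        Integer id = tokenIds.get(word);
        if (id == null) {
            id = tokenIds.size();
            tokenIds.put(word, id);
        }
        return id;
    }

    /** Splits the text into lower-cased words and records one occurrence per word. */
    void add(int docId, long nodeId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            occurrences
                .computeIfAbsent(idFor(word), k -> new TreeMap<>())
                .computeIfAbsent(docId, k -> new ArrayList<>())
                .add(nodeId);
        }
    }

    /** Returns the node ids recorded for a word in one document (empty if none). */
    List<Long> lookup(String word, int docId) {
        Integer id = tokenIds.get(word.toLowerCase(Locale.ROOT));
        if (id == null) {
            return Collections.emptyList();
        }
        Map<Integer, List<Long>> perDoc = occurrences.get(id);
        if (perDoc == null) {
            return Collections.emptyList();
        }
        List<Long> nodes = perDoc.get(docId);
        return nodes == null ? Collections.<Long>emptyList() : nodes;
    }
}
```

Keeping the word-to-id mapping separate from the occurrence lists means each word string is stored once, and the per-document lists only carry numeric ids.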

Author:
Wolfgang Meier

Field Summary
static byte ATTRIBUTE_SECTION
           
static int MAX_TOKEN_LENGTH
          Length limit for the tokens
static byte TEXT_SECTION
           
 
Fields inherited from class org.exist.storage.TextSearchEngine
PROPERTY_INDEX_NUMBERS, PROPERTY_STEM, PROPERTY_STORE_TERM_FREQUENCY, PROPERTY_TOKENIZER
 
Constructor Summary
NativeTextEngine(DBBroker broker, Configuration config, BFile db)
           
 
Method Summary
 boolean close()
           
static boolean containsWildcards(java.lang.String str)
          Checks if the given string could be a regular expression.
 void dropIndex(Collection collection)
          Drop all index entries for the given collection.
 void dropIndex(DocumentImpl document)
          Drop all index entries for the given document.
 void endElement(int xpathType, ElementImpl node, java.lang.String content)
          store and index given element (called storeElement before)
 void flush()
          writes the pending items, for the current document's collection
 java.lang.String[] getIndexTerms(DocumentSet docs, TermMatcher matcher)
           
 NodeSet getNodes(XQueryContext context, DocumentSet docs, NodeSet contextSet, TermMatcher matcher, java.lang.CharSequence startTerm)
           
 NodeSet getNodesContaining(XQueryContext context, DocumentSet docs, NodeSet contextSet, java.lang.String expr, int type, boolean matchAll)
          For each of the given search terms and each of the documents in the document set, return a node-set of matching nodes.
 NodeSet getNodesExact(XQueryContext context, DocumentSet docs, NodeSet contextSet, java.lang.String expr)
          Get all nodes whose content exactly matches the given expression.
 int getTrackMatches()
           
 void printStatistics()
           
 void reindex(DocumentImpl document, StoredNode node)
          Reindexes all pending items for the specified document.
 void remove()
          remove all pending modifications, for the current document.
 void removeElement(ElementImpl node, NodePath currentPath, java.lang.String content)
          Mark given Element for removal; added entries are written to the list of pending entries.
 Occurrences[] scanIndexTerms(DocumentSet docs, NodeSet contextSet, java.lang.String start, java.lang.String end)
          Queries the fulltext index to retrieve information on indexed words contained in the index for the current collection.
 void setDocument(DocumentImpl document)
          set the current document; generally called before calling an operation
 void setTrackMatches(int flags)
           
 void startElement(ElementImpl impl, NodePath currentPath, boolean index)
          corresponds to SAX function of the same name
static boolean startsWithWildcard(java.lang.String str)
           
 void storeAttribute(AttrImpl node, NodePath currentPath, boolean fullTextIndexSwitch)
          store and index given attribute
 void storeAttribute(FulltextIndexSpec indexSpec, AttrImpl attr)
          Indexes the tokens contained in an attribute.
 void storeAttribute(RangeIndexSpec spec, AttrImpl node)
           
 void storeText(FulltextIndexSpec indexSpec, StoredNode parent, java.lang.String text)
           
 void storeText(FulltextIndexSpec indexSpec, TextImpl text, boolean noTokenizing)
          Indexes the tokens contained in a text node.
 void storeText(TextImpl node, NodePath currentPath, boolean fullTextIndexSwitch)
          store and index given text node
 void sync()
          triggers a cache sync, i.e. forces all cached pages to be written out.
 java.lang.String toString()
           
 
Methods inherited from class org.exist.storage.TextSearchEngine
getNodesContaining, getTokenizer
 
Methods inherited from class java.util.Observable
addObserver, countObservers, deleteObserver, deleteObservers, hasChanged, notifyObservers, notifyObservers
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

TEXT_SECTION

public static final byte TEXT_SECTION
See Also:
Constant Field Values

ATTRIBUTE_SECTION

public static final byte ATTRIBUTE_SECTION
See Also:
Constant Field Values

MAX_TOKEN_LENGTH

public static final int MAX_TOKEN_LENGTH
Length limit for the tokens

See Also:
Constant Field Values
Constructor Detail

NativeTextEngine

public NativeTextEngine(DBBroker broker,
                        Configuration config,
                        BFile db)
Method Detail

containsWildcards

public static final boolean containsWildcards(java.lang.String str)
Checks if the given string could be a regular expression.

Parameters:
str - The string
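
As a rough idea of what such a check might look like, the sketch below scans for unescaped wildcard characters. The page does not say which characters count as wildcards, so the set used here ('*', '?', '[') and the backslash escape are assumptions, and WildcardCheckSketch is a hypothetical name.

```java
// Hypothetical illustration only; the real character set is not documented here.
class WildcardCheckSketch {
    /** Returns true if str holds an unescaped '*', '?' or '['. */
    static boolean containsWildcards(String str) {
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c == '\\') {
                i++; // skip the character that follows the escape
                continue;
            }
            if (c == '*' || c == '?' || c == '[') {
                return true;
            }
        }
        return false;
    }
}
```

A caller could use such a test to decide between a plain term lookup and a more expensive pattern match.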

startsWithWildcard

public static final boolean startsWithWildcard(java.lang.String str)

getTrackMatches

public int getTrackMatches()
Overrides:
getTrackMatches in class TextSearchEngine

setTrackMatches

public void setTrackMatches(int flags)
Overrides:
setTrackMatches in class TextSearchEngine

setDocument

public void setDocument(DocumentImpl document)
Description copied from interface: ContentLoadingObserver
set the current document; generally called before calling an operation

Specified by:
setDocument in interface ContentLoadingObserver

storeAttribute

public void storeAttribute(FulltextIndexSpec indexSpec,
                           AttrImpl attr)
Indexes the tokens contained in an attribute.

Specified by:
storeAttribute in class TextSearchEngine
Parameters:
attr - The attribute to be indexed
indexSpec - The index configuration

storeText

public void storeText(FulltextIndexSpec indexSpec,
                      TextImpl text,
                      boolean noTokenizing)
Indexes the tokens contained in a text node.

Specified by:
storeText in class TextSearchEngine
Parameters:
indexSpec - The index configuration
text - The text node to be indexed
noTokenizing - if true, the given text is indexed as a single token; if false, it is tokenized before being indexed
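
The effect of the noTokenizing switch can be sketched as follows. Whitespace splitting stands in for the configured tokenizer, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; whitespace splitting is an assumption.
class TokenizeSwitchSketch {
    /** Returns the tokens that would be indexed for the given text. */
    static List<String> tokensFor(String text, boolean noTokenizing) {
        if (noTokenizing) {
            // the whole text becomes one index entry
            return Collections.singletonList(text);
        }
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

With noTokenizing set, a value such as "New York City" stays one entry instead of becoming three.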

storeText

public void storeText(FulltextIndexSpec indexSpec,
                      StoredNode parent,
                      java.lang.String text)
Specified by:
storeText in class TextSearchEngine

storeAttribute

public void storeAttribute(RangeIndexSpec spec,
                           AttrImpl node)

storeAttribute

public void storeAttribute(AttrImpl node,
                           NodePath currentPath,
                           boolean fullTextIndexSwitch)
Description copied from interface: ContentLoadingObserver
store and index given attribute

Specified by:
storeAttribute in interface ContentLoadingObserver

storeText

public void storeText(TextImpl node,
                      NodePath currentPath,
                      boolean fullTextIndexSwitch)
Description copied from interface: ContentLoadingObserver
store and index given text node

Specified by:
storeText in interface ContentLoadingObserver

startElement

public void startElement(ElementImpl impl,
                         NodePath currentPath,
                         boolean index)
Description copied from interface: ContentLoadingObserver
corresponds to SAX function of the same name

Specified by:
startElement in interface ContentLoadingObserver

endElement

public void endElement(int xpathType,
                       ElementImpl node,
                       java.lang.String content)
Description copied from interface: ContentLoadingObserver
store and index given element (called storeElement before)

Specified by:
endElement in interface ContentLoadingObserver

removeElement

public void removeElement(ElementImpl node,
                          NodePath currentPath,
                          java.lang.String content)
Description copied from interface: ContentLoadingObserver
Mark given Element for removal; added entries are written to the list of pending entries. ContentLoadingObserver.flush() is called later to flush all pending entries.
Notes: changed name from storeElement()

Specified by:
removeElement in interface ContentLoadingObserver
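
The pending-entries pattern shared by removeElement(), flush() and remove() can be sketched like this. The string entries, the return value of flush(), and all names are assumptions made for illustration; the real class works on index entries, not strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the eXist sources.
class PendingEntriesSketch {
    private final List<String> pending = new ArrayList<>();
    private final List<String> written = new ArrayList<>();

    /** Queues an entry; nothing is written until flush() is called. */
    void markForRemoval(String entry) {
        pending.add(entry);
    }

    /** Writes all pending entries and returns how many were written. */
    int flush() {
        int count = pending.size();
        written.addAll(pending);
        pending.clear();
        return count;
    }

    /** Discards all pending entries without writing them. */
    void remove() {
        pending.clear();
    }

    int pendingCount() {
        return pending.size();
    }

    List<String> written() {
        return written;
    }
}
```

Deferring the write lets many small modifications be applied in one pass.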

sync

public void sync()
Description copied from interface: ContentLoadingObserver
triggers a cache sync, i.e. forces all cached pages to be written out. sync() is called from time to time by the background sync daemon.

Specified by:
sync in interface ContentLoadingObserver

flush

public void flush()
Description copied from interface: ContentLoadingObserver
writes the pending items, for the current document's collection

Specified by:
flush in interface ContentLoadingObserver
Specified by:
flush in class TextSearchEngine

reindex

public void reindex(DocumentImpl document,
                    StoredNode node)
Description copied from interface: ContentLoadingObserver
Reindexes all pending items for the specified document.

Specified by:
reindex in interface ContentLoadingObserver
Specified by:
reindex in class TextSearchEngine
Parameters:
document -
node -

remove

public void remove()
Description copied from interface: ContentLoadingObserver
remove all pending modifications, for the current document.

Specified by:
remove in interface ContentLoadingObserver

dropIndex

public void dropIndex(Collection collection)
Description copied from interface: ContentLoadingObserver
Drop all index entries for the given collection.

Specified by:
dropIndex in interface ContentLoadingObserver
Specified by:
dropIndex in class TextSearchEngine
Parameters:
collection - The collection whose index entries are dropped

dropIndex

public void dropIndex(DocumentImpl document)
Description copied from interface: ContentLoadingObserver
Drop all index entries for the given document.

Specified by:
dropIndex in interface ContentLoadingObserver
Specified by:
dropIndex in class TextSearchEngine
Parameters:
document - The document whose index entries are dropped

getNodesContaining

public NodeSet getNodesContaining(XQueryContext context,
                                  DocumentSet docs,
                                  NodeSet contextSet,
                                  java.lang.String expr,
                                  int type,
                                  boolean matchAll)
                           throws TerminatedException
Description copied from class: TextSearchEngine
For each of the given search terms and each of the documents in the document set, return a node-set of matching nodes. The type-argument indicates if search terms should be compared using a regular expression. Valid values are DBBroker.MATCH_EXACT or DBBroker.MATCH_REGEXP.

Specified by:
getNodesContaining in class TextSearchEngine
Throws:
TerminatedException
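
The difference between the two match types can be sketched as below. The enum stands in for DBBroker.MATCH_EXACT and DBBroker.MATCH_REGEXP, whose values are not shown on this page. Reading matchAll as "every term must match" versus "any term may match" is an assumption, as are all names.

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from the eXist sources.
class TermMatchSketch {
    enum MatchType { EXACT, REGEXP }

    /** Compares one indexed token with one search term. */
    static boolean termMatches(String token, String term, MatchType type) {
        return type == MatchType.REGEXP
            ? Pattern.matches(term, token)
            : token.equals(term);
    }

    /** Decides whether a node with the given tokens satisfies the search terms. */
    static boolean nodeMatches(Set<String> nodeTokens, List<String> terms,
                               MatchType type, boolean matchAll) {
        for (String term : terms) {
            boolean found = false;
            for (String token : nodeTokens) {
                if (termMatches(token, term, type)) {
                    found = true;
                    break;
                }
            }
            if (matchAll && !found) {
                return false; // one missing term is enough to fail
            }
            if (!matchAll && found) {
                return true; // one matching term is enough to succeed
            }
        }
        return matchAll;
    }
}
```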

getNodesExact

public NodeSet getNodesExact(XQueryContext context,
                             DocumentSet docs,
                             NodeSet contextSet,
                             java.lang.String expr)
                      throws TerminatedException
Get all nodes whose content exactly matches the given expression.

Throws:
TerminatedException

getNodes

public NodeSet getNodes(XQueryContext context,
                        DocumentSet docs,
                        NodeSet contextSet,
                        TermMatcher matcher,
                        java.lang.CharSequence startTerm)
                 throws TerminatedException
Specified by:
getNodes in class TextSearchEngine
Throws:
TerminatedException

getIndexTerms

public java.lang.String[] getIndexTerms(DocumentSet docs,
                                        TermMatcher matcher)
Specified by:
getIndexTerms in class TextSearchEngine

scanIndexTerms

public Occurrences[] scanIndexTerms(DocumentSet docs,
                                    NodeSet contextSet,
                                    java.lang.String start,
                                    java.lang.String end)
                             throws PermissionDeniedException
Description copied from class: TextSearchEngine
Queries the fulltext index to retrieve information on the indexed words for the current collection. Returns a list of Occurrences for all matching words in the index. If end is null, all words starting with the string start are returned. Otherwise, the method returns all words that come after start and before end in lexical order.

Specified by:
scanIndexTerms in class TextSearchEngine
Throws:
PermissionDeniedException
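
The start/end semantics described above can be sketched over a sorted map of terms. The map of term counts stands in for the index, both bounds are treated as inclusive (the description does not say), and start is expected not to sort after end. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

// Hypothetical illustration only; inclusive bounds are an assumption.
class ScanTermsSketch {
    /** Returns index terms by prefix (end == null) or by lexical range. */
    static List<String> scan(NavigableMap<String, Integer> index, String start, String end) {
        List<String> result = new ArrayList<>();
        if (end == null) {
            // prefix scan: walk forward from start until the prefix no longer matches
            for (String term : index.tailMap(start, true).keySet()) {
                if (!term.startsWith(start)) {
                    break;
                }
                result.add(term);
            }
        } else {
            // range scan: every term between start and end in lexical order
            result.addAll(index.subMap(start, true, end, true).keySet());
        }
        return result;
    }
}
```

Because the terms are kept in sorted order, both scans touch only the matching slice of the index.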

close

public boolean close()
              throws DBException
Specified by:
close in class TextSearchEngine
Throws:
DBException

printStatistics

public void printStatistics()

toString

public java.lang.String toString()


Copyright (C) Wolfgang Meier. All rights reserved.