au.id.jericho.lib.html
Class TextExtractor

java.lang.Object
  extended by TextExtractor
All Implemented Interfaces:
CharStreamSource

public class TextExtractor
extends java.lang.Object
implements CharStreamSource

Extracts the textual content from HTML markup.

The output is ideal for feeding into a text search engine such as Apache Lucene, especially when the IncludeAttributes property has been set to true.

To obtain the output, call writeTo(Writer) or toString().

The process removes all of the tags and decodes the result, collapsing all white space. A space character is included in the output where a normal tag is present in the source, unless the tag belongs to an inline-level element. An exception to this is the BR element, which is also converted to a space despite being an inline-level element.

Text inside SCRIPT and STYLE elements contained within this segment is ignored.

Setting the ExcludeNonHTMLElements property results in the exclusion of any content within a non-HTML element.

See the excludeElement(StartTag) method for details on how to implement a more complex mechanism to determine whether the content of each Element is to be excluded from the output.

All tags that are not normal tags, such as server tags and comments, are removed from the output without adding whitespace.

Note that segments on which the Segment.ignoreWhenParsing() method has been called are treated as text rather than markup, resulting in their inclusion in the output. To remove specific segments before extracting the text, create an OutputDocument and call its remove(Segment) or replaceWithSpaces(int begin, int end) method for each segment to be removed. Then create a new source document using new Source(outputDocument.toString()) and perform the text extraction on this new source object.
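The removal procedure described above might be sketched as follows. This is a minimal illustration rather than a reference implementation. The class name RemoveBeforeExtract and the helper extractWithout are hypothetical, and the sketch assumes the 2.x API of this package, in particular Segment.findAllElements(String) and the OutputDocument(Source) constructor. It also assumes that the matched elements do not nest inside one another, since registering overlapping segments for removal may not be supported.

```java
import au.id.jericho.lib.html.Element;
import au.id.jericho.lib.html.OutputDocument;
import au.id.jericho.lib.html.Source;
import au.id.jericho.lib.html.TextExtractor;
import java.util.List;

public class RemoveBeforeExtract {
    /**
     * Extracts the text of the given markup after removing every element
     * with the specified name (hypothetical helper, for illustration only).
     */
    static String extractWithout(String html, String elementName) {
        Source source = new Source(html);
        OutputDocument outputDocument = new OutputDocument(source);

        // Register each matching element for removal.
        // Assumption: the matched elements do not nest inside each other.
        List elements = source.findAllElements(elementName);
        for (Object o : elements) {
            outputDocument.remove((Element) o);
        }

        // Re-parse the modified markup and extract the text from the new source.
        Source cleaned = new Source(outputDocument.toString());
        return new TextExtractor(cleaned).toString();
    }

    public static void main(String[] args) {
        System.out.println(extractWithout("<p>keep</p><form>drop</form>", "form"));
    }
}
```

Re-parsing costs a second pass over the document, so for simple per-element rules overriding excludeElement(StartTag) is likely the lighter option.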

Extracting the text from an entire Source object performs a full sequential parse automatically.

To perform a simple rendering of HTML markup into text, which is more readable than the output of this class, use the Renderer class instead.

Example:
Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".
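A sketch of how the example segment might be passed through this class is shown below. The class name BasicExtraction and its two helper methods are hypothetical. The sketch assumes that Source can be constructed directly from a string and relies on Source being a Segment. It shows both documented ways of obtaining the output and makes no claim about the exact text produced.

```java
import au.id.jericho.lib.html.Source;
import au.id.jericho.lib.html.TextExtractor;
import java.io.IOException;
import java.io.StringWriter;

public class BasicExtraction {
    /** Obtains the output in one step using toString(). */
    static String viaToString(String html) {
        return new TextExtractor(new Source(html)).toString();
    }

    /** Obtains the same output by streaming it to a Writer. */
    static String viaWriter(String html) {
        StringWriter writer = new StringWriter();
        try {
            new TextExtractor(new Source(html)).writeTo(writer);
        } catch (IOException e) {
            // A StringWriter performs no real I/O, so this is not expected to happen.
            throw new IllegalStateException(e);
        }
        return writer.toString();
    }

    public static void main(String[] args) {
        String html = "<div><b>O</b>ne</div><div title=\"Two\"><b>Th</b>"
                + "<script>//a script </script>ree</div>";
        System.out.println(viaToString(html));
        System.out.println(viaWriter(html));
    }
}
```

For large documents, writeTo(Writer) with a file or network writer avoids holding the whole output in memory as a single string.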


Constructor Summary
TextExtractor(Segment segment)
          Constructs a new TextExtractor based on the specified Segment.
 
Method Summary
 boolean excludeElement(StartTag startTag)
          Indicates whether the text inside the Element of the specified start tag should be excluded from the output.
 boolean getConvertNonBreakingSpaces()
          Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces.
 long getEstimatedMaximumOutputLength()
          Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.
 boolean getExcludeNonHTMLElements()
          Indicates whether the content of non-HTML elements is excluded from the output.
 boolean getIncludeAttributes()
          Indicates whether the values of title, alt, label, summary, and content attributes of normal tags are to be included in the output.
 TextExtractor setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces)
          Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces.
 TextExtractor setExcludeNonHTMLElements(boolean excludeNonHTMLElements)
          Sets whether the content of non-HTML elements is excluded from the output.
 TextExtractor setIncludeAttributes(boolean includeAttributes)
          Sets whether the values of title, alt, label, summary, and content attributes of normal tags are to be included in the output.
 java.lang.String toString()
          Returns the output as a string.
 void writeTo(java.io.Writer writer)
          Writes the output to the specified Writer.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TextExtractor

public TextExtractor(Segment segment)
Constructs a new TextExtractor based on the specified Segment.

Parameters:
segment - the segment from which the text will be extracted.
See Also:
Segment.getTextExtractor()

Method Detail

writeTo

public void writeTo(java.io.Writer writer)
             throws java.io.IOException
Description copied from interface: CharStreamSource
Writes the output to the specified Writer.

Specified by:
writeTo in interface CharStreamSource
Parameters:
writer - the destination java.io.Writer for the output.
Throws:
java.io.IOException - if an I/O exception occurs.

getEstimatedMaximumOutputLength

public long getEstimatedMaximumOutputLength()
Description copied from interface: CharStreamSource
Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.

The returned value should be used as a guide for efficiency purposes only, for example to set an initial StringBuffer capacity. There is no guarantee that the length of the output is indeed less than this value, as classes implementing this method often use assumptions based on typical usage to calculate the estimate.

Specified by:
getEstimatedMaximumOutputLength in interface CharStreamSource
Returns:
the estimated maximum number of characters in the output, or -1 if no estimate is available.
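The following sketch shows one way the estimate might be used to size a buffer while still tolerating a missing or inaccurate value. To keep it independent of the library, it uses a hypothetical local interface, StreamSource, which mirrors only the two CharStreamSource methods involved. The class name, the helper methods, and both capacity constants are likewise assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class EstimatedLengthSketch {
    /** Hypothetical local stand-in for the two CharStreamSource methods used here. */
    public interface StreamSource {
        void writeTo(Writer writer) throws IOException;
        long getEstimatedMaximumOutputLength();
    }

    private static final int DEFAULT_CAPACITY = 16;           // used when no estimate is available
    private static final int MAX_INITIAL_CAPACITY = 1 << 20;  // avoids huge up-front allocations

    /** Converts the estimate into a safe initial buffer capacity. */
    static int initialCapacity(long estimate) {
        if (estimate < 0) {
            return DEFAULT_CAPACITY; // -1 means no estimate is available
        }
        return (int) Math.min(estimate, MAX_INITIAL_CAPACITY);
    }

    /** Writes the source into a buffer sized from the estimate and returns the text. */
    static String render(StreamSource source) {
        StringWriter writer =
                new StringWriter(initialCapacity(source.getEstimatedMaximumOutputLength()));
        try {
            source.writeTo(writer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return writer.toString();
    }

    /** Builds a source with fixed text and a fixed estimate, for demonstration. */
    static StreamSource fixed(final String text, final long estimate) {
        return new StreamSource() {
            public void writeTo(Writer writer) throws IOException { writer.write(text); }
            public long getEstimatedMaximumOutputLength() { return estimate; }
        };
    }

    public static void main(String[] args) {
        System.out.println(render(fixed("hello", 3)));   // estimate too low, buffer grows
        System.out.println(render(fixed("hello", -1)));  // no estimate, default capacity
    }
}
```

Because the buffer grows on demand, an estimate that is too low affects only efficiency, never correctness.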

toString

public java.lang.String toString()
Description copied from interface: CharStreamSource
Returns the output as a string.

Specified by:
toString in interface CharStreamSource
Overrides:
toString in class java.lang.Object
Returns:
the output as a string.

setConvertNonBreakingSpaces

public TextExtractor setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces)
Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces.

The default value is true.

Parameters:
convertNonBreakingSpaces - specifies whether non-breaking space (&nbsp;) character entity references are converted to spaces.
Returns:
this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getConvertNonBreakingSpaces()

getConvertNonBreakingSpaces

public boolean getConvertNonBreakingSpaces()
Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces.

See the setConvertNonBreakingSpaces(boolean) method for a full description of this property.

Returns:
true if non-breaking space (&nbsp;) character entity references are converted to spaces, otherwise false.

setIncludeAttributes

public TextExtractor setIncludeAttributes(boolean includeAttributes)
Sets whether the values of title, alt, label, summary, and content attributes of normal tags are to be included in the output.

The value of a content attribute is included only if a name attribute is also present, because the content attribute of a META tag contains human-readable text only when it is paired with a name attribute rather than an http-equiv attribute.

The default value is false.

Parameters:
includeAttributes - specifies whether the attribute values are included in the output.
Returns:
this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getIncludeAttributes()

getIncludeAttributes

public boolean getIncludeAttributes()
Indicates whether the values of title, alt, label, summary, and content attributes of normal tags are to be included in the output.

See the setIncludeAttributes(boolean) method for a full description of this property.

Returns:
true if the attribute values are to be included in the output, otherwise false.

setExcludeNonHTMLElements

public TextExtractor setExcludeNonHTMLElements(boolean excludeNonHTMLElements)
Sets whether the content of non-HTML elements is excluded from the output.

The default value is false, meaning that content from all elements meeting the other criteria is included.

Parameters:
excludeNonHTMLElements - specifies whether the content of non-HTML elements is excluded from the output.
Returns:
this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
See Also:
getExcludeNonHTMLElements()
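Because each property setter returns the TextExtractor instance, the three properties can be configured in a single chained statement. The sketch below shows one possible configuration for feeding a search index. The class name ChainedSettings and the helper extractForIndexing are hypothetical, and constructing a Source directly from a string is assumed.

```java
import au.id.jericho.lib.html.Source;
import au.id.jericho.lib.html.TextExtractor;

public class ChainedSettings {
    /**
     * Extracts text with attribute values included and non-HTML elements
     * excluded (hypothetical helper, for illustration only).
     */
    static String extractForIndexing(String html) {
        return new TextExtractor(new Source(html))
                .setIncludeAttributes(true)        // include title, alt, label, summary, content
                .setExcludeNonHTMLElements(true)   // skip content of non-HTML elements
                .setConvertNonBreakingSpaces(true) // the documented default, stated explicitly
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(extractForIndexing("<img alt=\"logo\"><p>text</p>"));
    }
}
```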

getExcludeNonHTMLElements

public boolean getExcludeNonHTMLElements()
Indicates whether the content of non-HTML elements is excluded from the output.

See the setExcludeNonHTMLElements(boolean) method for a full description of this property.

Returns:
true if the content of non-HTML elements is excluded from the output, otherwise false.

excludeElement

public boolean excludeElement(StartTag startTag)
Indicates whether the text inside the Element of the specified start tag should be excluded from the output.

During the text extraction process, every start tag encountered in the segment is checked using this method to determine whether the text inside its associated element should be excluded from the output.

The default implementation always returns false, so that every element is included, but the method can be overridden in a subclass to perform a check of arbitrary complexity on each start tag.

All elements nested inside an excluded element are also implicitly excluded, as are all SCRIPT and STYLE elements. Such elements are skipped over without calling this method, so there is no way to include them by overriding the method.

Example:
To extract the text from a segment, excluding any text inside elements with the attribute class="NotIndexed":

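// Anonymous subclass that skips any element marked with class="NotIndexed".
TextExtractor textExtractor = new TextExtractor(segment) {
    public boolean excludeElement(StartTag startTag) {
        // Placing the literal first keeps the comparison safe if the
        // attribute value is null, for example when the class attribute is absent.
        return "NotIndexed".equalsIgnoreCase(startTag.getAttributeValue("class"));
    }
};
String extractedText = textExtractor.toString();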

Parameters:
startTag - the start tag of the element to check for inclusion.
Returns:
true if the text inside the Element of the specified start tag should be excluded from the output, otherwise false.