Class Scrubyt::Pattern
In: lib/scrubyt/core/scraping/pattern.rb
Parent: Object

Groups several filters into one.

Serves as an umbrella for filters which conceptually extract the same thing, for example a price or a title or …

Sometimes the same piece of information cannot be extracted with a single filter across several result instances. For example, a price has one XPath in record n, but record n+1 also has a discount price, so the real price is pushed to a different XPath. In this case, the filters which extract the same thing are held in the same pattern.
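
The idea of several filters living under one pattern can be sketched as follows. This is a minimal illustration, not scRUBYt!'s actual implementation: the SketchPattern class, its add_filter and evaluate methods, the XPath strings, and the "first filter that yields a value wins" policy are all assumptions made for the example. Records are modeled as plain hashes keyed by XPath.

```ruby
# Hypothetical stand-in for a pattern that holds several filters.
# None of these names come from scRUBYt! itself.
class SketchPattern
  attr_reader :name, :filters

  def initialize(name)
    @name = name
    @filters = []
  end

  # A filter here is any block that takes a record and returns a value or nil.
  def add_filter(&block)
    @filters << block
    self
  end

  # Try the filters in order and return the first non-nil result
  # (an assumed policy, chosen only to keep the sketch short).
  def evaluate(record)
    @filters.each do |filter|
      result = filter.call(record)
      return result unless result.nil?
    end
    nil
  end
end

price = SketchPattern.new(:price)
price.add_filter { |record| record['//td[2]/b'] } # layout of record n
price.add_filter { |record| record['//td[3]/b'] } # layout of record n+1

record_n  = { '//td[2]/b' => '$10' }
record_n1 = { '//td[3]/b' => '$8' }

puts price.evaluate(record_n)   # prints $10
puts price.evaluate(record_n1)  # prints $8
```

Both records yield a price through the same pattern even though each one is matched by a different filter.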

Methods

Constants

VALID_PATTERN_TYPES = [:tree, :attribute, :regexp, :detail_page, :download, :html_subtree, :constant, :script, :text]
    # a root pattern represents a (surprise!) root pattern
    PATTERN_TYPE_ROOT = :PATTERN_TYPE_ROOT
    # a tree pattern represents a HTML region
    PATTERN_TYPE_TREE = :PATTERN_TYPE_TREE
    # represents an attribute of the node extracted by the parent pattern
    PATTERN_TYPE_ATTRIBUTE = :PATTERN_TYPE_ATTRIBUTE
    # represents a pattern which filters its output with a regexp
    PATTERN_TYPE_REGEXP = :PATTERN_TYPE_REGEXP
    # represents a pattern which crawls to the detail page and extracts information from there
    PATTERN_TYPE_DETAIL_PAGE = :PATTERN_TYPE_DETAIL_PAGE
    # represents a download pattern
    PATTERN_TYPE_DOWNLOAD = :PATTERN_TYPE_DOWNLOAD
    # write out the HTML subtree beginning at the matched element
    PATTERN_TYPE_HTML_SUBTREE = :PATTERN_TYPE_HTML_SUBTREE
VALID_PATTERN_EXAMPLE_TYPES = [:determine, :xpath]
    # :determine - default value; indicates that the example type still needs to be determined
    # :string - represents a node with example type EXAMPLE_TYPE_STRING
VALID_OUTPUT_TYPES = [:model, :temp, :page_list]   Model patterns are shown in the output
    OUTPUT_TYPE_MODEL = :OUTPUT_TYPE_MODEL
    #Temp patterns are skipped in the output (their ancestors are appended to the parent
    #of the pattern which was skipped)
    OUTPUT_TYPE_TEMP = :OUTPUT_TYPE_TEMP
PATTERN_OPTIONS = [:generalize, :type, :output_type, :references, :limit, :default, :resolve, :except, :example_type]   These options can be set upon wrapper creation
VALID_OPTIONS = PATTERN_OPTIONS + Scrubyt::CompoundExample::DESCRIPTORS + Scrubyt::ResultNode::OUTPUT_OPTIONS

Attributes

children  [RW] 
constraints  [RW] 
extractor  [RW] 
filters  [RW] 
indices_to_extract  [RW] 
modifier_calls  [RW] 
name  [RW] 
next_page_url  [R] 
options  [RW] 
parent  [RW] 
referenced_extractor  [RW] 
referenced_pattern  [RW] 
result_indexer  [R] 

Public Class methods

Public Instance methods

Check whether the currently created pattern is a detail pattern (i.e. it references a subextractor). Also check whether the currently created pattern is an ancestor of a detail pattern, and if so, store this in a hash (to be able to traverse the pattern structure on detail pages as well).

Shortcut patterns, as their name says, are a shortcut for creating patterns from predefined rules; for example:

  detail_url

  is equivalent to

  detail_url 'href', :type => :attribute

i.e. the system figures out on its own that, because of the postfix, the example should be looked up (but it should never override the user input!). Another example (will be available later):

 every_img

 is equivalent to

 every_img '//img'
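
The postfix lookup described above could work roughly like this. It is a sketch under stated assumptions, not the library's code: the ShortcutSketch module, its RULES table, and the expand method are hypothetical names invented for illustration. Only the detail_url example and the rule that user input is never overridden come from the text.

```ruby
# Hypothetical illustration of shortcut expansion by name postfix.
module ShortcutSketch
  # Assumed rule table: a name postfix maps to a default example and options.
  RULES = {
    '_url' => { :example => 'href', :options => { :type => :attribute } }
  }.freeze

  # Returns [example, options] for a pattern name.
  # Anything the user passed explicitly takes precedence over the rule.
  def self.expand(name, example = nil, options = {})
    postfix = RULES.keys.find { |key| name.to_s.end_with?(key) }
    return [example, options] if postfix.nil?

    rule = RULES[postfix]
    [example || rule[:example], rule[:options].merge(options)]
  end
end

p ShortcutSketch.expand(:detail_url)
p ShortcutSketch.expand(:detail_url, 'src', { :type => :tree })
p ShortcutSketch.expand(:title)
```

The first call fills in 'href' and :type => :attribute from the rule, the second keeps the user's 'src' and :type => :tree untouched, and the third matches no postfix, so nothing is added.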

Dispatcher function. The class was already too big, so I decided to factor out some methods based on their functionality (like output and adding constraints) into utility classes.

The second function, besides dispatching, is to look up the results in an evaluated wrapper, for example:

 camera_data.item[1].item_name[0]
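
A lookup chain like the one above is commonly built on Ruby's method_missing. The following is a minimal sketch of that technique, assuming a hypothetical SketchResult class with name, value, and children. It does not reproduce how scRUBYt! stores or returns its results.

```ruby
# Hypothetical result node; child results are looked up by calling
# a method with the child's name.
class SketchResult
  attr_reader :name, :value, :children

  def initialize(name, value = nil, children = [])
    @name = name
    @value = value
    @children = children
  end

  # Any unknown method name is treated as a child name; all children
  # with that name are returned as an array, so they can be indexed.
  def method_missing(method_name, *args)
    matches = @children.select { |child| child.name == method_name }
    return super if matches.empty?

    matches
  end

  def respond_to_missing?(method_name, include_private = false)
    @children.any? { |child| child.name == method_name } || super
  end
end

camera_data = SketchResult.new(:camera_data, nil, [
  SketchResult.new(:item, nil, [SketchResult.new(:item_name, 'Camera A')]),
  SketchResult.new(:item, nil, [SketchResult.new(:item_name, 'Camera B')])
])

puts camera_data.item[1].item_name[0].value  # prints Camera B
```

Names that match no child fall through to super and raise NoMethodError as usual, which keeps typos from silently returning empty results.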
