Class | Scrubyt::Pattern |
In: | lib/scrubyt/core/scraping/pattern.rb |
Parent: | Object |
Serves as an umbrella for filters that conceptually extract the same thing, for example a price or a title or …
Sometimes the same piece of information cannot be extracted with a single filter across several result instances. For example, a price has one XPath in record n, but because record n+1 also has a discount price, the real price is pushed to a different XPath. In this case, the multiple filters that extract the same thing are held in the same pattern.
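The idea of several filters sharing one pattern can be sketched as follows. This is only an illustration under stated assumptions, not scrubyt's implementation: `SketchPattern`, the block-based filters, and the hash records keyed by XPath-like strings are all hypothetical stand-ins for real filters and HTML documents.

```ruby
# Hypothetical stand-in for a pattern: it holds several filters that all
# try to extract the same piece of information (here, a price).
class SketchPattern
  attr_reader :name, :filters

  def initialize(name)
    @name = name
    @filters = []
  end

  # A filter is modelled as a block that returns a value, or nil on no match.
  def add_filter(&block)
    @filters << block
    self
  end

  # Collect the results of every filter that matches the given record.
  def evaluate(record)
    @filters.map { |filter| filter.call(record) }.compact
  end
end

price = SketchPattern.new(:price)
price.add_filter { |rec| rec['/tr/td[2]'] } # where the price sits in record n
price.add_filter { |rec| rec['/tr/td[3]'] } # where it sits in record n+1

# Made-up records: each maps an XPath-like key to the text found there.
record_n      = { '/tr/td[2]' => '$10' }
record_n_plus = { '/tr/td[3]' => '$8' }

results = [record_n, record_n_plus].map { |rec| price.evaluate(rec) }
```

Each record is matched by a different filter, yet both results end up under the same pattern name.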
VALID_PATTERN_TYPES | = | [:tree, :attribute, :regexp, :detail_page, :download, :html_subtree] |
# a root pattern represents a (surprise!) root pattern
PATTERN_TYPE_ROOT = :PATTERN_TYPE_ROOT
# a tree pattern represents a HTML region
PATTERN_TYPE_TREE = :PATTERN_TYPE_TREE
# represents an attribute of the node extracted by the parent pattern
PATTERN_TYPE_ATTRIBUTE = :PATTERN_TYPE_ATTRIBUTE
# represents a pattern which filters its output with a regexp
PATTERN_TYPE_REGEXP = :PATTERN_TYPE_REGEXP
# represents a pattern which crawls to the detail page and extracts information from there
PATTERN_TYPE_DETAIL_PAGE = :PATTERN_TYPE_DETAIL_PAGE
# represents a download pattern
PATTERN_TYPE_DOWNLOAD = :PATTERN_TYPE_DOWNLOAD
# write out the HTML subtree beginning at the matched element
PATTERN_TYPE_HTML_SUBTREE = :PATTERN_TYPE_HTML_SUBTREE |
|
VALID_OUTPUT_TYPES | = | [:model, :temp] |
Model patterns are shown in the output
OUTPUT_TYPE_MODEL = :OUTPUT_TYPE_MODEL
# Temp patterns are skipped in the output (their ancestors are appended to the parent of the pattern which was skipped)
OUTPUT_TYPE_TEMP = :OUTPUT_TYPE_TEMP |
|
PATTERN_OPTIONS | = | [:generalize, :type, :output_type, :references, :limit, :default, :resolve] | These options can be set upon wrapper creation | |
VALID_OPTIONS | = | PATTERN_OPTIONS + Scrubyt::CompoundExample::DESCRIPTORS + Scrubyt::ResultNode::OUTPUT_OPTIONS |
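A whitelist built this way lends itself to a simple check of user-supplied options. The following is a minimal sketch, assuming placeholder values: the real contents of `Scrubyt::CompoundExample::DESCRIPTORS` and `Scrubyt::ResultNode::OUTPUT_OPTIONS` are not shown here, so `SKETCH_DESCRIPTORS` and `SKETCH_OUTPUT_OPTIONS` are hypothetical, as is the `unknown_options` helper.

```ruby
PATTERN_OPTIONS = [:generalize, :type, :output_type, :references, :limit, :default, :resolve]

# Hypothetical placeholders for the two lists defined elsewhere in scrubyt.
SKETCH_DESCRIPTORS    = [:hypothetical_descriptor]
SKETCH_OUTPUT_OPTIONS = [:hypothetical_output_option]

VALID_OPTIONS = PATTERN_OPTIONS + SKETCH_DESCRIPTORS + SKETCH_OUTPUT_OPTIONS

# Hypothetical helper: returns the option keys that are not in the whitelist.
def unknown_options(options)
  options.keys - VALID_OPTIONS
end

accepted = unknown_options(:type => :attribute, :limit => 5)
rejected = unknown_options(:type => :tree, :colour => 'red')
```

Here `accepted` is empty because both keys are whitelisted, while `rejected` contains the one key that is not.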
children | [RW] | |
constraints | [RW] | |
evaluation_context | [RW] | |
filters | [RW] | |
indices_to_extract | [RW] | |
last_result | [RW] | |
modifier_calls | [RW] | |
name | [RW] | |
next_page_url | [R] | |
options | [RW] | |
parent | [RW] | |
referenced_extractor | [RW] | |
referenced_pattern | [RW] | |
result_indexer | [R] | |
source_file | [RW] | |
source_proc | [RW] |
Check whether the currently created pattern is a detail pattern (i.e. it references a subextractor). Also check whether the currently created pattern is an ancestor of a detail pattern, and if so, store this in a hash (to be able to traverse the pattern structure on detail pages as well).
Shortcut patterns, as their name says, are a shortcut for creating patterns from predefined rules; for example:
detail_url is equivalent to detail_url 'href', :type => :attribute
i.e. the system figures out on its own that, because of the postfix, the example should be looked up (but it should never override the user input!). Another example (will be available later):
every_img is equivalent to every_img '//img'
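One way to picture the shortcut lookup is a small rule table consulted by pattern name. This is a hedged sketch, not scrubyt's actual code: `SHORTCUT_RULES` and `expand_shortcut` are hypothetical names, and only the two rules mentioned above are modelled.

```ruby
# Hypothetical rule table: a name pattern mapped to the example and the
# options that the shortcut implies.
SHORTCUT_RULES = [
  { :match => /_url\z/,        :example => 'href',  :options => { :type => :attribute } },
  { :match => /\Aevery_img\z/, :example => '//img', :options => {} }
]

# Returns [example, options] for a pattern name. Anything the user passed
# explicitly takes precedence over what the shortcut rule would supply.
def expand_shortcut(name, example = nil, options = {})
  rule = SHORTCUT_RULES.find { |r| name.to_s =~ r[:match] }
  return [example, options] if rule.nil?
  [example || rule[:example], rule[:options].merge(options)]
end

from_postfix = expand_shortcut(:detail_url)
from_name    = expand_shortcut(:every_img)
user_wins    = expand_shortcut(:detail_url, '//a[1]', :type => :tree)
```

The last call shows the rule from the text: when the user supplies an example or options, the shortcut never overrides them.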
Dispatcher function. The class was already too big, so I decided to factor out some methods based on their functionality (like output, adding constraints) to utility classes.
The second function, besides dispatching, is to look up the results in an evaluated wrapper, for example:
camera_data.item[1].item_name[0]
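A lookup chain of this shape is commonly built on Ruby's `method_missing`. The sketch below illustrates that general technique only; whether scrubyt implements it this way is an assumption, and `SketchResult`, its `add` method, and the sample value are hypothetical.

```ruby
# Hypothetical evaluated node: it stores child results by pattern name and
# exposes them as methods, so a lookup reads like a path.
class SketchResult
  attr_reader :value

  def initialize(value = nil)
    @value = value
    @children = Hash.new { |hash, key| hash[key] = [] }
  end

  # Store a child result under a pattern name and return the child.
  def add(name, child)
    @children[name.to_sym] << child
    child
  end

  # Unknown method names are treated as child pattern names and return
  # the array of results stored under that name.
  def method_missing(name, *args, &block)
    return @children[name] if @children.key?(name)
    super
  end

  def respond_to_missing?(name, include_private = false)
    @children.key?(name) || super
  end
end

camera_data = SketchResult.new
camera_data.add(:item, SketchResult.new)
second_item = camera_data.add(:item, SketchResult.new)
second_item.add(:item_name, SketchResult.new('example camera')) # made-up value

looked_up = camera_data.item[1].item_name[0].value
```

Names that were never stored fall through to `super`, so a misspelled pattern name still raises `NoMethodError` rather than silently returning nothing.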