Class Scrubyt::Constraint
In: lib/scrubyt/core/scraping/constraint.rb
Parent: Object

Rejecting result instances based on further rules

The two most basic problems with a set of rules are that it matches either fewer or more instances than we would like. Constraints are a way to remedy the second problem: they serve as a tool to filter out some result instances based on additional rules. A typical example:

  • ensure_presence_of_ancestor_pattern: consider this model:
      <book>
        <author>...</author>
        <title>...</title>
      </book>
    

If I attach the ensure_presence_of_ancestor_pattern constraint to the pattern ‘book’ with the values ‘author’ and ‘title’, only those books are matched that have both an author and a title (i.e. the child patterns author and title must each extract something). This is a way to say ‘a book MUST have an author and a title’.
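
The effect of this constraint can be illustrated in plain Ruby, without the scrubyt API. In the sketch below, the hash layout and the helper `keep_with_children` are hypothetical stand-ins, not part of the library: each result instance is modeled as a hash of child pattern results, and an instance is kept only if every required child pattern extracted something.

```ruby
# Hypothetical stand-in for extracted results: one hash per 'book' instance,
# mapping child pattern names to what they extracted (nil = nothing extracted).
books = [
  { 'author' => 'Jane Doe', 'title' => 'First Book' },
  { 'author' => nil,        'title' => 'Anonymous Book' },
  { 'author' => 'John Roe', 'title' => '' }
]

# Keep only the instances whose required child patterns all extracted
# something (nil, empty and whitespace-only values count as "nothing").
def keep_with_children(instances, required)
  instances.select do |instance|
    required.all? { |name| !instance[name].to_s.strip.empty? }
  end
end

kept = keep_with_children(books, %w[author title])
kept.each { |book| puts "#{book['author']}: #{book['title']}" }
```

Only the first book survives here, because it is the only one for which both ‘author’ and ‘title’ extracted a non-empty value.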

Methods

Constants

CONSTRAINT_TYPE_ENSURE_PRESENCE_OF_PATTERN = 0   Different constraint types
CONSTRAINT_TYPE_ENSURE_PRESENCE_OF_ATTRIBUTE = 1
CONSTRAINT_TYPE_ENSURE_ABSENCE_OF_ATTRIBUTE = 2
CONSTRAINT_TYPE_ENSURE_PRESENCE_OF_ANCESTOR_NODE = 3
CONSTRAINT_TYPE_ENSURE_ABSENCE_OF_ANCESTOR_NODE = 4

Attributes

target  [R] 
type  [R] 

Public Class methods

If this type of constraint is added to a pattern, the HTML node extracted by the pattern must NOT have an HTML ancestor node called ‘node_name’ with the attribute set ‘attributes’.

"attributes" is an array of hashes, for example [{‘font’ => ‘red’}, {‘href’ => ‘www.google.com’}]. If several values have to be checked for the same key (e.g. ‘class’ => ‘small’ and ‘class’ => ‘wide’), it has to be written as [{‘class’ => [‘small’, ‘wide’]}].

"attributes" can be empty; in this case only the ‘node_name’ is checked.

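As a rough illustration of the "attributes" format, the following plain-Ruby sketch walks up a node's parent chain and tests each ancestor against such an array of hashes. `SketchNode`, `attributes_match?` and `ancestor_present?` are hypothetical names invented for this example, and the matching rule (every listed value must equal the attribute or be one of its whitespace-separated tokens) is an assumption, not necessarily what scrubyt does internally.

```ruby
# Hypothetical node model: a tag name, an attribute hash and a parent link.
SketchNode = Struct.new(:name, :attributes, :parent)

# True if the node satisfies every hash in the spec. A value may be a single
# string or an array of strings; each must equal the attribute value or be
# one of its whitespace-separated tokens (assumed matching rule).
def attributes_match?(node, spec)
  spec.all? do |hash|
    hash.all? do |key, expected|
      actual = node.attributes[key].to_s
      tokens = actual.split
      Array(expected).all? { |value| actual == value || tokens.include?(value) }
    end
  end
end

# True if some ancestor is called node_name and matches the spec.
# An empty spec means that only the node name is checked.
def ancestor_present?(node, node_name, spec = [])
  current = node.parent
  while current
    return true if current.name == node_name && attributes_match?(current, spec)
    current = current.parent
  end
  false
end

table = SketchNode.new('table', { 'class' => 'small wide' }, nil)
row   = SketchNode.new('tr', {}, table)
cell  = SketchNode.new('td', {}, row)

puts ancestor_present?(cell, 'table', [{ 'class' => %w[small wide] }])
puts ancestor_present?(cell, 'table', [{ 'class' => 'narrow' }])
puts ancestor_present?(cell, 'div')
```

An ‘ensure absence’ constraint would pass exactly when `ancestor_present?` returns false for the extracted node.
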
If this type of constraint is added to a pattern, the HTML node it targets must NOT have an attribute named "attribute_name" with the value "attribute_value"

If this type of constraint is added to a pattern, the HTML node extracted by the pattern must have an HTML ancestor node called ‘node_name’ with the attribute set ‘attributes’.

"attributes" is an array of hashes, for example [{‘font’ => ‘red’}, {‘href’ => ‘www.google.com’}]. If several values have to be checked for the same key (e.g. ‘class’ => ‘small’ and ‘class’ => ‘wide’), it has to be written as [{‘class’ => [‘small’, ‘wide’]}].

"attributes" can be empty; in this case only the ‘node_name’ is checked.

If this type of constraint is added to a pattern, the HTML node it targets must have an attribute named "attribute_name" with the value "attribute_value"

If this type of constraint is added to a pattern, the pattern must have an ancestor pattern denoted by "ancestor" (as used here, this means a descendant in the pattern tree: a child pattern, or child pattern of a child pattern, etc.). ‘Has an ancestor pattern’ means that the ancestor pattern actually extracts something (judging by the wrapper model alone, the ancestor pattern is always present). Note that there is no ‘ensure_absence’ version of this constraint type, since I could not think of a use case for it.

We would not like these to be called from outside

Public Instance methods

Evaluate the constraint; if this method returns true, it means that the constraint passed, i.e. its filter will be added to the extracted content of the pattern
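
How a type constant and a target could drive such an evaluation can be sketched as follows. This is not the library's implementation: the module `ConstraintSketch`, its `check` method and the shape of the target (a hash of attribute name to value) are assumptions made for illustration, and only the two attribute constraint types are modeled.

```ruby
# Hypothetical re-creation of two constraint types, for illustration only.
module ConstraintSketch
  ENSURE_PRESENCE_OF_ATTRIBUTE = 1
  ENSURE_ABSENCE_OF_ATTRIBUTE  = 2

  # A constraint is modeled as a type plus a target; for attribute
  # constraints the target is assumed to be a hash { name => value }.
  Constraint = Struct.new(:type, :target) do
    # Returns true if the given attribute hash satisfies the constraint.
    def check(attributes)
      present = target.all? { |name, value| attributes[name] == value }
      case type
      when ENSURE_PRESENCE_OF_ATTRIBUTE then present
      when ENSURE_ABSENCE_OF_ATTRIBUTE  then !present
      else raise ArgumentError, "unsupported constraint type: #{type}"
      end
    end
  end
end

constraint = ConstraintSketch::Constraint.new(
  ConstraintSketch::ENSURE_ABSENCE_OF_ATTRIBUTE, { 'class' => 'ad' }
)
results = [{ 'class' => 'ad' }, { 'class' => 'content' }]

# Result instances for which the constraint does not pass are rejected.
kept = results.select { |attributes| constraint.check(attributes) }
puts kept.inspect
```

Dispatching on the type keeps the presence and absence variants symmetric: both share one predicate, and the absence variant simply negates it.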
