14.7. Using Full Text Search in SPARQL

Virtuoso's triple store supports optional full text indexing of RDF object values since version 5.0. It is possible to declare that objects of triples with a given predicate or graph get indexed. The graphs and triples may be enumerated or a wildcard may be used.

The triples for which a full text index entry exists can be found using the bif:contains or related filters and predicates.

For example, the query:

prefix foaf: <http://xmlns.com/foaf/0.1/>
select *
  from <people>
 where { ?s foaf:name ?name . ?name bif:contains '"rich*"' . }

would match all subjects whose foaf:name contains a word starting with "rich". This would match Richard, Richie, etc.

If the bif:contains or related predicate is applied to an object that is not a string or is not the object of an indexed triple, no match will be found.

The syntax for text patterns is identical to the syntax for the SQL contains predicate.

The SPARQL/SQL optimizer determines whether the text pattern drives the query or filters results after other conditions have been applied. Unlike bif:contains, regexp matching never drives the query and never makes use of an index, so in practice regexps are checked after the other conditions.
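
As a rough sketch of the difference, the two statements below both look for names containing "rich" in the <people> graph used in this section. The assumptions are that the graph is covered by a text indexing rule (see Specifying What to Index) and that the foaf:name predicate is written out as a full IRI. The queries are only roughly comparable: the text pattern matches a word that starts with "rich", while the regular expression matches the substring anywhere in the value.

```sql
-- Text pattern: the optimizer may use the free-text index to drive the query.
sparql
select ?s ?name
  from <people>
 where { ?s <http://xmlns.com/foaf/0.1/name> ?name .
         ?name bif:contains '"rich*"' . };

-- Regular expression: no index is used, so the condition is checked
-- against every ?name produced by the triple pattern.
sparql
select ?s ?name
  from <people>
 where { ?s <http://xmlns.com/foaf/0.1/name> ?name .
         filter (regex (?name, "rich", "i")) };
```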

14.7.1. Specifying What to Index

Whether the object of a given triple is indexed in the text index depends on indexing rules. If at least one indexing rule matches the triple and the object is a string, the object gets indexed. An indexing rule specifies a graph and a predicate. Either may be an IRI or NULL; NULL matches all IRIs.

Rules also have a 'reason', which can be used to group rules into application-specific sets. A triple stops being indexed only after all rules mandating its indexing are removed. When an application requires indexing of a certain set of triples, rules are added for that purpose and tagged with the name of the application as their reason. When the application no longer requires indexing, the rules belonging to it can be removed. This will not turn off indexing if another application still needs certain triples to stay indexed.

Indexing is enabled or disabled for specific graph/predicate combinations with:

create function DB.DBA.RDF_OBJ_FT_RULE_ADD
  (in rule_g varchar, in rule_p varchar, in reason varchar) returns integer
create function DB.DBA.RDF_OBJ_FT_RULE_DEL
  (in rule_g varchar, in rule_p varchar, in reason varchar) returns integer

The first function adds a rule. The first two arguments are the text representations of the IRIs of the graph and the predicate. If NULL is given, all graphs or all predicates match. Specifying both as NULL means that all string-valued objects will be added to the text index.

The second function reverses the effect of the first. Only a rule that actually has been added can be deleted. Thus one cannot say that all except a certain enumerated set should be indexed.

The reason argument is an arbitrary string identifying the application that needs this rule. Two applications can add the same rule; if one of them removes it, the rule stays in effect for the other. If an object is indexed due to more than one rule, the index data remain free from duplicates, so neither index size nor speed is affected.

If DB.DBA.RDF_OBJ_FT_RULE_ADD detects that DB.DBA.RDF_QUAD contains quads whose graphs and/or predicates match the new rule but which were not indexed before under other rules, these quads are indexed automatically. However, the function DB.DBA.RDF_OBJ_FT_RULE_DEL does not remove indexing data about related objects. Thus the presence of indexing data about an object does not imply that the object is used in some quad that matches some rule.

Both functions return one if the rule is added or deleted and zero if the call was redundant (the rule had been added before, or there is no rule to delete).
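
The following sketch shows how the reason argument might be used by two applications that need the same rule. The reason strings 'app_a' and 'app_b' are hypothetical, and the comments only restate the behavior described above; they are not captured output.

```sql
-- 'app_a' and 'app_b' are made-up application identifiers.
-- First call for this graph/predicate/reason: described above as returning 1.
select DB.DBA.RDF_OBJ_FT_RULE_ADD ('people', 'http://xmlns.com/foaf/0.1/name', 'app_a');

-- Same call repeated: redundant, described above as returning 0.
select DB.DBA.RDF_OBJ_FT_RULE_ADD ('people', 'http://xmlns.com/foaf/0.1/name', 'app_a');

-- A second application asks for the same graph and predicate under its own reason.
select DB.DBA.RDF_OBJ_FT_RULE_ADD ('people', 'http://xmlns.com/foaf/0.1/name', 'app_b');

-- 'app_a' withdraws its rule; foaf:name objects in <people> stay indexed
-- because the rule added by 'app_b' still applies.
select DB.DBA.RDF_OBJ_FT_RULE_DEL ('people', 'http://xmlns.com/foaf/0.1/name', 'app_a');
```

Because a deletion only removes a rule that was added, excluding a few predicates from a wildcard rule is not possible; the predicates to index have to be listed one rule at a time instead.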


-- We load Tim Berners-Lee's FOAF file into a graph called people.

DB.DBA.RDF_LOAD_RDFXML (http_get ('http://www.w3.org/People/Berners-Lee/card#i'), 'no', 'people');

-- We check how many triples we got.

select count (*) from (sparql select * from <people> where {?s ?p ?o})f;

-- We specify that all string objects in the graph people should be text indexed.

DB.DBA.RDF_OBJ_FT_RULE_ADD ('people', null, 'people');

-- We update the text index. See below on how to keep the text index automatically updated.

DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ ();


-- We ask for the subjects and predicates of all triples in <people> where the object is a string which contains a word beginning with TIM.

sparql select * from <people> where { ?s ?p ?o . ?o bif:contains '"TIM*"' .};

s                p                o
VARCHAR          VARCHAR          VARCHAR
_______________________________________________________________________________

http://no        http://purl.org/dc/elements/1.1/title  Tim Berners-Lee's FOAF file
http://www.w3.org/People/Berners-Lee/card#i  http://xmlns.com/foaf/0.1/name  Timothy Berners-Lee
http://www.w3.org/People/Berners-Lee/card#i  http://www.w3.org/2000/01/rdf-schema#label  Tim Berners-Lee
http://www.w3.org/People/Berners-Lee/card#i  http://xmlns.com/foaf/0.1/givenname  Timothy
http://www.w3.org/People/Berners-Lee/card#i  http://xmlns.com/foaf/0.1/nick  TimBL
http://www.w3.org/People/Berners-Lee/card#i  http://xmlns.com/foaf/0.1/nick  timbl
http://dig.csail.mit.edu/breadcrumbs/blog/4  http://purl.org/dc/elements/1.1/title  timbl's blog

7 Rows. -- 2 msec.

The query below is equivalent to the one above but uses the filter syntax. The filter syntax is more flexible in that it allows passing extra options to the contains predicate. These may be useful in the future.

sparql select * from <people> where { ?s ?p ?o . filter (bif:contains(?o,  '"TIM*"')) };

Note:

It is better to upgrade to the latest version of Virtuoso before adding free-text rules for the first time. The upgrade is especially advisable when large amounts of text are to be indexed. The reason is that the free-text index on RDF may change in future versions, and an automatic upgrade of existing index data to a new format may take much more time than indexing from scratch.

The table DB.DBA.RDF_OBJ_FT_RULES stores the list of free-text index configuration rules.

create table DB.DBA.RDF_OBJ_FT_RULES (
  ROFR_G varchar not null,       -- specific graph IRI or NULL for "all graphs"
  ROFR_P varchar not null,       -- specific predicate IRI or NULL for "all predicates"
  ROFR_REASON varchar not null,  -- identification string of a creator, preferably human-readable
  primary key (ROFR_G, ROFR_P, ROFR_REASON) );

Applications may read from this table but should not write to it directly. Numerous duplicates in rules do not affect the speed of free-text index operations because the content of the table is cached in memory in a special way. Unlike the configuration functions, a direct write to the table will not update that cache.

The table is convenient for finding the rules added by a given application. If a unique identification string is used as the reason when an application adds rules during installation, it is easy to remove those rules on uninstall.
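
A minimal sketch of that workflow, assuming the rule added earlier in this section with the reason 'people' is the only one the application owns:

```sql
-- List every rule registered under the reason 'people' (read-only access to the table).
select ROFR_G, ROFR_P
  from DB.DBA.RDF_OBJ_FT_RULES
 where ROFR_REASON = 'people';

-- Remove the rule through the configuration function rather than by deleting
-- the row, so that the in-memory cache of rules is updated as well.
DB.DBA.RDF_OBJ_FT_RULE_DEL ('people', null, 'people');
```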


14.7.2. Time of Indexing

By default the triple store's text index is in manual batch mode. This means that changes in triples are applied to the text index in batches rather than being maintained in strict synchrony, which is much more efficient than keeping the indices constantly synchronized. This setting may be altered with the DB.DBA.VT_BATCH_UPDATE stored procedure.

To force synchronization of the RDF text index, use:

DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ ();

To set the text index to follow the triples in real time, use:

DB.DBA.VT_BATCH_UPDATE ('DB.DBA.RDF_OBJ', 'OFF', null);

To set the text index to be updated every 10 minutes, use:

db.dba.vt_batch_update ('DB.DBA.RDF_OBJ', 'ON', 10);

To make the update always manual, specify NULL as the last argument above.

An additional problem related to the free-text index of DB.DBA.RDF_QUAD is that some applications (e.g. an import of billions of triples) may turn triggers off. This leaves the free-text index data incomplete. Calling the procedure DB.DBA.RDF_OBJ_FT_RECOVER () inserts all missing free-text index items by dropping and re-inserting every existing free-text index rule.
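
Put together, a maintenance sequence around a large import might look like the sketch below. It uses only the procedures named in this section; the import step itself is tool-specific and not shown, and the final explicit synchronization is a precaution that may be redundant after the recovery call.

```sql
-- 1. Keep the text index in manual batch mode while the import runs.
DB.DBA.VT_BATCH_UPDATE ('DB.DBA.RDF_OBJ', 'ON', null);

-- 2. ... run the import here (possibly with triggers off) ...

-- 3. Re-create free-text index items that were missed during the import.
DB.DBA.RDF_OBJ_FT_RECOVER ();

-- 4. Apply any pending changes to the text index.
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ ();

-- 5. Return to periodic updates, here every 10 minutes.
DB.DBA.VT_BATCH_UPDATE ('DB.DBA.RDF_OBJ', 'ON', 10);
```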


14.7.3. Free-Text Indexes on RDF Views

If the O field of a quad map pattern gets its value from a database column that has a free-text index, this index can be used in SPARQL for efficient text search. As a variant, the free-text index of an additional table may be used.

If a quad map pattern declaration starts with a declaration of table aliases, the declaration of a table alias may include the name of the table column that has a text index. Consider the possible use of a free-text index on the content of DAV resources stored in the DAV system tables of Virtuoso:

prefix mydav: <...>
create quad storage mydav:metadata
from WS.WS.SYS_DAV_RES as dav_resource text literal RES_CONTENT
...
  {
    ...
    mydav:resource-iri (dav_resource.RES_FULL_PATH)
        a mydav:resource ;
        mydav:resource-content dav_resource.RES_CONTENT ;
        mydav:resource-mime-type dav_resource.RESTYPE ;
    ...
  }

The clause text literal RES_CONTENT grants the SPARQL compiler permission to use the free-text index for objects that are literals composed from the column dav_resource.RES_CONTENT. The clause also chooses between the two kinds of text index: text literal (supports the contains() predicate only) and text xml literal (supports both contains() and xcontains()). It is important to understand that the free-text index produces results using the raw relational data. If a literal class transformation changes the text stored in the column, these changes are ignored by free-text search. For example, if a transformation concatenates a word to the value of the column, free-text search will not find this word.

The free-text index may be used in a more sophisticated way. Consider the built-in table DB.DBA.RDF_QUAD, which does not have a free-text index. Moreover, the table does not contain the full values of all objects: the O column contains "short enough" values inlined, but long and special values are represented by links to the DB.DBA.RDF_OBJ table. The RDF_OBJ table, however, has a free-text index that can be used. The full declaration of the built-in default mapping for the default storage could be written this way:

-- Important! Do not try to execute on live system
-- without prior changing names of storage and quad map pattern!

sparql
create virtrdf:DefaultQuadMap as
graph rdfdf:default-iid-nonblank (DB.DBA.RDF_QUAD.G)
subject rdfdf:default-iid (DB.DBA.RDF_QUAD.S)
predicate rdfdf:default-iid-nonblank (DB.DBA.RDF_QUAD.P)
object rdfdf:default (DB.DBA.RDF_QUAD.O)

create quad storage virtrdf:DefaultQuadStorage
from DB.DBA.RDF_QUAD as physical_quad
from DB.DBA.RDF_OBJ as physical_obj text xml literal RO_DIGEST of (physical_quad.O)
where (^{physical_quad.}^.O = ^{physical_obj.}^.RO_DIGEST)
  {
    create virtrdf:DefaultQuadMap as
      graph rdfdf:default-iid-nonblank (physical_quad.G)
      subject rdfdf:default-iid (physical_quad.S)
      predicate rdfdf:default-iid-nonblank (physical_quad.P)
      object rdfdf:default (physical_quad.O) .
  }
;

The reference to the free-text index is extended by the clause of (physical_quad.O). This means that the free-text index on DB.DBA.RDF_OBJ.RO_DIGEST will be used when the object value comes from physical_quad.O, as if physical_quad.O were indexed itself. If a SPARQL query invokes virtrdf:DefaultQuadMap but contains no free-text criteria, only DB.DBA.RDF_QUAD appears in the final SQL statement and no join with DB.DBA.RDF_OBJ is made. Adding a free-text predicate adds DB.DBA.RDF_OBJ to the list of source tables, together with a join condition on DB.DBA.RDF_QUAD.O and DB.DBA.RDF_OBJ.RO_DIGEST, and it adds a contains (RO_DIGEST, ...) predicate, not contains (O, ...). As a result, "you pay only for what you use": adding a free-text index to the declaration does not add tables to the query unless the index is actually used.

The boolean functions bif:contains and bif:xcontains are used for objects that come from RDF Views as well as for regular "physical" triples. Each function takes two arguments and returns a boolean value. The first argument is a local variable. This variable should be used as an object field in the group pattern where the filter condition is placed, and its occurrence in the object field should be placed before the filter. If the variable occurs in several object fields, the free-text search is associated with the rightmost occurrence that is still to the left of the filter. The triple pattern that contains this rightmost occurrence is called the "intake" of the free-text search.

When the SPARQL compiler chooses the quad map patterns that may generate data matching the intake triple pattern, it skips quad map patterns that have no declared free-text index, because nothing can be found by free-text search in data that have no free-text index. Every quad map pattern that has a free-text index will finally produce an invocation of the SQL contains or xcontains predicate, so the whole result of a free-text search may be a union of free-text searches from different quad map patterns.

The logic described above matters only in really complicated cases; simple queries are self-evident:

select * from <my-dav-graph>
where {
    ?resource a mydav:resource ;
        mydav:resource-content ?text .
    filter (bif:contains (?text, "hello and world")) }

or, more compactly,

select * from <my-dav-graph>
where {
    ?resource a mydav:resource ;
        mydav:resource-content ?text .
    ?text bif:contains "hello and world" . }
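
For contrast with the simple queries above, the sketch below shows a case where the choice of intake matters. It is an illustration built on the <people> graph loaded earlier in this section, with foaf:name written as a full IRI; it is not taken from the Virtuoso test suite. The variable ?o occurs in two object fields before the filter, so by the rule described above the second triple pattern, being the rightmost occurrence to the left of the filter, is the intake of the free-text search.

```sql
-- First pattern:  ?s <http://xmlns.com/foaf/0.1/name> ?o   (not the intake)
-- Second pattern: ?s ?p ?o                                 (intake: rightmost
--                 occurrence of ?o to the left of the filter)
sparql
select ?s ?p ?o
  from <people>
 where {
    ?s <http://xmlns.com/foaf/0.1/name> ?o .
    ?s ?p ?o .
    filter (bif:contains (?o, '"TIM*"')) };
```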