1.8 Fields and XML

XML is a good context for exploring field searching. The following examples make use of the xml document type, which supports nested fields (i.e. fields within fields).

Suppose we index the following XML data, contained in a file called jones.xml:

     <Document>
        <Author>
           <Name>
              <FirstName> Tom </FirstName>
              <LastName> Jones </LastName>
           </Name>
        </Author>
     </Document>

with the following command:

     $ af -i -d mydb -C -t xml -v jones.xml

The xml document type views this document as containing two words, `Tom' and `Jones', each located at a certain field path within the document:

     /Document/_c/Author/_c/Name/_c/FirstName/_c/Tom
     /Document/_c/Author/_c/Name/_c/LastName/_c/Jones
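To make the mapping from elements to field paths concrete, here is a minimal Python sketch that derives the two paths above from the contents of jones.xml. It only illustrates the model described in this section and is not Amberfish's own parser; the function name field_paths and the splitting of words on whitespace are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

JONES_XML = """
<Document>
   <Author>
      <Name>
         <FirstName> Tom </FirstName>
         <LastName> Jones </LastName>
      </Name>
   </Author>
</Document>
""".strip()


def field_paths(element, prefix=""):
    """Yield one field path per word of element content.

    Hypothetical helper: each element contributes its name followed by
    the special field '_c', and each word ends a path.
    """
    here = f"{prefix}/{element.tag}/_c"
    # Words that appear before the first child element.
    for word in (element.text or "").split():
        yield f"{here}/{word}"
    for child in element:
        yield from field_paths(child, here)
        # Words after a child's closing tag still belong to this element.
        for word in (child.tail or "").split():
            yield f"{here}/{word}"


paths = list(field_paths(ET.fromstring(JONES_XML)))
for path in paths:
    print(path)
```

Elements that contain only other elements, such as Author, contribute no words of their own, so they appear only as intermediate fields in the paths of the words nested beneath them.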

The `/' character separates the components of a path. The final component is the indexed word, and in this case each field except `_c' corresponds to an XML element. (Below we shall see an example in which a field corresponds to an XML attribute.) The `_c' is a special field defined by xml that means “element content.” Thus the following search:

     $ af -s -d mydb -q '/Document/_c/Author/_c/Name/_c/LastName/_c/Jones'

returns jones.xml as a match. The following queries also match jones.xml:

     $ af -s -d mydb -q '/.../Document/_c/Author/_c/Name/_c/LastName/_c/Jones'
     $ af -s -d mydb -q '/.../_c/Author/_c/Name/_c/LastName/_c/Jones'
     $ af -s -d mydb -q '/.../Author/_c/Name/_c/LastName/_c/Jones'
     $ af -s -d mydb -q '/.../_c/Name/_c/LastName/_c/Jones'
     $ af -s -d mydb -q '/.../Name/_c/LastName/_c/Jones'
     $ af -s -d mydb -q '/.../_c/LastName/_c/Jones'
     $ af -s -d mydb -q '/.../LastName/_c/Jones'
     $ af -s -d mydb -q '/.../_c/Jones'
     $ af -s -d mydb -q '/.../Jones'
     $ af -s -d mydb -q 'Jones'

Amberfish defines `...' as “a sequence of zero or more fields.” A `/.../' that begins a field path can be left out completely. For example, these two queries yield the same results:

     $ af -s -d mydb -q '/.../LastName/_c/Jones'
     $ af -s -d mydb -q 'LastName/_c/Jones'
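The matching rule for `...' can be sketched in a few lines of Python. This is a toy model of the behavior described here, not Amberfish's implementation: the names parse, matches and query_matches are hypothetical, words are compared exactly (case handling is not modeled), and a query that begins with `/' is assumed to be anchored at the top of the document.

```python
PATH = "/Document/_c/Author/_c/Name/_c/LastName/_c/Jones"


def parse(query):
    """Split a query or path into its components.

    A query with no leading '/' is treated as if it began with '/.../'.
    """
    if not query.startswith("/"):
        query = "/.../" + query
    return query.strip("/").split("/")


def matches(pattern, path):
    """Return True if the pattern components cover the whole path."""
    if not pattern:
        return not path
    head, rest = pattern[0], pattern[1:]
    if head == "...":
        # '...' may absorb zero or more leading components of the path.
        return any(matches(rest, path[i:]) for i in range(len(path) + 1))
    return bool(path) and path[0] == head and matches(rest, path[1:])


def query_matches(query, path):
    return matches(parse(query), parse(path))


QUERIES = [
    "/Document/_c/Author/_c/Name/_c/LastName/_c/Jones",
    "/.../LastName/_c/Jones",
    "LastName/_c/Jones",
    "Jones",
    "/LastName/_c/Jones",  # anchored at the top in this model, so no match
]

for query in QUERIES:
    print(f"{query!r}: {query_matches(query, PATH)}")
```

In this model the first four queries match the path and the last does not, because without `...' the field LastName would have to be the outermost element.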

The `...' can be used anywhere within a field path. For example, the following queries match jones.xml:

     $ af -s -d mydb -q '/Document/_c/Author/_c/Name/.../Jones'
     $ af -s -d mydb -q 'Name/.../LastName/.../Jones'

The first of the two queries above matches `Jones' anywhere within the author's name, not only in the last name. The second matches only a last name of Jones, but not necessarily the author's; for example, it would also match a document containing the following fragment:

     <Bibliography>
        <Reference Type="book">
           <Title> Text searching the old fashioned way. </Title>
           <Name>
              <FirstName> Indiana </FirstName>
              <LastName> Jones </LastName>
           </Name>
        </Reference>
     </Bibliography>

Other queries that would match the above fragment:

     $ af -s -d mydb -q 'Reference/_a/Type/book'
     $ af -s -d mydb -q 'Reference/_a/.../book'
     $ af -s -d mydb -q 'Reference/.../book'

The `_a' is another special field defined by xml that means “attribute content.” Thus `_c' and `_a' allow one to distinguish between element and attribute searching if desired. In constructing queries for this document type, it is always necessary to specify `_c', `_a', or `...' after an element field name and before the next field name or the search word.
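The following Python sketch shows where `_a' would sit in the field paths of the bibliography fragment, assuming that attributes are laid out as the queries above suggest (element name, `_a', attribute name, then the words of the value). As before, field_paths is a hypothetical helper and not part of Amberfish, and splitting on whitespace is an assumption; punctuation handling is not modeled.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<Bibliography>
   <Reference Type="book">
      <Title> Text searching the old fashioned way. </Title>
      <Name>
         <FirstName> Indiana </FirstName>
         <LastName> Jones </LastName>
      </Name>
   </Reference>
</Bibliography>
""".strip()


def field_paths(element, prefix=""):
    """Yield field paths for attribute words ('_a') and content words ('_c')."""
    base = f"{prefix}/{element.tag}"
    # Attribute values: element/_a/attribute-name/word
    for name, value in element.attrib.items():
        for word in value.split():
            yield f"{base}/_a/{name}/{word}"
    # Element content: element/_c/word, with child elements nested under _c
    for word in (element.text or "").split():
        yield f"{base}/_c/{word}"
    for child in element:
        yield from field_paths(child, f"{base}/_c")
        for word in (child.tail or "").split():
            yield f"{base}/_c/{word}"


paths = list(field_paths(ET.fromstring(FRAGMENT)))
for path in paths:
    if "/_a/" in path:
        print(path)
```

Because every element name in these paths is immediately followed by `_c' or `_a', a query that names an element and then jumps straight to a word or to another field name has nothing to match against, which is why one of `_c', `_a', or `...' is needed at that position.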

Phrase searching with fields is done this way:

     $ af -s -d mydb -q 'Title/.../"text searching"'

or in a multiple term expression:

     $ af -s -d mydb -Q 'Title/.../"text searching" &
                         Name/.../Indiana & Name/.../Jones'