A document can consist of an entire file or a portion of a file. Amberfish records `begin' and `end' byte offsets for each document to mark where the document lies within the file that contains it. By default the whole file is treated as a single document. For example, a file called sample.txt that is 12000 bytes in size will be indexed as a single document with `begin' and `end' byte offsets of 0 and 12000, respectively.
The af tool provides the --split option to tell Amberfish that the files to be indexed contain multiple documents. The option takes a string delimiter that marks the boundaries between documents in a file. For example:
$ af -i -d mydb -C --split '#####' -v *.txt
As the *.txt files are indexed, they are scanned for the string `#####'. Each instance of `#####' is interpreted as the beginning of a new document, and each new document is indexed individually. Note that each instance of `#####' is considered to be part of the document that follows it, not the document that precedes it. If the string delimiter includes text, rather than only characters such as `#####', that text will (normally) be indexed as text.
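The splitting rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Amberfish's own code: the function name split_offsets is hypothetical, `end' is assumed to be the offset just past the last byte of the document (consistent with the 0 and 12000 example), and the handling of text that precedes the first delimiter is an assumption.

```python
def split_offsets(data: bytes, delimiter: bytes) -> list[tuple[int, int]]:
    """Return (begin, end) byte offsets for each document in data.

    Hypothetical helper that illustrates the rule; not part of Amberfish.
    """
    if not delimiter:
        # No delimiter given: the whole file is a single document.
        return [(0, len(data))]

    # Every occurrence of the delimiter begins a new document, so the
    # delimiter itself belongs to the document that follows it.
    starts = []
    pos = data.find(delimiter)
    while pos != -1:
        starts.append(pos)
        pos = data.find(delimiter, pos + len(delimiter))

    # Assumption: any text before the first delimiter is its own document.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)

    # Each document ends where the next one begins; the last ends at EOF.
    ends = starts[1:] + [len(data)]
    return [(b, e) for b, e in zip(starts, ends) if e > b]
```

For the bytes b"intro\n#####one\n#####two\n" and the delimiter b"#####", this sketch yields three documents, and the second and third each begin with the delimiter.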
The division of files into multiple documents can be verified with af -l after the files have been added to the database (see Listing database information).
The af --fetch command prints a portion of a file to standard output:
$ af --fetch filename begin end
where `filename', `begin', and `end' are taken from the output of af -s (see Searching) or af -l (see Listing database information).
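As a rough illustration of what fetching a document by byte offsets involves, the following Python sketch reads the bytes between `begin' and `end' from a file. The function name fetch_range is hypothetical, and the sketch assumes that `end' is the offset just past the last byte of the document; it is not the implementation used by af.

```python
def fetch_range(filename: str, begin: int, end: int) -> bytes:
    """Return the bytes of filename from offset begin up to, but not
    including, offset end.

    Hypothetical helper; the exclusive `end' offset is an assumption.
    """
    if begin < 0 or end < begin:
        raise ValueError("offsets must satisfy 0 <= begin <= end")
    # Open in binary mode so that offsets count bytes, not characters.
    with open(filename, "rb") as f:
        f.seek(begin)
        return f.read(end - begin)
```

Reading in binary mode matters here because the offsets are byte offsets; a text-mode read could miscount positions in files that contain multi-byte characters.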
The --split option does not work with the xml document type, which uses a different method of dividing files into documents (see More about XML).