Namazu User's Manual


You can get the latest news about Namazu at http://www.namazu.org/. Namazu is free software distributed under the terms of the GNU General Public License version 2, with ABSOLUTELY NO WARRANTY.

Namazu components

Namazu is a full-text search engine. It consists of an index maker, the mknmz command, and a text searcher, the namazu command.

To search a large number of documents quickly, Namazu builds an index in advance. The concept is similar to the index of a book.
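
As a rough illustration of what building an index in advance means, the following Python sketch builds a toy inverted index that maps each word to the documents containing it. It is not Namazu's actual NMZ.* format or algorithm; the function names and the sample documents are made up for this example.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, term):
    """Look up a single term in the index without rescanning the documents."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample documents, used only for illustration.
docs = {
    "a.html": "Namazu is a full-text search engine",
    "b.txt": "An index makes search fast",
}
index = build_index(docs)
print(search(index, "search"))  # ['a.html', 'b.txt']
print(search(index, "namazu"))  # ['a.html']
```

The documents are scanned once when the index is built; each later search is a dictionary lookup.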

The mknmz command makes the index. The target directory for indexing is given as an argument to mknmz. For example, if the target directory is /home/foo/public_html, type

% mknmz /home/foo/public_html

Documents such as *.html and *.txt under /home/foo/public_html are then indexed, and NMZ.* files are created in the directory where you ran mknmz. The NMZ.* files are Namazu's index.

The namazu command searches the created index. For example:

% namazu bar /home/foo/Namazu/foobar

The above searches the index under /home/foo/Namazu/foobar for the keyword "bar".

mknmz command

mknmz's options

    mknmz 2.0.5, an indexer of Namazu.
    
    Usage: mknmz [options] <target>...
    
    Target files:
      -a, --all                target all files.
      -t, --media-type=MTYPE   set the media type for all target files to MTYPE.
      -h, --mailnews           same as --media-type='message/rfc822'
          --mhonarc            same as --media-type='text/html; x-type=mhonarc'
      -F, --target-list=FILE   load FILE which contains a list of target files.
          --allow=PATTERN      set PATTERN for file names which should be allowed.
          --deny=PATTERN       set PATTERN for file names which should be denied.
          --exclude=PATTERN    set PATTERN for pathnames which should be excluded.
      -e, --robots             exclude HTML files containing
                               <meta name="ROBOTS" content="NOINDEX">
      -M, --meta               handle HTML meta tags for field-specified search.
      -r, --replace=CODE       set CODE for replacing URI.
          --html-split         split an HTML file with <a name="..."> anchors.
          --mtime=NUM          limit by mtime just like find(1)'s -mtime option.
                               e.g., -50 for recent 50 days, +50 for older than 50.
    
    Morphological Analysis:
      -c, --use-chasen         use ChaSen for analyzing Japanese.
      -k, --use-kakasi         use KAKASI for analyzing Japanese.
      -m, --use-chasen-noun    use ChaSen for extracting only nouns.
    
    Text Operations:
      -E, --no-edge-symbol     remove symbols on edge of word.
      -G, --no-okurigana       remove Okurigana in word.
      -H, --no-hiragana        ignore words consist of Hiragana only.
      -K, --no-symbol          remove symbols.
    
    Summarization:
      -U, --no-encode-uri      do not encode URI.
      -x, --no-heading-summary do not make summary with HTML's headings.
    
    Index Construction:
          --update=INDEX       set INDEX for updating.
      -Y, --no-delete          do not detect removed documents.
      -Z, --no-update          do not detect update and deleted documents.
    
    Miscellaneous:
      -s, --checkpoint         turn on the checkpoint mechanism.
      -C, --show-config        show the current configuration.
      -f, --config=FILE        use FILE as a config file.
      -I, --include=FILE       include your customization FILE.
      -O, --output-dir=DIR     set DIR to output the index.
      -T, --template-dir=DIR   set DIR having NMZ.{head,foot,body}.*.
      -q, --quiet              suppress status messages during execution.
      -v, --version            show the version of namazu and exit.
      -V, --verbose            be verbose.
          --debug              be debug mode.
          --help               show this help and exit.
    
    Report bugs to <bug-namazu@namazu.org>.

mknmzrc settings

Various settings can be made in mknmzrc or .mknmzrc. mknmz normally reads configuration files in the following order:

  1. $(sysconfdir)/$(PACKAGE)/mknmzrc
    Usually, /usr/local/etc/namazu/mknmzrc
  2. ~/.mknmzrc
  3. The file specified by the -f or --config=FILE option.

If more than one configuration file is found, all of them are loaded.
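
The load order above can be sketched in Python as follows. This is only an illustration, not mknmz's own code: the function names are hypothetical, and the values passed for sysconfdir, package, and home are assumptions. The manual does not say here how settings from several files are combined, so the sketch only orders the paths and keeps those that exist.

```python
import os.path
import posixpath


def config_candidates(sysconfdir, package, home, cli_file=None):
    """Return the mknmz configuration files in the documented load order."""
    paths = [
        posixpath.join(sysconfdir, package, "mknmzrc"),
        posixpath.join(home, ".mknmzrc"),
    ]
    if cli_file is not None:  # given with -f or --config=FILE
        paths.append(cli_file)
    return paths


def files_to_load(candidates, exists=os.path.isfile):
    """Keep every candidate that exists; all of them are loaded, in order."""
    return [path for path in candidates if exists(path)]


# Assumed example values for $(sysconfdir), $(PACKAGE), and the home directory.
candidates = config_candidates("/usr/local/etc", "namazu", "/home/foo", "my.conf")
print(candidates)
```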

Installation prepares a sample configuration file $(sysconfdir)/$(PACKAGE)/mknmzrc-sample. You can copy this to $(sysconfdir)/$(PACKAGE)/mknmzrc or to ~/.mknmzrc in your home directory.

The setting details are given as comments in mknmzrc-sample.

Document filters

mknmz automatically identifies the type of each target file and applies the appropriate document filter. For HTML documents, filtering includes extracting the <title> and removing HTML tags. This filtering is performed by the document filters in $(datadir)/$(PACKAGE)/filter. The standard document filters are described below.

gzip.pl
Handle a gzipped file
Requirement: gzip command or Compress::Zlib perl module.
bzip2.pl
Handle a bzip2-ed file
Requirement: bzip2 command.
compress.pl
Handle a compress-ed file.
Requirement: compress command.
excel.pl
Handle a Microsoft Excel file.
Requirement: xlHtml
Suggested software: lv (needed only for Japanese documents)
hnf.pl
Handle a file of Hyper NIKKI System Project.
Requirement: hnf filter is special. It requires namazu_for_hns of Hyper NIKKI System Project.
html.pl
Handle an HTML file.
Requirement: None
mailnews.pl
Handle a file of Mail/News.
Requirement: None
man.pl
Handle a man file.
Requirement: nroff, groff or jgroff
Note: To handle Japanese man, groff supporting -Tnippon is required.
mhonarc.pl
Handle a MHonArc file.
Requirement: None
msword.pl
Handle a Microsoft Word file.
Requirement: wvWare
Suggested software: lv (needed only for Japanese documents)
pdf.pl
Handle a PDF file.
Requirement: pdftotext, a part of xpdf (version 0.91 is suggested).
rfc.pl
Handle an RFC file.
Requirement: None
tex.pl
Handle a TeX file.
Requirement: detex

The following filters are for Windows only.

ichitaro456.pl
Handle a file of Ichitaro (a Japanese word processor) version 4, 5, or 6.
Requirement: JSTXT
Note: JSTXT is a tool for MS-DOS.
oleexcel.pl
Handle a Microsoft Excel file.
Requirement: Microsoft Excel 97
olemsword.pl
Handle a Microsoft Word file.
Requirement: Microsoft Word 97
olepowerpoint.pl
Handle a Microsoft PowerPoint file.
Requirement: Microsoft PowerPoint 97

namazu command

namazu's options

    namazu 2.0.5, a search program of Namazu.
    
    Usage: namazu [options] <query> [index]... 
        -n, --max=NUM        set number of documents shown to NUM.
        -w, --whence=NUM     set first number of documents shown to NUM.
        -l, --list           print results by listing format.
        -s, --short          print results by short format.
            --results=EXT    set NMZ.result.EXT for printing results.
            --late           sort documents in late order.
            --early          sort documents in early order.
            --sort=METHOD    set a sort METHOD (score, date, field:name)
            --ascending      sort in ascending order (default: descending)
        -a, --all            print all results.
        -c, --count          print only number of hits.
        -h, --html           print in HTML format.
        -r, --no-references  do not display reference hit counts.
        -H, --page           print further result links. (nearly meaningless)
        -F, --form           force to print <form> ... </form> region.
        -R, --no-replace     do not replace URI string.
        -U, --no-decode-uri  do not decode URI when printing in a plain format.
        -o, --output=FILE    set the output file name to FILE.
        -f, --config=FILE    set the config file name to FILE.
        -C, --show-config    print current configuration.
        -q, --quiet          do not display extra messages except search results.
        -d, --debug          be debug mode.
        -v, --version        show the version of namazu and exit.
            --help           show this help and exit
    
    Report bugs to <bug-namazu@namazu.org>.

You can specify one or more target indices as command-line arguments ([index]...). If none is given, the default index is used as the target index.

By prefixing a name with +, as in +foo or +bar, you can specify a target index as a path relative to the default index.
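
A minimal Python sketch of how such an argument could be resolved is shown below. The function name is hypothetical and this is not namazu's own implementation; the default index path is the example value used later in this manual.

```python
import posixpath


def resolve_index(arg, default_index):
    """Resolve an index argument; a leading '+' means relative to the default index."""
    if arg.startswith("+"):
        return posixpath.join(default_index, arg[1:])
    return arg


default_index = "/usr/local/var/namazu/index"  # example value, not necessarily yours
print(resolve_index("+foo", default_index))
# /usr/local/var/namazu/index/foo
print(resolve_index("/home/foo/Namazu/foobar", default_index))
# /home/foo/Namazu/foobar
```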

When executed from the command line, namazu outputs query results in a simple text format. The -h option is required to display query results in HTML format.

To display the 21st through 40th hits, pass -n 20 -w 20 on the command line. Note that -w is 20, not 21, in this example: it counts from 0, so it gives the number of hits to skip.
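
The arithmetic behind this example can be written down as a small Python helper. The function name is hypothetical; it only computes the values to pass to -n and -w.

```python
def page_options(first_hit, last_hit):
    """Return (max, whence) for showing hits first_hit..last_hit (1-based, inclusive)."""
    if first_hit < 1 or last_hit < first_hit:
        raise ValueError("expected 1 <= first_hit <= last_hit")
    max_results = last_hit - first_hit + 1  # value for -n / --max
    whence = first_hit - 1                  # value for -w / --whence (hits to skip)
    return max_results, whence


n, w = page_options(21, 40)
print(f"-n {n} -w {w}")  # -n 20 -w 20
```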

namazurc settings

Various settings can be made in namazurc or .namazurc. namazu normally reads configuration files in the following order:

  1. $(sysconfdir)/$(PACKAGE)/namazurc
    (Usually, /usr/local/etc/namazu/namazurc)
  2. ~/.namazurc
  3. The file specified by the -f or --config=FILE option.
    (In the case of CGI, this is .namazurc in the directory where namazu.cgi is stored.)

If more than one configuration file is found, all of them are loaded.

Installation prepares a sample configuration file $(sysconfdir)/$(PACKAGE)/namazurc-sample. You can copy this to $(sysconfdir)/$(PACKAGE)/namazurc or to ~/.namazurc in your home directory.

The setting details are given as comments in namazurc-sample.

Default Index

The default index is the index that is used when no index is specified. The default index follows the rules described below.

In CGI (namazu.cgi), the index selection is given as a path relative to the default index.

namazu.cgi

namazu.cgi installation

namazu.cgi is the CGI program for Namazu. It is installed in the $(libexecdir) directory (usually /usr/local/libexec). To complete the installation, copy namazu.cgi into a CGI directory on your system.

.namazurc settings

If a .namazurc file exists in the directory where namazu.cgi is stored, it is treated as the CGI configuration file. To display Japanese, you need the following setting.

Lang ja

Template files

Template files define display styles of query results in HTML format. The details are described below.

NMZ.head
Header of search results.
NMZ.foot
Footer of search results.
NMZ.body
Description of Namazu's query.
NMZ.tips
Tips on searching.
NMZ.result
Format of search results.

These files are prepared for each language. Files suffixed by .ja are for Japanese.

Form settings

The form is defined in NMZ.head. The CGI variables are as follows:

query
specify a query expression.
max
specify the maximum number of query results to display at once.
result
specify the display style of query results.
sort
specify the sorting routine.
idxname
specify the name of index to search.
subquery
specify the sub-query expression.
whence
specify the position of the first query result to display.
reference
specify whether reference hit counts are displayed or not.
lang
specify language of search results.
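
To show how these variables fit together, the following Python sketch assembles a search URL of the kind a form submission would produce. The CGI location and the function name are hypothetical, and only a few of the variables are covered.

```python
from urllib.parse import urlencode


def build_search_url(cgi_url, query, idxnames=(), max_results=None, whence=None):
    """Build a namazu.cgi URL from some of the CGI variables listed above."""
    params = [("query", query)]
    if max_results is not None:
        params.append(("max", str(max_results)))
    if whence is not None:
        params.append(("whence", str(whence)))
    # One idxname pair per selected index, as a group of checkboxes would send.
    params.extend(("idxname", name) for name in idxnames)
    return cgi_url + "?" + urlencode(params)


url = build_search_url(
    "http://www.example.org/cgi-bin/namazu.cgi",  # hypothetical location
    "Linux and Netscape",
    idxnames=["foo", "bar"],
    max_results=20,
    whence=20,
)
print(url)
```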

Selecting an index

To select an index from the browser, NMZ.head needs the following.


      <strong>Target:</strong>
      <select name="idxname">
      <option selected value="foo">foo
      <option value="bar">bar
      <option value="baz">baz
      </select>

In the above example, you can select a single index from foo, bar, and baz. When foo is selected, Namazu searches the foo index under the default index. If the default index is /usr/local/var/namazu/index, the directories are laid out as follows.


       /
       + usr/
         + local/
           + var/
             + namazu/
               + index/
                 + foo/
                 + bar/
                 + baz/

Selecting multiple indices

To select multiple indices, NMZ.head needs checkboxes.


      <strong>Target</strong>
      <ul>
      <li><input type="checkbox" name="idxname" value="foo">foo
      <li><input type="checkbox" name="idxname" value="bar">bar
      <li><input type="checkbox" name="idxname" value="baz">baz
      </ul>

In the above example, you can select any combination of the foo, bar, and baz indices. The template file specified by the Template directive in namazurc is used when searching. If no file is given in the Template directive, the following rules apply.

Using an auxiliary query

You can set an auxiliary query in addition to the query the user enters. The following example shows a way to limit target pages by URI.


      <strong>Target</strong>
      <select name="subquery">
      <option value="">All
      <option value="+uri:/^http://foo.bar.jp/foo//">foo's pages
      <option value="+uri:/^http://foo.bar.jp/bar//">bar's pages
      <option value="+uri:/^http://foo.bar.jp/baz//">baz's pages
      <option value="+uri:/^http://foo.bar.jp/quux//">quux's pages
      </select>
      <input type="hidden" name="reference" value="off">
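
When there are many sections, option lines like those above can be generated rather than typed by hand. The Python sketch below mirrors the format of the example; the function name is hypothetical, and it neither escapes regular expression metacharacters in the URI nor HTML-escapes the names.

```python
def subquery_options(base_uri, names):
    """Return <option> lines that limit a search to URIs under base_uri + name + '/'."""
    lines = ['<option value="">All']
    for name in names:
        # "+uri:" followed by a /.../ regular expression anchored with ^
        value = "+uri:/^" + base_uri + name + "//"
        lines.append('<option value="' + value + '">' + name + "'s pages")
    return lines


for line in subquery_options("http://foo.bar.jp/", ["foo", "bar"]):
    print(line)
```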

Selecting language of search results

To let users select the language of search results from a Web browser, set the CGI variable lang in NMZ.head as follows:


      <strong>Language:</strong>
      <select name="lang">
      <option selected value="ja">Japanese
      <option value="">English
      </select>

NOTE: The Lang directive in .namazurc takes precedence over the CGI variable lang, so do not set the Lang directive in .namazurc if you want to use the CGI variable lang.

Included tools

bnamazu

bnamazu is a search tool that works with Web browsers. The query results are passed to a Web browser (default: lynx) for browsing. The command-line syntax is as follows.

% bnamazu [-n] [-b browser] [namazu's options] <query> [index]...

The -b option specifies a Web browser. The -n option is valid only with netscape; it opens a new netscape window to display the query results.

nmzgrep

nmzgrep is a search tool that works with the egrep command. nmzgrep runs egrep on the retrieved documents, so you can find the line numbers where the keyword appears. The command-line syntax is as follows.

% nmzgrep [egrep's options] <pattern> [index]...

For example, to search the ~/Namazu/foobar index for foo and apply egrep to the retrieved documents, type

% nmzgrep foo ~/Namazu/foobar
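
The second stage, finding the matching lines inside each retrieved document, can be pictured with the Python sketch below. It is an illustration only: nmzgrep itself calls egrep, whereas this uses Python's re module, and the function name and sample text are made up.

```python
import re


def grep_lines(pattern, text):
    """Return (line_number, line) pairs for the lines that match pattern."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


sample = "bar\nfoo baz\nqux foo\n"  # stands in for one retrieved document
print(grep_lines("foo", sample))  # [(2, 'foo baz'), (3, 'qux foo')]
```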

gcnmz

When an index is updated repeatedly as documents are added and deleted, garbage accumulates in the index files. gcnmz is a tool for garbage collection. The command-line syntax is as follows.

% gcnmz [options] <target>...

To run garbage collection on the index in ~/Namazu/foobar, type

% gcnmz ~/Namazu/foobar

mailutime

mailutime is a tool that sets the timestamps of Mail/News files to the date given in their Date: header. The command-line syntax is as follows.

% mailutime <target>...

To change time stamps of emails stored in ~/Mail/ml/foobar, type

% mailutime ~/Mail/ml/foobar/*
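
The effect of mailutime on a single file can be approximated with the Python sketch below. It is not the tool's own code; the function name and the sample message are made up, and the demonstration runs on a throwaway temporary file.

```python
import os
import tempfile
from email import message_from_binary_file
from email.utils import parsedate_to_datetime


def set_mtime_from_date_header(path):
    """Set the file's timestamps to its Date: header; return the epoch time used."""
    with open(path, "rb") as fh:
        message = message_from_binary_file(fh)
    date_header = message["Date"]
    if date_header is None:
        return None  # no Date: header, leave the file untouched
    timestamp = parsedate_to_datetime(date_header).timestamp()
    os.utime(path, (timestamp, timestamp))  # (access time, modification time)
    return timestamp


# Demonstration on a temporary file containing a made-up message.
with tempfile.NamedTemporaryFile("wb", suffix=".eml", delete=False) as tmp:
    tmp.write(b"Date: Thu, 12 Oct 2000 09:06:54 +0000\nSubject: test\n\nbody\n")
used = set_mtime_from_date_header(tmp.name)
mtime = os.path.getmtime(tmp.name)
os.remove(tmp.name)
print(used, mtime)
```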

vfnmz

vfnmz is a tool to preview search results. The command line options are as follows.

% vfnmz <index> [NMZ.result.foobar]

To preview the index stored in ~/Namazu/foobar, type

% vfnmz ~/Namazu/foobar > foobar.html
% lynx foobar.html

rfnmz

rfnmz is a tool to reconstruct NMZ.field.*.i files. Usage:

% rfnmz <index>

For example, to reconstruct NMZ.field.*.i files in ~/Namazu/foobar index, you can do as follows:

% rfnmz ~/Namazu/foobar

Query

Single term query

A query consisting of a single term retrieves all documents that contain the term. e.g.,

namazu

AND query

A query of two or more terms joined by the and operator retrieves all documents that contain all of the terms. e.g.,

Linux and Netscape

You can omit the and operator. Terms separated by one or more spaces are treated as an AND query.

OR query

A query of two or more terms joined by the or operator retrieves all documents that contain at least one of the terms. e.g.,

Linux or FreeBSD

NOT query

A query of two or more terms joined by the not operator retrieves all documents that contain the first term but do not contain the following terms. e.g.,

Linux not UNIX
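
The AND, OR, and NOT queries correspond to set intersection, union, and difference over the documents that contain each term. The Python sketch below shows this with made-up data; it illustrates the semantics only and is not how Namazu evaluates queries internally.

```python
# Hypothetical postings: term -> set of documents containing it.
postings = {
    "Linux": {"doc1", "doc2", "doc3"},
    "Netscape": {"doc2", "doc4"},
    "FreeBSD": {"doc4", "doc5"},
    "UNIX": {"doc3"},
}


def docs(term):
    """Return the documents containing term (empty set if the term is unknown)."""
    return postings.get(term, set())


and_hits = docs("Linux") & docs("Netscape")  # Linux and Netscape
or_hits = docs("Linux") | docs("FreeBSD")    # Linux or FreeBSD
not_hits = docs("Linux") - docs("UNIX")      # Linux not UNIX

print(sorted(and_hits))  # ['doc2']
print(sorted(or_hits))   # ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
print(sorted(not_hits))  # ['doc1', 'doc2']
```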

Grouping

You can group queries by surrounding them with parentheses. Each parenthesis must be separated from the adjacent terms by one or more spaces. e.g.,

( Linux or FreeBSD ) and Netscape not Windows

Phrase searching

You can search for a phrase consisting of two or more terms by surrounding it with double quotes ("...") or braces ({...}). Phrase searching in Namazu is not 100% precise, so it occasionally returns wrong results. e.g.,

{GNU Emacs}

Substring matching

There are three types of substring matching.

Prefix matching
inter* (terms which begin with inter)
Inside matching
*text* (terms which contain text)
Suffix matching
*net (terms which end with net)
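
The three pattern forms can be expressed as a short Python check. This is a sketch of the matching rules as described above, with a hypothetical function name and a made-up term list; it does not reproduce how Namazu looks terms up in its index.

```python
def matches(pattern, term):
    """Check a term against a prefix (inter*), inside (*text*), or suffix (*net) pattern."""
    if len(pattern) > 2 and pattern.startswith("*") and pattern.endswith("*"):
        return pattern[1:-1] in term          # inside matching
    if pattern.endswith("*"):
        return term.startswith(pattern[:-1])  # prefix matching
    if pattern.startswith("*"):
        return term.endswith(pattern[1:])     # suffix matching
    return term == pattern                    # no wildcard: exact match


terms = ["internet", "interface", "context", "textbook", "net"]
print([t for t in terms if matches("inter*", t)])  # ['internet', 'interface']
print([t for t in terms if matches("*text*", t)])  # ['context', 'textbook']
print([t for t in terms if matches("*net", t)])    # ['internet', 'net']
```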

Regular expressions

You can use regular expressions for pattern matching. The regular expression must be surrounded by slashes like /.../. Namazu uses Ruby's regular expression engine, which offers a generally Perl-compatible flavor. e.g.,

/pro(gram|blem)s?/
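
To see which terms this pattern covers, it can be tried in Python. Python's re module stands in for Namazu's engine here, and matching whole terms with fullmatch is an assumption made for this illustration; the term list is made up.

```python
import re

# The pattern from the example above, without the surrounding slashes.
pattern = re.compile(r"pro(gram|blem)s?")

terms = ["program", "programs", "problem", "problems", "protein"]
hits = [term for term in terms if pattern.fullmatch(term)]
print(hits)  # ['program', 'programs', 'problem', 'problems']
```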

Field-specified searching

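You can limit your search to specific fields such as Subject:, From:, and Message-Id:. This is especially convenient for Mail/News documents. A field-specified term is prefixed with + and the field name, as in the +uri: sub-queries shown under "Using an auxiliary query".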

Notes


Namazu Homepage

$Id: manual.html,v 1.37 2000/10/12 09:06:54 rug Exp $
developers@namazu.org