The FASTA specification defines a sequence file only as a header line beginning with > followed by lines containing the sequence. The header line can appear in an almost unlimited number of formats, several of which EMBOSS can process. EMBOSS attempts to determine the accession number and/or ID for each sequence. For indexing purposes there is no semantic difference between an accession number and an ID. In practice, accession numbers are immutable, i.e. they do not change with subsequent releases of the database, whereas IDs may change. In either case, IDs and accession numbers are unique, and that is all that matters for EMBOSS database indexing.
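As an illustration of how little the format itself prescribes, the following is a minimal Python sketch of a FASTA reader. It is not part of EMBOSS; the function name read_fasta is hypothetical, and the sketch assumes only what is stated above: a header line starting with > followed by one or more sequence lines.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration; the header is returned
    without the leading '>' and is not interpreted in any way.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated; surrounding blanks are dropped.
            chunks.append(line.strip())
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)
```

The reader accepts any iterable of lines, so an open file handle can be passed directly. Everything after the > is left untouched, because working out the ID and accession number depends on the header format.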
The program used to process FASTA format databases is DBIFASTA. It can recognise the following header line formats:
Other header formats will not be recognised by DBIFASTA and will cause indexing and/or database lookup to fail. If you have a different header format that DBIFASTA cannot yet handle you have two options:
To index a FASTA format database, run DBIFASTA.
% dbifasta
Index a fasta database
    simple : >ID
     idacc : >ID ACC
     gcgid : >db:ID
  gcgidacc : >db:ID ACC
      ncbi : >blah|...[|ACC]|ID
ID line format [idacc]:
Database name: mydb
Database directory [.]:
Wildcard database filename [*.dat]: mydb.fasta
Release number [0.0]:
Index date [00/00/00]:
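To make the five ID line formats offered by the DBIFASTA prompt more concrete, here is a rough Python sketch that extracts an ID and an optional accession number from a header line. It is an approximation for illustration and not DBIFASTA's actual parsing code. The function name parse_header is hypothetical, and for the ncbi format the sketch assumes that the ID is the last |-separated field and that the field before it is the accession number whenever at least three fields are present.

```python
from typing import Optional, Tuple


def parse_header(header: str, fmt: str) -> Tuple[str, Optional[str]]:
    """Return (ID, ACC) for a FASTA header under one of the listed formats.

    Hypothetical illustration only; ACC is None when the format
    carries no accession number or none is present on the line.
    """
    tokens = header.lstrip(">").split()
    if not tokens:
        raise ValueError("empty header line")
    first = tokens[0]
    second = tokens[1] if len(tokens) > 1 else None

    if fmt == "simple":        # >ID
        return first, None
    if fmt == "idacc":         # >ID ACC
        return first, second
    if fmt in ("gcgid", "gcgidacc"):   # >db:ID or >db:ID ACC
        _db, sep, seq_id = first.partition(":")
        if not sep or not seq_id:
            raise ValueError("expected db:ID in header: " + header)
        return seq_id, (second if fmt == "gcgidacc" else None)
    if fmt == "ncbi":          # >blah|...[|ACC]|ID
        fields = first.split("|")
        if len(fields) < 2:
            raise ValueError("expected |-separated fields in header: " + header)
        # Assumption: last field is the ID, penultimate field is the ACC
        # when there are at least three fields.
        acc = fields[-2] if len(fields) >= 3 else None
        return fields[-1], acc
    raise ValueError("unknown ID line format: " + fmt)
```

For example, parse_header(">sw:HBA_HUMAN P69905", "gcgidacc") returns ("HBA_HUMAN", "P69905"). Running a check of this kind over a few headers before indexing can help confirm which ID line format to choose at the prompt.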
DBIFASTA will run for a short while and then produce the index files. You can use the same indexdir options as for DBIFLAT, DBIGCG and DBIBLAST to place the index files in a different directory.
Place the following entry in your .embossrc:
DB mydb [
   type: P
   method: emblcd
   format: fasta
   dir: $emboss_db_dir/mydb
   file: mydb.fasta
   comment: "My database"
]
The format: value should be fasta, ncbi or dbid; it is not necessarily the same as the ID line format you specified when running DBIFASTA. The same file: and include: tags can be used as for the other database indexing programs.