On NCBI BLAST FTP Site ftp.ncbi.nlm.nih.gov/blast/
Tao Tao, Ph.D.
August 15, 2005
1. Introduction
BLAST sequence analysis is one of the services provided by NCBI. In addition to the well-known web servers, NCBI also provides standalone BLAST programs, their technical documents, and the commonly used databases through its ftp site at ftp.ncbi.nlm.nih.gov/blast/. This document lists the subdirectories and files found on this BLAST ftp site and provides the basic information on the file content and how those files should be used.
2. List of subdirectories under ftp.ncbi.nlm.nih.gov/blast/
The blast ftp site contains one document and six subdirectories. Their names and contents are listed in the table below.
Table 2. Subdirectories under ftp.ncbi.nlm.nih.gov/blast/
Name | Description
blastftp.html | Document on blast ftp site (this file)
db | Database subdirectory in preformatted or FASTA format
demo | Demonstration programs and documents from blast developers
documents | Documents on standalone, client, server blast programs
executables | Archives for binary distribution of blast programs for most common computer platforms
matrices | Protein and nucleotide scoring matrices for blast, only a subset is supported by blast
Temp | Subdirectory of miscellaneous files (currently empty)
2.1 File content for the ftp.ncbi.nlm.nih.gov/blast/db/ subdirectory
This directory provides commonly used BLAST databases that are accessible by the Nucleotide, Protein, and Translated BLAST search pages. Those databases are provided in preformatted as well as FASTA formats. We strongly recommend that our users use the preformatted databases.
Databases larger than one gigabyte (1 GB) are formatted in multiple volumes, each one gigabyte or less. Volumes belonging to the same master database are named using the “database.##.tar.gz” convention. All volumes are required to reconstitute the complete database. An alias file is provided to tie the volumes together. The database can be called using the alias name without the .nal or .pal extension. For example, to call the est database, use the “-d est” option (without the quotes) in the command line.
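The volume naming convention above lends itself to a quick completeness check before a database is set up. The following is a minimal sketch in Python, not an NCBI tool. It assumes volume numbers are contiguous, and the file listing is hypothetical. The check can only detect gaps between the lowest and highest volume present, so a missing final volume would go unnoticed.

```python
import re

# Volume archives follow the "database.##.tar.gz" convention, where ## are digits.
VOLUME_RE = re.compile(r"^(?P<db>[A-Za-z0-9_]+)\.(?P<vol>\d+)\.tar\.gz$")

def group_volumes(filenames):
    """Map each multi-volume database name to its sorted volume numbers."""
    volumes = {}
    for name in filenames:
        match = VOLUME_RE.match(name)
        if match:
            volumes.setdefault(match.group("db"), []).append(int(match.group("vol")))
    return {db: sorted(nums) for db, nums in volumes.items()}

def missing_volumes(nums):
    """Return volume numbers absent between the lowest and highest one seen."""
    return sorted(set(range(min(nums), max(nums) + 1)) - set(nums))

# Hypothetical download directory listing; nr.tar.gz is a single-volume archive.
listing = ["est.00.tar.gz", "est.01.tar.gz", "est.03.tar.gz", "nr.tar.gz"]
grouped = group_volumes(listing)
print(grouped)                          # {'est': [0, 1, 3]}
print(missing_volumes(grouped["est"]))  # [2]
```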
Certain databases are subsets of a larger parent database. Those databases are provided as mask files, rather than actual databases. A mask file needs its parent database to function properly, and the parent database should be generated on the same day as the mask file. For example, to use the swissprot preformatted database, swissprot.tar.gz, users will need to get nr.tar.gz with the same date stamp.
To use a preformatted blast database file, first inflate the file using gzip (Unix/Linux), WinZip (Windows), or StuffIt Expander (Mac). The actual database files can then be extracted from the resulting tar archive using tar (Unix/Linux), WinZip (Windows), or StuffIt Expander (Mac). The resulting database files are ready for BLAST. More detailed information is in blastdb.html under the /documents subdirectory. File contents for this directory are described in the table below.
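For users who prefer a scripted alternative to the tools named above, the inflate and extract steps can be combined. This is a minimal sketch using Python's standard tarfile module. The function name extract_database and the destination directory are illustrative choices, not part of the BLAST distribution.

```python
import tarfile
from pathlib import Path

def extract_database(archive_path, dest_dir):
    """Inflate and unpack a .tar.gz archive in one step; return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" reads through the gzip layer and the tar layer together.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names
```

A call such as extract_database("swissprot.tar.gz", "blastdb") would leave the database files in a blastdb directory, assuming the archive has already been downloaded to the working directory.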
Table 2.1 File content for the ftp.ncbi.nlm.nih.gov/blast/db/ subdirectory
Name 1 | Content
FASTA | Subdirectory with databases in FASTA format
blastdb.html | Content description of blast databases
env_nr.tar.gz | CDS translation of nucleotide sequences from environmental samples
env_nt.tar.gz | Nucleotide sequences from environmental samples
est.##.tar.gz | Volumes for the est database. All are needed to reconstitute the complete est database.
est_human.tar.gz | Human est database mask file, requires the setup of all volumes of est database
est_mouse.tar.gz | Mouse est database mask file, requires the setup of all volumes of est database
est_others.tar.gz | Mask file for non-human, non-mouse subset of est database, requires the setup of all volumes of est database
gss.##.tar.gz | Volumes for the Genomic Survey Sequence database
htgs.##.tar.gz | Volumes for the High Throughput Genomic Sequences database, all volumes are needed to reconstitute complete htgs database
human_genomic.tar.gz | Human chromosome database with 24 chromosomes, each containing concatenated contigs with N’s-adjusted gaps
nr.tar.gz | Non-redundant protein database
nt.##.tar.gz | Volumes for nucleotide nr database (not non-redundant), all volumes are needed to reconstitute the database
other_genomic.tar.gz | Chromosome databases for organisms other than human
pataa.tar.gz | Patent protein sequence database
patnt.tar.gz | Patent nucleotide sequence database
pdbaa.tar.gz | Mask file for protein sequence for pdb entries, requires the setup of nr database
pdbnt.tar.gz | Database for nucleotide sequences from pdb entries. They are not coding sequences for the corresponding protein structure entries!
refseq_genomic.##.tar.gz | Genomic entries from the NCBI Reference Sequence Project, requires all volumes
refseq_protein.tar.gz | Protein entries from the NCBI Reference Sequence Project
refseq_rna.tar.gz | RNA entries from the NCBI Reference Sequence Project
sts.tar.gz | Database for Sequence Tag Site entries
swissprot.tar.gz | Mask file for swissprot entries, requires the setup of protein nr database
taxdb.tar.gz | Taxonomy id database for use with preformatted databases
wgs.##.tar.gz | Volumes for Whole Genome Shotgun assemblies of various organisms, requires all volumes for proper reconstitution of the database
NOTE: 1 ## are digits representing individual volumes.
2.1.1 File content for the ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ subdirectory
The FASTA database files are now stored in this subdirectory. It also contains some additional databases not available from the NCBI BLAST pages. Due to file size issues, the full est database is not provided. Users need to get the three subsets and concatenate them together to get the complete est database.
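The concatenation of the three est subsets can be scripted. The sketch below is one possible approach in Python, assuming the three files have been downloaded to the working directory and that each decompressed file ends with a newline. The output file name passed to the function is the caller's choice.

```python
import gzip
import shutil

# The three subsets that together make up the complete est database.
EST_SUBSETS = ["est_human.gz", "est_mouse.gz", "est_others.gz"]

def concatenate_fasta(gz_paths, output_path):
    """Decompress each gzip file in turn and append it to one FASTA file."""
    with open(output_path, "wb") as out:
        for path in gz_paths:
            with gzip.open(path, "rb") as part:
                # Stream the data so large files are never held in memory.
                shutil.copyfileobj(part, out)
```

Calling concatenate_fasta(EST_SUBSETS, "est") would write the combined, uncompressed FASTA file as est.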
To use those databases with BLAST, the files need to be formatted using the formatdb program found in the standalone blast executable package, i.e., the archives whose names begin with blast. The recommended command lines are:
formatdb -i input_db -p F -o T for nucleotide
formatdb -i input_db -p T -o T for protein
For additional information on formatdb, please see formatdb.html under the /blast/documents/ subdirectory.
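When many FASTA files need formatting, the two recommended command lines can be generated from a script. This is a minimal Python sketch that uses only the -i, -p, and -o options shown above. The helper names are illustrative, and run_formatdb assumes the formatdb executable is on the PATH.

```python
import subprocess

def formatdb_command(fasta_path, is_protein):
    """Build the recommended formatdb argument list for one FASTA file."""
    return [
        "formatdb",
        "-i", fasta_path,                  # input FASTA file
        "-p", "T" if is_protein else "F",  # T for protein, F for nucleotide
        "-o", "T",
    ]

def run_formatdb(fasta_path, is_protein):
    """Run formatdb and raise an error if it exits with a non-zero status."""
    subprocess.run(formatdb_command(fasta_path, is_protein), check=True)

print(" ".join(formatdb_command("input_db", is_protein=False)))
# formatdb -i input_db -p F -o T
```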
Table 2.1.1 File content for the ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ subdirectory
Name | Content
alu.a.gz | proteins translated from alu.n 1
alu.n.gz | alu repeat sequences 1
drosoph.aa.gz | Drosophila protein from genome annotation 1
drosoph.nt.gz | Drosophila genome 1
ecoli.aa.gz | E.coli K-12 proteins from genome annotation 1
ecoli.nt.gz | E.coli K-12 genomic contigs 1
env_nr.gz | Protein sequences from environmental samples
env_nt.gz | Nucleotide sequences from environmental samples
est_human.gz | human subset of the est database
est_mouse.gz | mouse subset of the est database
est_others.gz | Non-human, non-mouse subset of the est database
gss.gz | Genomic Survey Sequences (mostly BAC ends)
htgs.gz | High Throughput Genomic Sequences
human_genomic.gz | Human chromosomes (NC_######) formed by concatenation of genomic contig assemblies (NT_######) and adjusting the gaps with N’s
igSeqNt.gz | Immunoglobulin nucleotide sequences
igSeqProt.gz | Immunoglobulin protein sequences
mito.aa.gz | protein from the annotated mitochondrial genomes 1
mito.nt.gz | mitochondrial genomes 1
month.aa.gz | Protein sequences released/updated in the past 30 days
month.est_human.gz | human subset of EST released/updated in the past 30 days
month.est_mouse.gz | mouse subset of EST released/updated in the past 30 days
month.est_others.gz | Non-human, non-mouse subset of EST released/updated in the past 30 days
month.gss.gz | gss entries released/updated in the past 30 days
month.htgs.gz | htgs entries released/updated in the past 30 days
month.nt.gz | nt sequences released/updated in the past 30 days
nr.gz | non-redundant protein sequence database
nt.gz | nucleotide database from GenBank excluding htgs, est, gss, sts, pat divisions, and wgs entries. Not non-redundant.
other_genomic.gz | Chromosome entries other than human
pataa.gz | Patent protein sequence database
patnt.gz | Patent nucleotide sequence database
pdbaa.gz | protein sequences from pdb entries
pdbnt.gz | nucleotide entries for pdb entries. They are not the coding sequences for the corresponding protein entries.
sts.gz | Sequence Tag Sites database
swissprot.gz | swissprot database
vector.gz | vector sequences from synthetic (syn) division of GenBank 1
wgs.gz | Whole Genome Shotgun sequence assembly
yeast.aa.gz | protein translations from yeast genome annotation 1
 | Yeast genomic sequence 1
NOTE: 1 These files are not updated regularly.
2.2 File content for the ftp.ncbi.nlm.nih.gov/blast/demo/ subdirectory
This directory contains talks and posters that NCBI BLAST developers presented at various conferences, technical documentation of special functions relevant to BLAST, and some demo tools for BLAST. The target audience is programmers and power users.
Table 2.2 File content for the ftp.ncbi.nlm.nih.gov/blast/demo/ subdirectory
Name | Content
README.blast_demo | readme for blast_demo package
README.first | readme for this directory
README.parse_blast_xml | readme for parse_blast_xml package
benchmark | Subdirectory with package for benchmarking BLAST performance. Files in the directory are listed below.
BLAST_benchmarks.ppt 1 | document on the benchmark package
benchmark.tar.gz 1 | Benchmark package for assessing BLAST performance
blast_demo.tar.gz | blast_demo package on blast db, blastobj, and reformatting blast alignment from blastobj file
blast_exercises.doc | exercise set with sample questions and answers
blast_programming.ppt | PowerPoint presentation on BLAST programming
blast_talk.ppt | PowerPoint presentation, 2002 O'Reilly conference
ieee_blast.final.ppt | PowerPoint presentation, 2003 IEEE conference
ieee_talk.pdf | Above IEEE presentation in PDF format
parse_blast_xml.tar.gz | demo package on parsing xml styled blast output
splitd.ppt | PowerPoint presentation on splitd, a distributed computing setup implemented here at NCBI
test_suite.tar.gz | test package (??)
NOTE: 1 indicates files under the preceding subdirectory.
2.3 File content for the ftp.ncbi.nlm.nih.gov/blast/documents/ subdirectory
This directory contains documents on the programs found in the binary packages that NCBI distributes from the /blast/executables/ subdirectory of the BLAST ftp site. Documents relevant to the standalone blast programs are also packaged in the binary distribution and can be found in the /doc subdirectory once the standalone blast archive is extracted.
Table 2.3 File content for the ftp.ncbi.nlm.nih.gov/blast/documents/ subdirectory
Name | Content
bl2seq.html | List of the program command line options
blast-sc2004.pdf | Poster presentation on splitd system implementation for BLAST server here at NCBI
blast.html | Setup/installation information for standalone blast package
blastall.html | Core command line program options and feature description on blastall program
blastclust.html | Description and list of command line program options for blastclust
blastdb.html | Document on blast databases under ftp.ncbi.nlm.nih.gov/blast/db/
blastftp.html | General description of NCBI blast ftp site (this file)
blastpgp.html | Document on blastpgp (standalone PSI-BLAST)
developer | Subdirectory for documents describing specific C functions used by BLAST. Files in this subdirectory are listed below.
blast_seqalign.txt | A short description of the different types of seqalign generated by blast
readdb.txt 1 | A short document on the readdb function
scoring.pdf 1 | A comprehensive document on BLAST score and statistics
urlapi.txt 1 | A document on blasturl and its replacement URLAPI
fastacmd.html | Document on fastacmd, a sequence retrieval/fasta sequence dump program
filter.html | Document on the low complexity filter, the accepted inputs, and their functions
formatdb.html | Document on formatdb, a program used to format FASTA input sequence file into blastable database
formatrpsdb.html | Document on formatrpsdb, a program used to format rpsblast databases from blastpgp output
history.html | Document on the changes and bug fixes for the past blast releases
impala.html | Document on impala, a rpsblast like domain search tool
index.html | A list of the available documents under this directory
megablast.html | Command line program options for megablast
netblast.html | Document on netblast setup, command line options, and databases available for search remotely from NCBI
rpsblast.html | Document on the rpsblast program. Its information on how to generate an rpsblast database is outdated; refer to formatrpsdb.html instead.
seedtop.html | Detailed setup and program options for seedtop, a pattern matching program from NCBI
web_blast.pl | A sample Perl Script for running BLAST searches using URLAPI
Xml | Subdirectory with .dtd and .mod files for use with blast xml output. Files in this subdirectory are listed below.
NCBI_BlastOutput.dtd 1 | dtd file for blast output
NCBI_BlastOutput.mod 1 | mod file for blast output
NCBI_Entity.mod 1 | mod file for NCBI XML files
README.blxml 1 | Documentation on blast XML output
NOTE: 1 indicates files under the preceding subdirectory.
2.4 File content for the ftp.ncbi.nlm.nih.gov/blast/executables/ subdirectory
This directory contains several subdirectories, each of which links to a specific set of executable BLAST programs.
Table 2.4 File content of the ftp.ncbi.nlm.nih.gov/blast/executables/ subdirectory
Name | Content
LATEST | This always links to the latest binary, official release or the latest snapshot with interim bug fix(es) or feature enhancement(s)
release 1 | This subdirectory contains the archives of past official releases dating back to release 2.0.7, each within its own subdirectory
Snapshot 1 | Interim bug-fixed and/or feature-enhanced binaries in-between official releases
NOTE: 1 Contents of the release and snapshot subdirectories will not be listed.
2.4.1 File content of the ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ subdirectory
This subdirectory contains the latest BLAST binaries for common platforms. They are either the latest official release or the latest interim release with bug fixes and/or important feature enhancements. There are three groups of binaries: the standalone command line package, with file names beginning with blast; the client blast package, with file names beginning with netblast; and the local web server blast package, with file names beginning with wwwblast.
The package naming convention is best demonstrated by the following example. From left to right, the hyphen-separated fields are binary type, version, chipset, and OS, followed by the file extension:
blast-2.2.11-ia32-linux.tar.gz
The current version is 2.2.11. All the archives under this directory are listed in the table below, with the 2.2.11 version number replaced by #.#.## to keep the names general.
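The naming convention can be decoded mechanically, which is convenient when a script has to pick the right archive for a platform. The following Python sketch is one way to do this. The regular expression is an assumption based on the archive names listed in this document, and the field names in the result are illustrative.

```python
import re

# binary type - version - chipset - OS . extension
ARCHIVE_RE = re.compile(
    r"^(?P<binary_type>blast|netblast|wwwblast)"
    r"-(?P<version>\d+\.\d+\.\d+)"
    r"-(?P<chipset>[a-z0-9]+)"
    r"-(?P<os>[a-z0-9]+)"
    r"\.(?P<extension>tar\.gz|exe)$"
)

def parse_archive_name(name):
    """Split an archive file name into its fields; raise ValueError on a mismatch."""
    match = ARCHIVE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized archive name: {name}")
    return match.groupdict()

fields = parse_archive_name("blast-2.2.11-ia32-linux.tar.gz")
print(fields["binary_type"], fields["version"], fields["chipset"], fields["os"])
# blast 2.2.11 ia32 linux
```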
Table 2.4.1 File content for the ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ subdirectory
Type | Archive Name | Target Platform (Chipset/OS)
Command line | blast-#.#.##-axp64-tru64.tar.gz | Compaq/HP alpha running OSF/Tru64
Command line | blast-#.#.##-ia32-freebsd.tar.gz | Intel Pentium PC running FreeBSD
Command line | blast-#.#.##-ia32-linux.tar.gz | Intel Pentium PC running Linux
Command line | blast-#.#.##-ia32-solaris.tar.gz | Intel Pentium PC running Solaris
Command line | blast-#.#.##-ia32-win32.exe | Intel Pentium PC running Windows
Command line | blast-#.#.##-ia64-linux.tar.gz | Intel 64-bit processor PC running Linux
Command line | blast-#.#.##-mips64-irix.tar.gz | SGI 64-bit
Command line | blast-#.#.##-ppc32-macosx.tar.gz | MacOSX, Terminal (backend BSD Unix)
Command line | blast-#.#.##-sparc64-solaris.tar.gz | Sun Sparc station running Solaris
Command line | blast-#.#.##-x64-linux.tar.gz | Linux 64-bit system
Command line | blast-#.#.##-x64-solaris10.tar.gz | 64-bit system running Solaris 10
Client | netblast-#.#.##-axp64-tru64.tar.gz | Compaq/HP alpha machine running OSF/Tru64
Client | netblast-#.#.##-ia32-freebsd.tar.gz | Intel Pentium PC running FreeBSD
Client | netblast-#.#.##-ia32-linux.tar.gz | Intel Pentium PC running Linux
Client | netblast-#.#.##-ia32-solaris.tar.gz | Intel Pentium PC running Solaris
Client | netblast-#.#.##-ia32-win32.exe | Intel Pentium PC running Windows
Client | netblast-#.#.##-ia64-linux.tar.gz | Intel 64-bit processor PC running Linux
Client | netblast-#.#.##-mips64-irix.tar.gz | SGI 64-bit
Client | netblast-#.#.##-ppc32-macosx.tar.gz | MacOSX, Terminal (backend BSD Unix)
Client | netblast-#.#.##-sparc64-solaris.tar.gz | Sun Sparc station running Solaris
Client | netblast-#.#.##-x64-linux.tar.gz | Linux 64-bit system
Client | netblast-#.#.##-x64-solaris10.tar.gz | 64-bit system running Solaris 10
Web server | wwwblast-#.#.##-axp64-tru64.tar.gz | Compaq/HP alpha machine running OSF/Tru64
Web server | wwwblast-#.#.##-ia32-freebsd.tar.gz | Intel Pentium PC running FreeBSD
Web server | wwwblast-#.#.##-ia32-linux.tar.gz | Intel Pentium PC running Linux
Web server | wwwblast-#.#.##-ia32-solaris.tar.gz | Intel Pentium PC running Solaris
Web server | wwwblast-#.#.##-ia64-linux.tar.gz | Intel 64-bit processor PC running Linux
Web server | wwwblast-#.#.##-mips64-irix.tar.gz | SGI 64-bit
Web server | wwwblast-#.#.##-ppc32-macosx.tar.gz | MacOSX, Terminal (backend BSD Unix)
Web server | wwwblast-#.#.##-sparc64-solaris.tar.gz | Sun Sparc station running Solaris
Web server | wwwblast-#.#.##-x64-linux.tar.gz | Linux 64-bit system
Web server | wwwblast-#.#.##-x64-solaris10.tar.gz | 64-bit system running Solaris 10
NOTE: Three types of binaries are available: command line, client, and web server. They are grouped accordingly in the above table. The command line archive is for setting up blast on a user's local machine, with tools for preparing databases and running all types of searches locally. The client archive is for configuring blast searches locally and sending them over the internet to NCBI; only common blast searches are available, and it is batch capable. The web server archive is for setting up blast web pages locally under an existing web server (such as Apache) and running blast searches through a graphical user interface (web page).
2.6 File content of ftp.ncbi.nlm.nih.gov/blast/matrices/ subdirectory
This directory contains the BLOSUM and PAM protein scoring matrices. Although all of the protein matrices listed can be used by BLAST, BLAST provides statistical support for only five of them. These five matrices are marked with 1 in the table below.
Table 2.6 File content of the ftp.ncbi.nlm.nih.gov/blast/matrices/ subdirectory
BLOSUM Family of Matrices (first two columns) | PAM Family of Matrices (last three columns)
BLOSUM30 | BLOSUM65 | PAM10 | PAM160.cdi | PAM330
BLOSUM30.50 | BLOSUM65.50 | PAM20 | PAM170 | PAM340
BLOSUM35 | BLOSUM70 | PAM30 1 | PAM180 | PAM350
BLOSUM35.50 | BLOSUM70.50 | PAM40 | PAM190 | PAM360
BLOSUM40 | BLOSUM75 | PAM40.cdi | PAM200 | PAM370
BLOSUM40.50 | BLOSUM75.50 | PAM50 | PAM200.cdi | PAM380
BLOSUM45 1 | BLOSUM80 1 | PAM60 | PAM210 | PAM390
BLOSUM45.50 | BLOSUM80.50 | PAM70 1 | PAM220 | PAM400
BLOSUM50 | BLOSUM85 | PAM80 | PAM230 | PAM410
BLOSUM50.50 | BLOSUM85.50 | PAM80.cdi | PAM240 | PAM420
BLOSUM55 | BLOSUM90 | PAM90 | PAM250 | PAM430
BLOSUM55.50 | BLOSUM90.50 | PAM100 | PAM250.cdi | PAM440
BLOSUM60 | BLOSUM100 | PAM110 | PAM260 | PAM450
BLOSUM60.50 | BLOSUM100.50 | PAM120 | PAM270 | PAM460
BLOSUM62 1 | BLOSUMN | PAM120.cdi | PAM280 | PAM470
BLOSUM62.50 | BLOSUMN.50 | PAM130 | PAM290 | PAM480
DAYHOFF | GONNET | PAM140 | PAM300 | PAM490
IDENTITY | MATCH | PAM150 | PAM310 | PAM500
NUC.4.2 | NUC.4.4 | PAM160 | PAM320 |
NOTE: 1 The five matrices statistically supported by BLAST. The directory also holds special protein matrices and two nucleotide matrices (NUC.4.2 and NUC.4.4).
3. Technical Support
Additional questions and/or comments on this ftp site as well as this document should be directed to NCBI User Service:
blast-help@ncbi.nlm.nih.gov
Questions on general NCBI resources should be directed to:
info@ncbi.nlm.nih.gov