On NCBI BLAST FTP Site ftp.ncbi.nlm.nih.gov/blast/ 
 
Tao Tao, Ph.D.
August 15, 2005
 
 
1. Introduction
 
BLAST sequence analysis is one of the services provided by NCBI. In addition to the well-known web servers, NCBI also provides standalone BLAST programs, their technical documents, and the commonly used databases through its ftp site at ftp.ncbi.nlm.nih.gov/blast/. This document lists the subdirectories and files found on this BLAST ftp site and provides the basic information on the file content and how those files should be used. 
 
2. List of subdirectories under ftp.ncbi.nlm.nih.gov/blast/
 
There are one document and six subdirectories under the blast ftp site. Their names and contents are listed in the table below.
 

Table 2. Subdirectories under ftp.ncbi.nlm.nih.gov/blast

Name            Description
blastftp.html   Document on blast ftp site (this file)
db              Database subdirectory in preformatted or FASTA format
demo            Demonstration programs and documents from blast developers
documents       Documents on standalone, client, server blast programs
executables     Archives for binary distribution of blast programs for most common computer platforms
matrices        Protein and nucleotide scoring matrices for blast, only a subset is supported by blast
Temp            Subdirectory of miscellaneous files (currently empty)
 
 
2.1 File content for the ftp.ncbi.nlm.nih.gov/blast/db/ subdirectory
 
This directory provides commonly used BLAST databases that are accessible by the Nucleotide, Protein, and Translated BLAST search pages. Those databases are provided in preformatted as well as FASTA formats. We strongly recommend that our users use the preformatted databases.
 
Databases larger than one gigabyte (1 GB) are formatted in multiple volumes, each one gigabyte or less. Volumes belonging to the same master database are named using the “database.##.tar.gz” convention. All volumes are required to reconstitute the complete database. An alias file is provided to tie the volumes together. The database can be called using the alias name without the .nal or .pal extension. For example, to search the est database, use the “-d est” option (without the quotes) on the command line.
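
As an illustration of the volume naming convention, the following Python sketch groups archive names by their master database and collects the volume numbers found for each. The function name, the regular expression, and the sample listing are assumptions made for this example; they are not part of any NCBI tool.

```python
import re
from collections import defaultdict

# Hypothetical pattern: "est.00.tar.gz" is one volume of est,
# "nr.tar.gz" is a database shipped as a single archive.
ARCHIVE_RE = re.compile(r"^(?P<db>[A-Za-z0-9_]+)(?:\.(?P<vol>\d+))?\.tar\.gz$")

def group_volumes(filenames):
    """Map each database name to the sorted list of its volume numbers."""
    groups = defaultdict(list)
    for name in filenames:
        match = ARCHIVE_RE.match(name)
        if match is None:
            continue  # not a preformatted database archive
        volume = match.group("vol")
        if volume is None:
            groups.setdefault(match.group("db"), [])  # single-archive database
        else:
            groups[match.group("db")].append(int(volume))
    return {db: sorted(vols) for db, vols in groups.items()}

# Made-up directory listing, for illustration only.
listing = ["est.01.tar.gz", "est.00.tar.gz", "nr.tar.gz", "blastdb.html"]
print(group_volumes(listing))
```

A gap in the volume numbers returned for a database would suggest that a volume is still missing from the local copy.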
 
Certain databases are subsets of a larger parent database. Those databases are provided as mask files, rather than actual databases. A mask file needs its parent database to function properly, and the parent database should be generated on the same day as the mask file. For example, to use the preformatted swissprot database, swissprot.tar.gz, users will need to get nr.tar.gz with the same date stamp.
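
One way to check the date stamp requirement is to compare the modification times the ftp server reports for the two archives. The sketch below assumes the server answers the standard FTP MDTM command with a reply of the form "213 YYYYMMDDHHMMSS" (such a reply can be requested with Python's ftplib via FTP.sendcmd("MDTM <path>")). The function names and the sample replies are hypothetical.

```python
from datetime import datetime

def stamp_date(mdtm_reply):
    """Extract the calendar date from a reply such as '213 20050815093000'."""
    timestamp = mdtm_reply.split()[1][:14]  # drop the reply code and any fraction
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S").date()

def same_date_stamp(mask_reply, parent_reply):
    """True when the mask file and its parent database share a date stamp."""
    return stamp_date(mask_reply) == stamp_date(parent_reply)

# Made-up replies, for illustration only.
print(same_date_stamp("213 20050815093000", "213 20050815221500"))
```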
 
To use a preformatted blast database file, first inflate the file using gzip (unix/linux), WinZip (Windows), or StuffIt Expander (Mac). The actual database files can then be extracted from the resulting tar archive using tar (unix/linux), WinZip (Windows), or StuffIt Expander (Mac). The resulting database files are ready for BLAST. More detailed information is in blastdb.html under the /documents subdirectory. File contents for this directory are described in the table below.
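
For users who prefer scripting, the two steps (inflate, then extract) can be combined with Python's standard tarfile module. This is a minimal sketch; the function name and the destination directory in the usage comment are assumptions.

```python
import tarfile

def extract_database(archive_path, dest_dir="."):
    """Inflate and unpack a .tar.gz archive in one step; return the member names."""
    # Mode "r:gz" reads through gzip, so no separate gunzip step is needed.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names

# Example call (the destination directory name is hypothetical):
# extract_database("swissprot.tar.gz", "blastdb")
```

For a multi-volume database the function would be called once per volume, with all volumes extracted into the same directory.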
 
Table 2.1 File content for the ftp.ncbi.nlm.nih.gov/blast/db/ subdirectory

Name (1)                  Content
FASTA                     Subdirectory with databases in FASTA format
blastdb.html              Content description of blast databases
env_nr.tar.gz             CDS translation of nucleotide sequences from environmental samples
env_nt.tar.gz             Nucleotide sequences from environmental samples
est.##.tar.gz             Volumes for the est database. All are needed to reconstitute the complete est database.
est_human.tar.gz          Human est database mask file, requires the setup of all volumes of est database
est_mouse.tar.gz          Mouse est database mask file, requires the setup of all volumes of est database
est_others.tar.gz         Mask file for non-human, non-mouse subset of est database, requires the setup of all volumes of est database
gss.##.tar.gz             Volumes for the Genomic Survey Sequence database
htgs.##.tar.gz            Volumes for the High Throughput Genomic Sequences database, all volumes are needed to reconstitute complete htgs database
human_genomic.tar.gz      Human chromosome database with 24 chromosomes, each containing concatenated contigs with N’s-adjusted gaps
nr.tar.gz                 Non-redundant protein database
nt.##.tar.gz              Volumes for the nucleotide (nt) database, which is not non-redundant; all volumes are needed to reconstitute the database
other_genomic.tar.gz      Chromosome databases for organisms other than human
pataa.tar.gz              Patent protein sequence database
patnt.tar.gz              Patent nucleotide sequence database
pdbaa.tar.gz              Mask file for protein sequences from pdb entries, requires the setup of nr database
pdbnt.tar.gz              Database for nucleotide sequences from pdb entries. They are not coding sequences for the corresponding protein structure entries!
refseq_genomic.##.tar.gz  Genomic entries from the NCBI Reference Sequence Project, requires all volumes
refseq_protein.tar.gz     Protein entries from the NCBI Reference Sequence Project
refseq_rna.tar.gz         RNA entries from the NCBI Reference Sequence Project
sts.tar.gz                Database for Sequence Tagged Site entries
swissprot.tar.gz          Mask file for swissprot entries, requires the setup of protein nr database
taxdb.tar.gz              Taxonomy id database for use with preformatted databases
wgs.##.tar.gz             Volumes for Whole Genome Shotgun assemblies of various organisms, requires all volumes for proper reconstitution of the database

NOTE:
(1) ## are digits representing individual volumes.
 
 
2.1.1 File content for the ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ subdirectory
 
The FASTA database files are now stored in this subdirectory. It also contains some additional databases not available from the NCBI BLAST pages. Due to file size issues, the full est database is not provided. Users need to get the three subsets and concatenate them together to get the complete est database.
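
The concatenation of the three est subsets can be scripted. The sketch below decompresses each gzipped FASTA file and appends it to a single output file using only the Python standard library. The function name and the output file name in the usage comment are assumptions.

```python
import gzip
import shutil

def concatenate_fasta(gz_parts, output_path):
    """Decompress each gzipped FASTA part and append it to one output file."""
    with open(output_path, "wb") as out:
        for part in gz_parts:
            with gzip.open(part, "rb") as src:
                # Stream in chunks so large files never sit fully in memory.
                shutil.copyfileobj(src, out)

# Example call (the output name "est" is an assumption):
# concatenate_fasta(["est_human.gz", "est_mouse.gz", "est_others.gz"], "est")
```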
 
To use those databases with BLAST, the files need to be formatted with the formatdb program found in the standalone blast executable package, i.e., the archives whose names begin with blast. The recommended command lines are:
               formatdb -i input_db -p F -o T                 for nucleotide
               formatdb -i input_db -p T -o T                 for protein
 
For additional information on formatdb, please see the formatdb.html under the /blast/documents/ subdirectory.
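
When many FASTA files need formatting, the recommended command line can be assembled in a script. The sketch below only builds the argument list; it assumes formatdb's -i option names the input file and that the formatdb executable is on the search path. The function name is hypothetical.

```python
def formatdb_command(fasta_path, is_protein):
    """Build the recommended formatdb argument list for one FASTA file."""
    return [
        "formatdb",
        "-i", fasta_path,                   # input FASTA file
        "-p", "T" if is_protein else "F",   # T = protein, F = nucleotide
        "-o", "T",                          # as recommended in this document
    ]

print(" ".join(formatdb_command("ecoli.nt", is_protein=False)))
# The list can be passed to subprocess.run(...) to execute the command.
```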
 
 
Table 2.1.1 File content for the ftp.ncbi.nlm.nih.gov/blast/db/FASTA subdirectory

Name                 Content
alu.a.gz             Proteins translated from alu.n (1)
alu.n.gz             Alu repeat sequences (1)
drosoph.aa.gz        Drosophila proteins from genome annotation (1)
drosoph.nt.gz        Drosophila genome (1)
ecoli.aa.gz          E. coli K-12 proteins from genome annotation (1)
ecoli.nt.gz          E. coli K-12 genomic contigs (1)
env_nr.gz            Protein sequences from environmental samples
env_nt.gz            Nucleotide sequences from environmental samples
est_human.gz         Human subset of the est database
est_mouse.gz         Mouse subset of the est database
est_others.gz        Non-human, non-mouse subset of the est database
gss.gz               Genomic Survey Sequences (mostly BAC ends)
htgs.gz              High Throughput Genomic Sequences
human_genomic.gz     Human chromosomes (NC_######) formed by concatenation of genomic contig assemblies (NT_######) and adjusting the gaps with N’s
igSeqNt.gz           Immunoglobulin nucleotide sequences
igSeqProt.gz         Immunoglobulin protein sequences
mito.aa.gz           Proteins from the annotated mitochondrial genomes (1)
mito.nt.gz           Mitochondrial genomes (1)
month.aa.gz          Protein sequences released/updated in the past 30 days
month.est_human.gz   Human subset of EST released/updated in the past 30 days
month.est_mouse.gz   Mouse subset of EST released/updated in the past 30 days
month.est_others.gz  Non-human, non-mouse subset of EST released/updated in the past 30 days
month.gss.gz         gss entries released/updated in the past 30 days
month.htgs.gz        htgs entries released/updated in the past 30 days
month.nt.gz          nt sequences released/updated in the past 30 days
nr.gz                Non-redundant protein sequence database
nt.gz                Nucleotide database from GenBank excluding htgs, est, gss, sts, pat divisions, and wgs entries. Not non-redundant.
other_genomic.gz     Chromosome entries other than human
pataa.gz             Patent protein sequence database
patnt.gz             Patent nucleotide sequence database
pdbaa.gz             Protein sequences from pdb entries
pdbnt.gz             Nucleotide entries for pdb entries. They are not the coding sequences for the corresponding protein entries.
sts.gz               Sequence Tagged Sites database
swissprot.gz         swissprot database
vector.gz            Vector sequences from synthetic (syn) division of GenBank (1)
wgs.gz               Whole Genome Shotgun sequence assembly
yeast.aa.gz          Protein translations from yeast genome annotation (1)
yeast.nt.gz          Yeast genomic sequence (1)

NOTE:
(1) These files are not updated regularly.

 
 
2.2 File content for the ftp.ncbi.nlm.nih.gov/blast/demo/ subdirectory
 
This directory contains talks and posters that NCBI BLAST developers presented at various conferences, technical documentation of special functions relevant to BLAST, and some demo tools for BLAST. The target audience is programmers and power users.
 
Table 2.2 File content for the ftp.ncbi.nlm.nih.gov/blast/demo/ subdirectory

Name                    Content
README.blast_demo       Readme for blast_demo package
README.first            Readme for this directory
README.parse_blast_xml  Readme for parse_blast_xml package
benchmark               Subdirectory with a package for benchmarking BLAST performance. Files in the directory are listed below.
  BLAST_benchmarks.ppt  Document on the benchmark package
  benchmark.tar.gz      Benchmark package for assessing BLAST performance
blast_demo.tar.gz       blast_demo package on blast db, blastobj, and reformatting blast alignment from blastobj file
blast_exercises.doc     Exercise set with sample questions and answers
blast_programming.ppt   PowerPoint presentation on BLAST programming
blast_talk.ppt          PowerPoint presentation, 2002 O'Reilly conference
ieee_blast.final.ppt    PowerPoint presentation, 2003 IEEE conference
ieee_talk.pdf           Above IEEE presentation in PDF format
parse_blast_xml.tar.gz  Demo package on parsing xml styled blast output
splitd.ppt              PowerPoint presentation on splitd, a distributed computing setup implemented here at NCBI
test_suite.tar.gz       Test package (??)

NOTE:
Indented names are files under the subdirectory listed above them.

 
 
2.3 File content for the ftp.ncbi.nlm.nih.gov/blast/documents/ subdirectory
 
This directory contains documents on different programs found in the binary packages NCBI distributed from the BLAST ftp site under the /blast/executables/ subdirectory. Those relevant to standalone blast programs are also packaged in the binary distribution. They should be found in the /doc subdirectory once the standalone blast archive is extracted.
 
Table 2.3 File content for the ftp.ncbi.nlm.nih.gov/blast/documents/ subdirectory

Name                    Content
bl2seq.html             List of the program command line options
blast-sc2004.pdf        Poster presentation on splitd system implementation for BLAST server here at NCBI
blast.html              Setup/installation information for standalone blast package
blastall.html           Core command line program options and feature description on blastall program
blastclust.html         Description and list of command line program options for blastclust
blastdb.html            Document on blast databases under ftp.ncbi.nlm.nih.gov/blast/db/
blastftp.html           General description of NCBI blast ftp site (this file)
blastpgp.html           Document on blastpgp (standalone PSI-BLAST)
developer               Subdirectory for documents describing specific C functions used by BLAST. Files in this subdirectory are listed below.
  blast_seqalign.txt    A short description of the different types of seqalign generated by blast
  readdb.txt            A short document on the readdb function
  scoring.pdf           A comprehensive document on BLAST score and statistics
  urlapi.txt            A document on blasturl and its replacement URLAPI
fastacmd.html           Document on fastacmd, a sequence retrieval/fasta sequence dump program
filter.html             Document on the low complexity filter, the accepted inputs, and their functions
formatdb.html           Document on formatdb, a program used to format a FASTA input sequence file into a blastable database
formatrpsdb.html        Document on formatrpsdb, a program used to format rpsblast databases from blastpgp output
history.html            Document on the changes and bug fixes for the past blast releases
impala.html             Document on impala, an rpsblast-like domain search tool
index.html              A list of the available documents under this directory
megablast.html          Command line program options for megablast
netblast.html           Document on netblast setup, command line options, and databases available for search remotely from NCBI
rpsblast.html           Document on the rpsblast program. The information on how to generate an rpsblast database is outdated; refer to formatrpsdb.html instead.
seedtop.html            Detailed setup and program options for seedtop, a pattern matching program from NCBI
web_blast.pl            A sample Perl script for running BLAST searches using URLAPI
Xml                     Subdirectory with .dtd and .mod files for use with blast xml output. Files in this subdirectory are listed below.
  NCBI_BlastOutput.dtd  dtd file for blast output
  NCBI_BlastOutput.mod  mod file for blast output
  NCBI_Entity.mod       mod file for NCBI XML files
  README.blxml          Documentation on blast XML output

NOTE:
Indented names are files under the subdirectory listed above them.

 
 
2.4 File content for the ftp.ncbi.nlm.nih.gov/blast/executables/ subdirectory
 
This directory contains several subdirectories, each of which links to a specific set of executable BLAST programs.
 
Table 2.4 File content of the ftp.ncbi.nlm.nih.gov/blast/executables/ subdirectory

Name           Content
LATEST         Always links to the latest binaries: either the official release or the latest snapshot with interim bug fix(es) or feature enhancement(s)
release (1)    Archives of past official releases dating back to release 2.0.7, each within its own subdirectory
Snapshot (1)   Interim bug-fixed and/or feature-enhanced binaries in-between official releases

NOTE:
(1) Contents of the release and snapshot subdirectories will not be listed.

 
 
2.4.1 File content of the ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ subdirectory
 
This subdirectory contains the latest BLAST binaries for common platforms. They are either the latest official release or the latest interim release with bug fixes and/or important feature enhancements. There are three groups of binaries: the standalone command line package, with file names beginning with blast; the client blast package, with file names beginning with netblast; and the local server blast package, with file names beginning with wwwblast.
 
The package naming convention is best demonstrated by the following example. The hyphen-separated fields represent, from left to right, binary type, version, chipset, and OS; the file extension follows the OS field after a period:
 
               blast-2.2.11-ia32-linux.tar.gz  
 
The current version is 2.2.11. All the archives under this directory are listed in the table below. To make it more representative, the 2.2.11 version number is replaced with #.#.##.
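
To make the naming convention concrete, the following Python sketch splits an archive name into the fields described above. The function name and the dictionary keys are assumptions chosen for this example.

```python
def parse_archive_name(filename):
    """Split a name such as blast-2.2.11-ia32-linux.tar.gz into its fields."""
    # The first three hyphens separate type, version, and chipset.
    binary_type, version, chipset, rest = filename.split("-", 3)
    # The first period after the OS field starts the file extension.
    os_name, _, extension = rest.partition(".")
    return {
        "type": binary_type,
        "version": version,
        "chipset": chipset,
        "os": os_name,
        "extension": extension,
    }

print(parse_archive_name("blast-2.2.11-ia32-linux.tar.gz"))
```

A name that does not contain all four hyphen-separated fields raises a ValueError, which is a simple way to reject files that do not follow the convention.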
 
Table 2.4.1 File content for the ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ subdirectory

Type          Archive Name                            Target Platform (Chipset/OS)
Command line  blast-#.#.##-axp64-tru64.tar.gz         Compaq/HP alpha machine running OSF/Tru64
Command line  blast-#.#.##-ia32-freebsd.tar.gz        Intel Pentium PC running FreeBSD
Command line  blast-#.#.##-ia32-linux.tar.gz          Intel Pentium PC running Linux
Command line  blast-#.#.##-ia32-solaris.tar.gz        Intel Pentium PC running Solaris
Command line  blast-#.#.##-ia32-win32.exe             Intel Pentium PC running Windows
Command line  blast-#.#.##-ia64-linux.tar.gz          Intel 64-bit processor PC running Linux
Command line  blast-#.#.##-mips64-irix.tar.gz         SGI 64-bit
Command line  blast-#.#.##-ppc32-macosx.tar.gz        MacOSX, Terminal (backend BSD Unix)
Command line  blast-#.#.##-sparc64-solaris.tar.gz     Sun Sparc station running Solaris
Command line  blast-#.#.##-x64-linux.tar.gz           64-bit (x64) system running Linux
Command line  blast-#.#.##-x64-solaris10.tar.gz       64-bit (x64) system running Solaris 10
Client        netblast-#.#.##-axp64-tru64.tar.gz      Compaq/HP alpha machine running OSF/Tru64
Client        netblast-#.#.##-ia32-freebsd.tar.gz     Intel Pentium PC running FreeBSD
Client        netblast-#.#.##-ia32-linux.tar.gz       Intel Pentium PC running Linux
Client        netblast-#.#.##-ia32-solaris.tar.gz     Intel Pentium PC running Solaris
Client        netblast-#.#.##-ia32-win32.exe          Intel Pentium PC running Windows
Client        netblast-#.#.##-ia64-linux.tar.gz       Intel 64-bit processor PC running Linux
Client        netblast-#.#.##-mips64-irix.tar.gz      SGI 64-bit
Client        netblast-#.#.##-ppc32-macosx.tar.gz     MacOSX, Terminal (backend BSD Unix)
Client        netblast-#.#.##-sparc64-solaris.tar.gz  Sun Sparc station running Solaris
Client        netblast-#.#.##-x64-linux.tar.gz        64-bit (x64) system running Linux
Client        netblast-#.#.##-x64-solaris10.tar.gz    64-bit (x64) system running Solaris 10
Web server    wwwblast-#.#.##-axp64-tru64.tar.gz      Compaq/HP alpha machine running OSF/Tru64
Web server    wwwblast-#.#.##-ia32-freebsd.tar.gz     Intel Pentium PC running FreeBSD
Web server    wwwblast-#.#.##-ia32-linux.tar.gz       Intel Pentium PC running Linux
Web server    wwwblast-#.#.##-ia32-solaris.tar.gz     Intel Pentium PC running Solaris
Web server    wwwblast-#.#.##-ia64-linux.tar.gz       Intel 64-bit processor PC running Linux
Web server    wwwblast-#.#.##-mips64-irix.tar.gz      SGI 64-bit
Web server    wwwblast-#.#.##-ppc32-macosx.tar.gz     MacOSX, Terminal (backend BSD Unix)
Web server    wwwblast-#.#.##-sparc64-solaris.tar.gz  Sun Sparc station running Solaris
Web server    wwwblast-#.#.##-x64-linux.tar.gz        64-bit (x64) system running Linux
Web server    wwwblast-#.#.##-x64-solaris10.tar.gz    64-bit (x64) system running Solaris 10

NOTE:

Three types of binaries are available: command line, client, and web server. They are grouped accordingly in the above table.

The command line archive is for setting up blast on a user's local machine, with tools for preparing databases and running all types of searches locally.

The client archive is for configuring blast searches locally and sending them over the internet to NCBI. Only common blast searches are available. It is batch capable.

The web server archive is for setting up blast web pages locally under an existing web server (such as Apache) and running blast searches through a graphical user interface (web page).

 
2.6 File content of ftp.ncbi.nlm.nih.gov/blast/matrices/ subdirectory
 
This directory contains the BLOSUM and PAM protein scoring matrices. Even though all the protein matrices listed can be used by BLAST, BLAST provides statistical support for only five protein scoring matrices. These five matrices carry footnote marker 1 in the table below.
 

Table 2.6 File content of ftp.ncbi.nlm.nih.gov/blast/matrices/ subdirectory

BLOSUM Family of Matrices       PAM Family of Matrices
BLOSUM30        BLOSUM65        PAM10         PAM160.cdi    PAM330
BLOSUM30.50     BLOSUM65.50     PAM20         PAM170        PAM340
BLOSUM35        BLOSUM70        PAM30 (1)     PAM180        PAM350
BLOSUM35.50     BLOSUM70.50     PAM40         PAM190        PAM360
BLOSUM40        BLOSUM75        PAM40.cdi     PAM200        PAM370
BLOSUM40.50     BLOSUM75.50     PAM50         PAM200.cdi    PAM380
BLOSUM45 (1)    BLOSUM80 (1)    PAM60         PAM210        PAM390
BLOSUM45.50     BLOSUM80.50     PAM70 (1)     PAM220        PAM400
BLOSUM50        BLOSUM85        PAM80         PAM230        PAM410
BLOSUM50.50     BLOSUM85.50     PAM80.cdi     PAM240        PAM420
BLOSUM55        BLOSUM90        PAM90         PAM250        PAM430
BLOSUM55.50     BLOSUM90.50     PAM100        PAM250.cdi    PAM440
BLOSUM60        BLOSUM100       PAM110        PAM260        PAM450
BLOSUM60.50     BLOSUM100.50    PAM120        PAM270        PAM460
BLOSUM62 (1)    BLOSUMN         PAM120.cdi    PAM280        PAM470
BLOSUM62.50     BLOSUMN.50      PAM130        PAM290        PAM480
DAYHOFF         GONNET          PAM140        PAM300        PAM490
IDENTITY        MATCH           PAM150        PAM310        PAM500
NUC.4.2 (3)     NUC.4.4 (3)     PAM160        PAM320

NOTE:

(1) The five matrices statistically supported by BLAST.

(2) Special protein matrices are colored light blue in the HTML version of this table.

(3) The two nucleotide matrices.
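
As a small illustration of the first note, the sketch below checks whether a matrix name is one of the five matrices marked as statistically supported in the table above. The constant and function names are assumptions made for this example.

```python
# The five matrices carrying footnote marker 1 in the table above.
SUPPORTED_MATRICES = frozenset({"BLOSUM45", "BLOSUM62", "BLOSUM80", "PAM30", "PAM70"})

def is_statistically_supported(matrix_name):
    """Return True if the name is one of the five supported protein matrices."""
    return matrix_name.strip().upper() in SUPPORTED_MATRICES

for name in ("BLOSUM62", "PAM250"):
    print(name, is_statistically_supported(name))
```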

 
3. Technical Support
 
Additional questions and/or comments on this ftp site as well as this document should be directed to NCBI User Service:
               blast-help@ncbi.nlm.nih.gov 
 
Questions on general NCBI resources should be directed to:
               info@ncbi.nlm.nih.gov