scope

 

Function

Convert raw scop classification file to embl-like format

Description

Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. A knowledge of these relationships is crucial to our understanding of the evolution of proteins and of development. It will also play an important role in the analysis of the sequence data that is being produced by worldwide genome projects.

The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the Protein Data Bank (PDB).

scope reads the SCOP classification file available at http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?dir=lin

scope writes the SCOP classification to an EMBL-like format file.

No changes are made to the data other than changing the format in which it is held.

This EMBL-like format SCOP file is used by several other EMBOSS programs.

The SCOP database is converted to an EMBL-like format before being used by other EMBOSS programs because that format is easier to work with than the native SCOP database format.

Usage

Here is a sample session with scope:


% scope
Convert raw scop classification file to embl-like format
Name of scop file for input (raw format) [scop.orig]: /data/scop/scop.orig
Name of scop file for output (embl-like format) [Escop.dat]: Escop.test

Command line arguments

   Mandatory qualifiers:
  [-infile]            infile     Name of scop file for input (raw format)
  [-outfile]           outfile    Name of scop file for output (embl-like
                                  format)

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   General qualifiers:
  -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers                                                       Allowed values  Default
[-infile] (Parameter 1)   Name of scop file for input (raw format)         Input file      scop.orig
[-outfile] (Parameter 2)  Name of scop file for output (embl-like format)  Output file     Escop.dat

Optional qualifiers                                                        Allowed values  Default
(none)

Advanced qualifiers                                                        Allowed values  Default
(none)

Input file format

The native format SCOP database input file is available at http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?dir=lin

The format of this file is explained at http://scop.mrc-lmb.cam.ac.uk/scop/parindex.html

The file given at this URL contains a single line for each domain in SCOP, including text describing the position of the domain in the SCOP hierarchy. Note that other SCOP classification files, without this annotation, are available at http://scop.mrc-lmb.cam.ac.uk/scop/parindex.html

Output file format

The output records used to describe an entry are given below. Records (4) to (8) are used to describe the position of the domain in the scop hierarchy.

  1. ID - Domain identifier code. This is a 7-character code that uniquely identifies the domain in scop. It is identical to the first 7 characters of a line in the scop classification file. The first character is always 'D', the next four characters are the PDB identifier code, the sixth character is the PDB chain identifier to which the domain belongs (a '.' is given in cases where the domain is composed of multiple chains, a '_' is given where a chain identifier was not specified in the PDB file) and the final character is the number of the domain in the chain (for chains comprising more than one domain) or '_' (the chain comprises a single domain only).
  2. EN - PDB identifier code. This is the 4-character PDB identifier code of the PDB entry containing the domain.
  3. OS - Source of the protein. It is identical to the text given after 'Species' in the scop classification file.
  4. CL - Domain class. It is identical to the text given after 'Class' in the scop classification file.
  5. FO - Domain fold. It is identical to the text given after 'Fold' in the scop classification file.
  6. SF - Domain superfamily. It is identical to the text given after 'Superfamily' in the scop classification file.
  7. FA - Domain family. It is identical to the text given after 'Family' in the scop classification file.
  8. DO - Domain name. It is identical to the text given after 'Protein' in the scop classification file.
  9. NC - Number of chains comprising the domain (usually 1). If the number of chains is greater than 1, then the domain entry will have a section containing a CN and a CH record (see below) for each chain.
  10. CN - Chain number. The number given in brackets after this record indicates the start of the data for the relevant chain.
  11. CH - Domain definition. The character given before CHAIN is the PDB chain identifier (a '.' is given in cases where a chain identifier was not specified in the scop classification file), the strings before START and END give the start and end positions respectively of the domain in the PDB file (a '.' is given in cases where a position was not specified). Note that the start and end positions refer to the residue numbering given in the original PDB file and therefore must be treated as strings.
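
The layout of the ID record can be illustrated with a minimal Python sketch. It is not part of EMBOSS; the names ScopDomainId and split_domain_id are hypothetical and chosen only for this illustration. It assumes the identifier is exactly 7 characters, starts with 'D', and carries the chain identifier in the character that follows the four-character PDB code.

```python
from typing import NamedTuple


class ScopDomainId(NamedTuple):
    """Parts of a 7-character domain identifier (hypothetical helper type)."""
    pdb_code: str  # 4-character PDB identifier code
    chain: str     # chain identifier; '.' = multiple chains, '_' = not specified
    domain: str    # domain number in the chain; '_' = single-domain chain


def split_domain_id(identifier: str) -> ScopDomainId:
    """Split an identifier such as 'D3SDHA_' into PDB code, chain and domain."""
    if len(identifier) != 7 or identifier[0] != "D":
        raise ValueError(f"not a 7-character domain identifier: {identifier!r}")
    return ScopDomainId(identifier[1:5], identifier[5], identifier[6])


parts = split_domain_id("D3SDHA_")
print(parts.pdb_code, parts.chain, parts.domain)
```

The chain and domain fields are kept as single-character strings so that the special values '.' and '_' pass through unchanged.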

An example of an excerpt from an output file follows:


ID   D3SDHA_
XX
EN   3SDH
XX
OS   Ark clam (Scapharca inaequivalvis)
XX
CL   All alpha proteins
XX
FO   Globin-like
XX
SF   Globin-like
XX
FA   Globins
XX
DO   Hemoglobin I
XX
NC   1
XX
CN   [1]
XX
CH   a CHAIN; . START; . END;
//
ID   D3SDHB_
XX
EN   3SDH
XX
OS   Ark clam (Scapharca inaequivalvis)
XX
CL   All alpha proteins
XX
FO   Globin-like
XX
SF   Globin-like
XX
FA   Globins
XX
DO   Hemoglobin I
XX
NC   1
XX
CN   [1]
XX
CH   b CHAIN; . START; . END;
//
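
A file in this format could be read with a short Python sketch such as the one below. It is an illustration only, not code from EMBOSS; the function names parse_scope_records and parse_chain are hypothetical. It assumes that each record starts with a two-letter code, that XX lines are spacers, and that // ends an entry, as in the excerpt above. Start and end positions are kept as strings, as the CH record description requires.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_scope_records(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per entry, mapping a record code to its list of values."""
    entry: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):  # end of the current entry
            if entry:
                yield entry
            entry = {}
            continue
        code, value = line[:2], line[2:].strip()
        if not code.strip() or code == "XX":  # blank line or spacer record
            continue
        entry.setdefault(code, []).append(value)
    if entry:  # tolerate a final entry without a closing '//'
        yield entry


def parse_chain(ch_value: str) -> Tuple[str, str, str]:
    """Split a CH value such as 'a CHAIN; . START; . END;' into strings."""
    parts = {}
    for field in ch_value.split(";"):
        field = field.strip()
        if field:
            value, keyword = field.rsplit(None, 1)  # '<value> <KEYWORD>'
            parts[keyword] = value
    return parts["CHAIN"], parts["START"], parts["END"]


sample = """\
ID   D3SDHA_
XX
EN   3SDH
XX
NC   1
XX
CN   [1]
XX
CH   a CHAIN; . START; . END;
//
ID   D3SDHB_
XX
EN   3SDH
XX
NC   1
XX
CN   [1]
XX
CH   b CHAIN; . START; . END;
//
"""

entries = list(parse_scope_records(sample.splitlines()))
for e in entries:
    print(e["ID"][0], parse_chain(e["CH"][0]))
```

Values are stored in lists because a domain made of several chains has one CN and one CH record per chain.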

Data files

None.

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name    Description
aaindexextract  Extract data from AAINDEX
cutgextract     Extract data from CUTG
printsextract   Extract data from PRINTS
prosextract     Builds the PROSITE motif database for patmatmotifs to search
rebaseextract   Extract data from REBASE
tfextract       Extract data from TRANSFAC

Author(s)

This application was written by Jon Ison (jison@hgmp.mrc.ac.uk)

History

Written (Jan 2001) - Jon Ison.

Target users

This program is intended to be run by EMBOSS site maintainers or those responsible for setting up and maintaining protein 3D structural data for use by others.

Comments