org.biojavax.bio.seq.io
Class UniProtXMLFormat

java.lang.Object
  extended by org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
      extended by org.biojavax.bio.seq.io.UniProtXMLFormat
All Implemented Interfaces:
SequenceFormat, RichSequenceFormat

public class UniProtXMLFormat
extends RichSequenceFormat.BasicFormat

Format reader for UniProtXML files. This version of the UniProtXML format will generate and write RichSequence objects. Loosely based on code from the old, deprecated org.biojava.bio.seq.io.GenbankXmlFormat class. Understands http://www.ebi.uniprot.org/support/docs/uniprot.xsd

Since:
1.5
Author:
Alan Li (code based on his work), Richard Holland

Nested Class Summary
static class UniProtXMLFormat.Terms
          Implements some UniProtXML-specific terms.
 
Nested classes/interfaces inherited from interface org.biojavax.bio.seq.io.RichSequenceFormat
RichSequenceFormat.BasicFormat, RichSequenceFormat.HeaderlessFormat
 
Field Summary
protected static String ACCESSION_TAG
           
protected static String AUTHOR_LIST_TAG
           
protected static String CITATION_TAG
           
protected static String COMMENT_ABS_MAX_TAG
           
protected static String COMMENT_ABSORPTION_TAG
           
protected static String COMMENT_ERROR_ATTR
           
protected static String COMMENT_EVENT_TAG
           
protected static String COMMENT_EXPERIMENTS_TAG
           
protected static String COMMENT_INTERACT_INTACT_ATTR
           
protected static String COMMENT_INTERACT_LABEL_TAG
           
protected static String COMMENT_INTERACTANT_TAG
           
protected static String COMMENT_ISOFORM_TAG
           
protected static String COMMENT_KIN_KM_TAG
           
protected static String COMMENT_KIN_VMAX_TAG
           
protected static String COMMENT_KINETICS_TAG
           
protected static String COMMENT_LINK_TAG
           
protected static String COMMENT_LINK_URI_ATTR
           
protected static String COMMENT_LOCTYPE_ATTR
           
protected static String COMMENT_MASS_ATTR
           
protected static String COMMENT_METHOD_ATTR
           
protected static String COMMENT_ORGANISMS_TAG
           
protected static String COMMENT_PH_TAG
           
protected static String COMMENT_REDOX_TAG
           
protected static String COMMENT_TAG
           
protected static String COMMENT_TEMPERATURE_TAG
           
protected static String COMPONENT_TAG
           
protected static String CONSORTIUM_TAG
           
protected static String COPYRIGHT_TAG
           
protected static String DBXREF_TAG
           
protected static String DOMAIN_TAG
           
protected static String EDITOR_LIST_TAG
           
protected static String ENTRY_CREATED_ATTR
           
protected static String ENTRY_GROUP_TAG
           
protected static String ENTRY_NAMESPACE_ATTR
           
protected static String ENTRY_TAG
           
protected static String ENTRY_UPDATED_ATTR
           
protected static String ENTRY_VERSION_ATTR
           
protected static String EVIDENCE_ATTR
           
protected static String EVIDENCE_ATTRIBUTE_ATTR
           
protected static String EVIDENCE_CATEGORY_ATTR
           
protected static String EVIDENCE_DATE_ATTR
           
protected static String EVIDENCE_TAG
           
protected static String FEATURE_DESC_ATTR
           
protected static String FEATURE_ORIGINAL_TAG
           
protected static String FEATURE_TAG
           
protected static String FEATURE_VARIATION_TAG
           
protected static String GENE_TAG
           
protected static String GENELOCATION_NAME_TAG
           
protected static String GENELOCATION_TAG
           
protected static String ID_ATTR
           
protected static String ID_TAG
           
protected static String KEY_ATTR
           
protected static String KEYWORD_TAG
           
protected static String LINEAGE_TAG
           
protected static String LOCATION_BEGIN_TAG
           
protected static String LOCATION_END_TAG
           
protected static String LOCATION_POSITION_ATTR
           
protected static String LOCATION_POSITION_TAG
           
protected static String LOCATION_SEQ_ATTR
           
protected static String LOCATION_TAG
           
protected static String LOCATOR_TAG
           
protected static String NAME_ATTR
           
protected static String NAME_TAG
           
protected static String NOTE_TAG
           
protected static String ORGANISM_TAG
           
protected static String PERSON_TAG
           
protected static String PROPERTY_TAG
           
protected static String PROTEIN_EXISTS_TAG
           
protected static String PROTEIN_TAG
           
protected static String PROTEIN_TYPE_ATTR
           
protected static String RC_LINE_TAG
           
protected static String RC_PLASMID_TAG
           
protected static String RC_SPECIES_TAG
           
protected static String RC_STRAIN_TAG
           
protected static String RC_TISSUE_TAG
           
protected static String RC_TRANSP_TAG
           
protected static String REF_ATTR
           
protected static String REFERENCE_TAG
           
protected static String RP_LINE_TAG
           
protected static Pattern rppat
           
protected static String SEQUENCE_CHECKSUM_ATTR
           
protected static String SEQUENCE_LENGTH_ATTR
           
protected static String SEQUENCE_MASS_ATTR
           
protected static String SEQUENCE_MODIFIED_ATTR
           
protected static String SEQUENCE_TAG
           
protected static String SEQUENCE_VERSION_ATTR
           
protected static String STATUS_ATTR
           
protected static String TAXON_TAG
           
protected static String TEXT_TAG
           
protected static String TITLE_TAG
           
protected static String TYPE_ATTR
           
static String UNIPROTXML_FORMAT
          The name of this format
protected static String VALUE_ATTR
           
protected static Pattern xmlSchema
           
 
Constructor Summary
UniProtXMLFormat()
           
 
Method Summary
 void beginWriting()
          Informs the writer that we want to start writing.
 boolean canRead(BufferedInputStream stream)
          Check to see if a given stream is in our format. A stream is in UniProtXML format if the second XML line contains the phrase "http://www.uniprot.org/support/docs/uniprot.xsd".
 boolean canRead(File file)
          Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in UniProtXML format if the second XML line contains the phrase "http://www.uniprot.org/support/docs/uniprot.xsd".
 void finishWriting()
          Informs the writer that we are done writing.
 String getDefaultFormat()
          getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
 SymbolTokenization guessSymbolTokenization(BufferedInputStream stream)
          On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. Always returns a protein tokenizer.
 SymbolTokenization guessSymbolTokenization(File file)
          On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the file. For formats that accept multiple tokenizations, it's up to you how you do it. Always returns a protein tokenizer.
 boolean readRichSequence(BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns)
          Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols. If namespace is null, then the namespace of the sequence in the UniProtXML file is used.
 boolean readSequence(BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener)
          Read a sequence and pass data on to a SeqIOListener.
 void writeSequence(Sequence seq, Namespace ns)
          Writes a sequence out to the output stream given by beginWriting() using the default format of the implementing class. If namespace is null, then the sequence's own namespace is used.
 void writeSequence(Sequence seq, PrintStream os)
          writeSequence writes a sequence to the specified PrintStream, using the default format.
 void writeSequence(Sequence seq, String format, PrintStream os)
          writeSequence writes a sequence to the specified PrintStream, using the specified format.
 
Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
getElideComments, getElideFeatures, getElideReferences, getElideSymbols, getLineWidth, getPrintStream, setElideComments, setElideFeatures, setElideReferences, setElideSymbols, setLineWidth, setPrintStream
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

UNIPROTXML_FORMAT

public static final String UNIPROTXML_FORMAT
The name of this format

See Also:
Constant Field Values

ENTRY_GROUP_TAG

protected static final String ENTRY_GROUP_TAG
See Also:
Constant Field Values

ENTRY_TAG

protected static final String ENTRY_TAG
See Also:
Constant Field Values

ENTRY_VERSION_ATTR

protected static final String ENTRY_VERSION_ATTR
See Also:
Constant Field Values

ENTRY_NAMESPACE_ATTR

protected static final String ENTRY_NAMESPACE_ATTR
See Also:
Constant Field Values

ENTRY_CREATED_ATTR

protected static final String ENTRY_CREATED_ATTR
See Also:
Constant Field Values

ENTRY_UPDATED_ATTR

protected static final String ENTRY_UPDATED_ATTR
See Also:
Constant Field Values

COPYRIGHT_TAG

protected static final String COPYRIGHT_TAG
See Also:
Constant Field Values

ACCESSION_TAG

protected static final String ACCESSION_TAG
See Also:
Constant Field Values

NAME_TAG

protected static final String NAME_TAG
See Also:
Constant Field Values

TEXT_TAG

protected static final String TEXT_TAG
See Also:
Constant Field Values

REF_ATTR

protected static final String REF_ATTR
See Also:
Constant Field Values

TYPE_ATTR

protected static final String TYPE_ATTR
See Also:
Constant Field Values

KEY_ATTR

protected static final String KEY_ATTR
See Also:
Constant Field Values

ID_ATTR

protected static final String ID_ATTR
See Also:
Constant Field Values

EVIDENCE_ATTR

protected static final String EVIDENCE_ATTR
See Also:
Constant Field Values

VALUE_ATTR

protected static final String VALUE_ATTR
See Also:
Constant Field Values

STATUS_ATTR

protected static final String STATUS_ATTR
See Also:
Constant Field Values

NAME_ATTR

protected static final String NAME_ATTR
See Also:
Constant Field Values

PROTEIN_TAG

protected static final String PROTEIN_TAG
See Also:
Constant Field Values

PROTEIN_TYPE_ATTR

protected static final String PROTEIN_TYPE_ATTR
See Also:
Constant Field Values

DOMAIN_TAG

protected static final String DOMAIN_TAG
See Also:
Constant Field Values

COMPONENT_TAG

protected static final String COMPONENT_TAG
See Also:
Constant Field Values

GENE_TAG

protected static final String GENE_TAG
See Also:
Constant Field Values

ORGANISM_TAG

protected static final String ORGANISM_TAG
See Also:
Constant Field Values

DBXREF_TAG

protected static final String DBXREF_TAG
See Also:
Constant Field Values

PROPERTY_TAG

protected static final String PROPERTY_TAG
See Also:
Constant Field Values

LINEAGE_TAG

protected static final String LINEAGE_TAG
See Also:
Constant Field Values

TAXON_TAG

protected static final String TAXON_TAG
See Also:
Constant Field Values

GENELOCATION_TAG

protected static final String GENELOCATION_TAG
See Also:
Constant Field Values

GENELOCATION_NAME_TAG

protected static final String GENELOCATION_NAME_TAG
See Also:
Constant Field Values

REFERENCE_TAG

protected static final String REFERENCE_TAG
See Also:
Constant Field Values

CITATION_TAG

protected static final String CITATION_TAG
See Also:
Constant Field Values

TITLE_TAG

protected static final String TITLE_TAG
See Also:
Constant Field Values

EDITOR_LIST_TAG

protected static final String EDITOR_LIST_TAG
See Also:
Constant Field Values

AUTHOR_LIST_TAG

protected static final String AUTHOR_LIST_TAG
See Also:
Constant Field Values

PERSON_TAG

protected static final String PERSON_TAG
See Also:
Constant Field Values

CONSORTIUM_TAG

protected static final String CONSORTIUM_TAG
See Also:
Constant Field Values

LOCATOR_TAG

protected static final String LOCATOR_TAG
See Also:
Constant Field Values

RP_LINE_TAG

protected static final String RP_LINE_TAG
See Also:
Constant Field Values

RC_LINE_TAG

protected static final String RC_LINE_TAG
See Also:
Constant Field Values

RC_SPECIES_TAG

protected static final String RC_SPECIES_TAG
See Also:
Constant Field Values

RC_TISSUE_TAG

protected static final String RC_TISSUE_TAG
See Also:
Constant Field Values

RC_TRANSP_TAG

protected static final String RC_TRANSP_TAG
See Also:
Constant Field Values

RC_STRAIN_TAG

protected static final String RC_STRAIN_TAG
See Also:
Constant Field Values

RC_PLASMID_TAG

protected static final String RC_PLASMID_TAG
See Also:
Constant Field Values

COMMENT_TAG

protected static final String COMMENT_TAG
See Also:
Constant Field Values

COMMENT_MASS_ATTR

protected static final String COMMENT_MASS_ATTR
See Also:
Constant Field Values

COMMENT_ERROR_ATTR

protected static final String COMMENT_ERROR_ATTR
See Also:
Constant Field Values

COMMENT_METHOD_ATTR

protected static final String COMMENT_METHOD_ATTR
See Also:
Constant Field Values

COMMENT_LOCTYPE_ATTR

protected static final String COMMENT_LOCTYPE_ATTR
See Also:
Constant Field Values

COMMENT_ABSORPTION_TAG

protected static final String COMMENT_ABSORPTION_TAG
See Also:
Constant Field Values

COMMENT_ABS_MAX_TAG

protected static final String COMMENT_ABS_MAX_TAG
See Also:
Constant Field Values

COMMENT_KINETICS_TAG

protected static final String COMMENT_KINETICS_TAG
See Also:
Constant Field Values

COMMENT_KIN_KM_TAG

protected static final String COMMENT_KIN_KM_TAG
See Also:
Constant Field Values

COMMENT_KIN_VMAX_TAG

protected static final String COMMENT_KIN_VMAX_TAG
See Also:
Constant Field Values

COMMENT_PH_TAG

protected static final String COMMENT_PH_TAG
See Also:
Constant Field Values

COMMENT_REDOX_TAG

protected static final String COMMENT_REDOX_TAG
See Also:
Constant Field Values

COMMENT_TEMPERATURE_TAG

protected static final String COMMENT_TEMPERATURE_TAG
See Also:
Constant Field Values

COMMENT_LINK_TAG

protected static final String COMMENT_LINK_TAG
See Also:
Constant Field Values

COMMENT_LINK_URI_ATTR

protected static final String COMMENT_LINK_URI_ATTR
See Also:
Constant Field Values

COMMENT_EVENT_TAG

protected static final String COMMENT_EVENT_TAG
See Also:
Constant Field Values

COMMENT_ISOFORM_TAG

protected static final String COMMENT_ISOFORM_TAG
See Also:
Constant Field Values

COMMENT_INTERACTANT_TAG

protected static final String COMMENT_INTERACTANT_TAG
See Also:
Constant Field Values

COMMENT_INTERACT_INTACT_ATTR

protected static final String COMMENT_INTERACT_INTACT_ATTR
See Also:
Constant Field Values

COMMENT_INTERACT_LABEL_TAG

protected static final String COMMENT_INTERACT_LABEL_TAG
See Also:
Constant Field Values

COMMENT_ORGANISMS_TAG

protected static final String COMMENT_ORGANISMS_TAG
See Also:
Constant Field Values

COMMENT_EXPERIMENTS_TAG

protected static final String COMMENT_EXPERIMENTS_TAG
See Also:
Constant Field Values

NOTE_TAG

protected static final String NOTE_TAG
See Also:
Constant Field Values

KEYWORD_TAG

protected static final String KEYWORD_TAG
See Also:
Constant Field Values

PROTEIN_EXISTS_TAG

protected static final String PROTEIN_EXISTS_TAG
See Also:
Constant Field Values

ID_TAG

protected static final String ID_TAG
See Also:
Constant Field Values

FEATURE_TAG

protected static final String FEATURE_TAG
See Also:
Constant Field Values

FEATURE_DESC_ATTR

protected static final String FEATURE_DESC_ATTR
See Also:
Constant Field Values

FEATURE_ORIGINAL_TAG

protected static final String FEATURE_ORIGINAL_TAG
See Also:
Constant Field Values

FEATURE_VARIATION_TAG

protected static final String FEATURE_VARIATION_TAG
See Also:
Constant Field Values

EVIDENCE_TAG

protected static final String EVIDENCE_TAG
See Also:
Constant Field Values

EVIDENCE_CATEGORY_ATTR

protected static final String EVIDENCE_CATEGORY_ATTR
See Also:
Constant Field Values

EVIDENCE_ATTRIBUTE_ATTR

protected static final String EVIDENCE_ATTRIBUTE_ATTR
See Also:
Constant Field Values

EVIDENCE_DATE_ATTR

protected static final String EVIDENCE_DATE_ATTR
See Also:
Constant Field Values

LOCATION_TAG

protected static final String LOCATION_TAG
See Also:
Constant Field Values

LOCATION_SEQ_ATTR

protected static final String LOCATION_SEQ_ATTR
See Also:
Constant Field Values

LOCATION_BEGIN_TAG

protected static final String LOCATION_BEGIN_TAG
See Also:
Constant Field Values

LOCATION_END_TAG

protected static final String LOCATION_END_TAG
See Also:
Constant Field Values

LOCATION_POSITION_ATTR

protected static final String LOCATION_POSITION_ATTR
See Also:
Constant Field Values

LOCATION_POSITION_TAG

protected static final String LOCATION_POSITION_TAG
See Also:
Constant Field Values

SEQUENCE_TAG

protected static final String SEQUENCE_TAG
See Also:
Constant Field Values

SEQUENCE_VERSION_ATTR

protected static final String SEQUENCE_VERSION_ATTR
See Also:
Constant Field Values

SEQUENCE_LENGTH_ATTR

protected static final String SEQUENCE_LENGTH_ATTR
See Also:
Constant Field Values

SEQUENCE_MASS_ATTR

protected static final String SEQUENCE_MASS_ATTR
See Also:
Constant Field Values

SEQUENCE_CHECKSUM_ATTR

protected static final String SEQUENCE_CHECKSUM_ATTR
See Also:
Constant Field Values

SEQUENCE_MODIFIED_ATTR

protected static final String SEQUENCE_MODIFIED_ATTR
See Also:
Constant Field Values

rppat

protected static final Pattern rppat

xmlSchema

protected static final Pattern xmlSchema
Constructor Detail

UniProtXMLFormat

public UniProtXMLFormat()
Method Detail

canRead

public boolean canRead(File file)
                throws IOException
Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in UniProtXML format if the second XML line contains the phrase "http://www.uniprot.org/support/docs/uniprot.xsd".

Specified by:
canRead in interface RichSequenceFormat
Overrides:
canRead in class RichSequenceFormat.BasicFormat
Parameters:
file - the File to check.
Returns:
true if the file is readable by this format, false if not.
Throws:
IOException - in case the file is inaccessible.

guessSymbolTokenization

public SymbolTokenization guessSymbolTokenization(File file)
                                           throws IOException
On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the file. For formats that accept multiple tokenizations, it's up to you how you do it. Always returns a protein tokenizer.

Specified by:
guessSymbolTokenization in interface RichSequenceFormat
Overrides:
guessSymbolTokenization in class RichSequenceFormat.BasicFormat
Parameters:
file - the File object to guess the format of.
Returns:
a SymbolTokenization to read the file with.
Throws:
IOException - if the file is unrecognisable or inaccessible.

canRead

public boolean canRead(BufferedInputStream stream)
                throws IOException
Check to see if a given stream is in our format. A stream is in UniProtXML format if the second XML line contains the phrase "http://www.uniprot.org/support/docs/uniprot.xsd".

Parameters:
stream - the BufferedInputStream to check.
Returns:
true if the stream is readable by this format, false if not.
Throws:
IOException - in case the stream is inaccessible.
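
The second-line check described above can be sketched with JDK classes alone. This is a minimal illustration, not the library's implementation: the class name UniProtXmlSniffer, the method looksLikeUniProtXml, and the 2000-byte peek limit are assumptions made for the example. It marks the stream, inspects the second line, and resets the stream so the caller can still parse it from the start.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of BioJava. */
final class UniProtXmlSniffer {
    /** Phrase the documentation says must appear on the second XML line. */
    static final String SCHEMA_PHRASE = "http://www.uniprot.org/support/docs/uniprot.xsd";

    /** Assumed upper bound on how many bytes to peek at. */
    static final int PEEK_LIMIT = 2000;

    /**
     * Returns true if the second line of the stream contains SCHEMA_PHRASE.
     * The stream is reset to its original position before returning.
     */
    static boolean looksLikeUniProtXml(BufferedInputStream stream) throws IOException {
        stream.mark(PEEK_LIMIT);
        try {
            // Read at most PEEK_LIMIT bytes so the mark stays valid.
            byte[] head = new byte[PEEK_LIMIT];
            int total = 0;
            while (total < head.length) {
                int n = stream.read(head, total, head.length - total);
                if (n < 0) {
                    break; // end of stream
                }
                total += n;
            }
            String text = new String(head, 0, total, StandardCharsets.UTF_8);
            // Split into first line, second line, and the remainder.
            String[] lines = text.split("\r?\n", 3);
            return lines.length >= 2 && lines[1].contains(SCHEMA_PHRASE);
        } finally {
            stream.reset();
        }
    }
}
```

Reading a bounded number of bytes, rather than wrapping the stream in a reader, keeps the read within the mark limit. A phrase that falls beyond the first 2000 bytes would be missed by this sketch.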

guessSymbolTokenization

public SymbolTokenization guessSymbolTokenization(BufferedInputStream stream)
                                           throws IOException
On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the stream. For formats that accept multiple tokenizations, it's up to you how you do it. Always returns a protein tokenizer.

Parameters:
stream - the BufferedInputStream object to guess the format of.
Returns:
a SymbolTokenization to read the stream with.
Throws:
IOException - if the stream is unrecognisable or inaccessible.

readSequence

public boolean readSequence(BufferedReader reader,
                            SymbolTokenization symParser,
                            SeqIOListener listener)
                     throws IllegalSymbolException,
                            IOException,
                            ParseException
Read a sequence and pass data on to a SeqIOListener.

Parameters:
reader - The stream of data to parse.
symParser - A SymbolTokenization defining a mapping from character data to Symbols.
listener - A listener to notify when data is extracted from the stream.
Returns:
a boolean indicating whether or not the stream contains any more sequences.
Throws:
IllegalSymbolException - if it is not possible to translate character data from the stream into valid BioJava symbols.
IOException - if an error occurs while reading from the stream.
ParseException - if the data in the stream cannot be parsed.

readRichSequence

public boolean readRichSequence(BufferedReader reader,
                                SymbolTokenization symParser,
                                RichSeqIOListener rlistener,
                                Namespace ns)
                         throws IllegalSymbolException,
                                IOException,
                                ParseException
Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols. Events are passed to the listener, and the namespace used for sequences read is the one given. If the namespace is null, then the namespace of the sequence in the UniProtXML file is used. If the namespace is null and so is the namespace of the sequence in the file, then the default namespace is used.

Parameters:
reader - the input source
symParser - the tokenizer which understands the sequence being read
rlistener - the listener to send sequence events to
ns - the namespace to read sequences into.
Returns:
true if there is more to read after this, false otherwise.
Throws:
IllegalSymbolException - if the tokenizer couldn't understand one of the sequence symbols in the file.
IOException - if there was a read error.
ParseException - if the data in the stream cannot be parsed.
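
The namespace fallback order for this method (the namespace given, then the one recorded with the sequence in the file, then the default) can be sketched as a small helper. This is only an illustration under stated assumptions: NamespaceFallback and resolve are hypothetical names, and plain strings stand in for Namespace objects.

```java
/** Hypothetical helper for illustration; plain strings stand in for Namespace objects. */
final class NamespaceFallback {
    /**
     * Picks the namespace to read a sequence into.
     *
     * @param requested the namespace passed by the caller, may be null
     * @param fromEntry the namespace recorded with the sequence in the file, may be null
     * @param fallback  the default namespace, used when both others are null
     */
    static String resolve(String requested, String fromEntry, String fallback) {
        if (requested != null) {
            return requested;   // an explicit namespace always wins
        }
        if (fromEntry != null) {
            return fromEntry;   // otherwise use what the file says
        }
        return fallback;        // last resort: the default namespace
    }
}
```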

beginWriting

public void beginWriting()
                  throws IOException
Informs the writer that we want to start writing. This will do any initialisation required, such as writing the opening tags of an XML file that groups sequences together.

Throws:
IOException - if writing fails.

finishWriting

public void finishWriting()
                   throws IOException
Informs the writer that we are done writing. This will do any finalisation required, such as writing the closing tags of an XML file that groups sequences together.

Throws:
IOException - if writing fails.
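
The begin/write/finish lifecycle can be sketched with a small stand-alone writer. This is a minimal illustration of the pattern, not this class's actual output: GroupedXmlWriter and writeEntry are hypothetical names, the element names are chosen for the example, and no XML escaping is performed.

```java
import java.io.PrintStream;

/** Hypothetical writer for illustration; not part of BioJava. */
final class GroupedXmlWriter {
    private final PrintStream out;

    GroupedXmlWriter(PrintStream out) {
        this.out = out;
    }

    /** Writes the opening tag of the element that groups all entries. */
    void beginWriting() {
        out.print("<uniprot>\n");
    }

    /** Writes one entry; the accession is assumed to need no XML escaping. */
    void writeEntry(String accession) {
        out.print("  <entry><accession>" + accession + "</accession></entry>\n");
    }

    /** Writes the closing tag of the grouping element and flushes the stream. */
    void finishWriting() {
        out.print("</uniprot>\n");
        out.flush();
    }
}
```

Calling beginWriting() once, writeEntry() for each sequence, and finishWriting() once yields a single well-formed document, whereas writing each entry as its own document would not group them.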

writeSequence

public void writeSequence(Sequence seq,
                          PrintStream os)
                   throws IOException
writeSequence writes a sequence to the specified PrintStream, using the default format.

Parameters:
seq - the sequence to write out.
os - the printstream to write to.
Throws:
IOException - if writing fails.

writeSequence

public void writeSequence(Sequence seq,
                          String format,
                          PrintStream os)
                   throws IOException
writeSequence writes a sequence to the specified PrintStream, using the specified format.

Parameters:
seq - a Sequence to write out.
format - a String indicating which sub-format of those available from a particular SequenceFormat implementation to use when writing.
os - a PrintStream object.
Throws:
IOException - if an error occurs.

writeSequence

public void writeSequence(Sequence seq,
                          Namespace ns)
                   throws IOException
Writes a sequence out to the output stream given by beginWriting() using the default format of the implementing class. If a namespace is given, sequences will be written with that namespace; if it is null, the sequence's own namespace is used. If you pass this method a sequence which is not a RichSequence, it will attempt to convert it using RichSequence.Tools.enrich(). This does not guarantee a perfect conversion, so it is better to use RichSequences to start with.

Parameters:
seq - the sequence to write
ns - the namespace to write it with
Throws:
IOException - in case it couldn't write something

getDefaultFormat

public String getDefaultFormat()
getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.

Returns:
a String.