org.biojavax.bio.seq.io
Class EMBLxmlFormat

java.lang.Object
  extended by org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
      extended by org.biojavax.bio.seq.io.EMBLxmlFormat
All Implemented Interfaces:
SequenceFormat, RichSequenceFormat

public class EMBLxmlFormat
extends RichSequenceFormat.BasicFormat

Format reader for EMBLxml files. This version of the EMBLxml format reads and writes RichSequence objects. Loosely based on code from the old, deprecated org.biojava.bio.seq.io.GenbankXmlFormat object. Understands the DTD at http://www.ebi.ac.uk/embl/Documentation/DTD/EMBL_dtd.txt

Since:
1.5
Author:
Alan Li (code based on his work), Richard Holland

Nested Class Summary
static class EMBLxmlFormat.Terms
          Implements some EMBLxml-specific terms.
 
Nested classes/interfaces inherited from interface org.biojavax.bio.seq.io.RichSequenceFormat
RichSequenceFormat.BasicFormat, RichSequenceFormat.HeaderlessFormat
 
Field Summary
protected static java.lang.String AUTHOR_TAG
           
protected static java.lang.String BASEPOSITION_EXTENT_ATTR
           
protected static java.lang.String BASEPOSITION_TAG
           
protected static java.lang.String BASEPOSITION_TYPE_ATTR
           
protected static java.lang.String CITATION_DATE_ATTR
           
protected static java.lang.String CITATION_FIRST_ATTR
           
protected static java.lang.String CITATION_ID_ATTR
           
protected static java.lang.String CITATION_INSTITUTE_ATTR
           
protected static java.lang.String CITATION_ISSUE_ATTR
           
protected static java.lang.String CITATION_LAST_ATTR
           
protected static java.lang.String CITATION_NAME_ATTR
           
protected static java.lang.String CITATION_PATENT_ATTR
           
protected static java.lang.String CITATION_PUB_ATTR
           
protected static java.lang.String CITATION_TAG
           
protected static java.lang.String CITATION_TYPE_ATTR
           
protected static java.lang.String CITATION_VOL_ATTR
           
protected static java.lang.String CITATION_YEAR_ATTR
           
protected static java.lang.String COMMENT_TAG
           
protected static java.lang.String COMNAME_TAG
           
protected static java.lang.String CONSORTIUM_TAG
           
protected static java.lang.String DBREF_DB_ATTR
           
protected static java.lang.String DBREF_PRIMARY_ATTR
           
protected static java.lang.String DBREF_SEC_ATTR
           
protected static java.lang.String DBREFERENCE_TAG
           
protected static java.lang.String DESC_TAG
           
protected static java.lang.String EDITOR_TAG
           
static java.lang.String EMBLXML_FORMAT
          The name of this format
protected static java.lang.String ENTRY_ACCESSION_ATTR
           
protected static java.lang.String ENTRY_CREATED_ATTR
           
protected static java.lang.String ENTRY_DIVISION_ATTR
           
protected static java.lang.String ENTRY_GROUP_TAG
           
protected static java.lang.String ENTRY_NAME_ATTR
           
protected static java.lang.String ENTRY_RELCREATED_ATTR
           
protected static java.lang.String ENTRY_RELUPDATED_ATTR
           
protected static java.lang.String ENTRY_TAG
           
protected static java.lang.String ENTRY_UPDATED_ATTR
           
protected static java.lang.String ENTRY_VER_ATTR
           
protected static java.lang.String FEATURE_NAME_ATTR
           
protected static java.lang.String FEATURE_TAG
           
protected static java.lang.String KEYWORD_TAG
           
protected static java.lang.String LINEAGE_TAG
           
protected static java.lang.String LOC_ELEMENT_ACC_ATTR
           
protected static java.lang.String LOC_ELEMENT_COMPL_ATTR
           
protected static java.lang.String LOC_ELEMENT_TYPE_ATTR
           
protected static java.lang.String LOC_ELEMENT_VER_ATTR
           
protected static java.lang.String LOCATION_COMPL_ATTR
           
protected static java.lang.String LOCATION_ELEMENT_TAG
           
protected static java.lang.String LOCATION_TAG
           
protected static java.lang.String LOCATION_TYPE_ATTR
           
protected static java.lang.String LOCATOR_TAG
           
protected static java.lang.String NAMESET_TAG
           
protected static java.lang.String ORGANISM_TAG
           
protected static java.lang.String PATENT_TAG
           
protected static java.lang.String QUALIFIER_NAME_ATTR
           
protected static java.lang.String QUALIFIER_TAG
           
protected static java.lang.String REF_POS_BEGIN_ATTR
           
protected static java.lang.String REF_POS_END_ATTR
           
protected static java.lang.String REFERENCE_POSITION_TAG
           
protected static java.lang.String REFERENCE_TAG
           
protected static java.lang.String SCINAME_TAG
           
protected static java.lang.String SEC_ACC_TAG
           
protected static java.lang.String SEQUENCE_LENGTH_ATTR
           
protected static java.lang.String SEQUENCE_TAG
           
protected static java.lang.String SEQUENCE_TOPOLOGY_ATTR
           
protected static java.lang.String SEQUENCE_TYPE_ATTR
           
protected static java.lang.String SEQUENCE_VER_ATTR
           
protected static java.lang.String TAXID_TAG
           
protected static java.lang.String TAXON_TAG
           
protected static java.lang.String TITLE_TAG
           
protected static java.util.regex.Pattern xmlSchema
           
 
Constructor Summary
EMBLxmlFormat()
           
 
Method Summary
 void beginWriting()
          Informs the writer that we want to start writing.
 boolean canRead(java.io.BufferedInputStream stream)
          Check to see if a given stream is in our format. A stream is in EMBLxml format if the second XML line contains the phrase "http://www.ebi.ac.uk/schema/EMBL_schema.xsd".
 boolean canRead(java.io.File file)
          Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in EMBLxml format if the second XML line contains the phrase "http://www.ebi.ac.uk/schema/EMBL_schema.xsd".
 void finishWriting()
          Informs the writer that we are done writing.
 java.lang.String getDefaultFormat()
          getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
 SymbolTokenization guessSymbolTokenization(java.io.BufferedInputStream stream)
          On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. Always returns a DNA tokenizer.
 SymbolTokenization guessSymbolTokenization(java.io.File file)
          On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the file. For formats that accept multiple tokenizations, it's up to you how you do it. Always returns a DNA tokenizer.
 boolean readRichSequence(java.io.BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns)
          Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols.
 boolean readSequence(java.io.BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener)
          Read a sequence and pass data on to a SeqIOListener.
 void writeSequence(Sequence seq, Namespace ns)
          Writes a sequence out to the outputstream given by beginWriting() using the default format of the implementing class. Namespace is ignored as EMBLxml has no concept of it.
 void writeSequence(Sequence seq, java.io.PrintStream os)
          writeSequence writes a sequence to the specified PrintStream, using the default format.
 void writeSequence(Sequence seq, java.lang.String format, java.io.PrintStream os)
          writeSequence writes a sequence to the specified PrintStream, using the specified format.
 
Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
getElideComments, getElideFeatures, getElideReferences, getElideSymbols, getLineWidth, getPrintStream, setElideComments, setElideFeatures, setElideReferences, setElideSymbols, setLineWidth, setPrintStream
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

EMBLXML_FORMAT

public static final java.lang.String EMBLXML_FORMAT
The name of this format

See Also:
Constant Field Values

ENTRY_GROUP_TAG

protected static final java.lang.String ENTRY_GROUP_TAG
See Also:
Constant Field Values

ENTRY_TAG

protected static final java.lang.String ENTRY_TAG
See Also:
Constant Field Values

ENTRY_ACCESSION_ATTR

protected static final java.lang.String ENTRY_ACCESSION_ATTR
See Also:
Constant Field Values

ENTRY_NAME_ATTR

protected static final java.lang.String ENTRY_NAME_ATTR
See Also:
Constant Field Values

ENTRY_DIVISION_ATTR

protected static final java.lang.String ENTRY_DIVISION_ATTR
See Also:
Constant Field Values

ENTRY_CREATED_ATTR

protected static final java.lang.String ENTRY_CREATED_ATTR
See Also:
Constant Field Values

ENTRY_RELCREATED_ATTR

protected static final java.lang.String ENTRY_RELCREATED_ATTR
See Also:
Constant Field Values

ENTRY_UPDATED_ATTR

protected static final java.lang.String ENTRY_UPDATED_ATTR
See Also:
Constant Field Values

ENTRY_RELUPDATED_ATTR

protected static final java.lang.String ENTRY_RELUPDATED_ATTR
See Also:
Constant Field Values

ENTRY_VER_ATTR

protected static final java.lang.String ENTRY_VER_ATTR
See Also:
Constant Field Values

SEC_ACC_TAG

protected static final java.lang.String SEC_ACC_TAG
See Also:
Constant Field Values

DESC_TAG

protected static final java.lang.String DESC_TAG
See Also:
Constant Field Values

KEYWORD_TAG

protected static final java.lang.String KEYWORD_TAG
See Also:
Constant Field Values

REFERENCE_TAG

protected static final java.lang.String REFERENCE_TAG
See Also:
Constant Field Values

CITATION_TAG

protected static final java.lang.String CITATION_TAG
See Also:
Constant Field Values

CITATION_ID_ATTR

protected static final java.lang.String CITATION_ID_ATTR
See Also:
Constant Field Values

CITATION_TYPE_ATTR

protected static final java.lang.String CITATION_TYPE_ATTR
See Also:
Constant Field Values

CITATION_DATE_ATTR

protected static final java.lang.String CITATION_DATE_ATTR
See Also:
Constant Field Values

CITATION_NAME_ATTR

protected static final java.lang.String CITATION_NAME_ATTR
See Also:
Constant Field Values

CITATION_VOL_ATTR

protected static final java.lang.String CITATION_VOL_ATTR
See Also:
Constant Field Values

CITATION_ISSUE_ATTR

protected static final java.lang.String CITATION_ISSUE_ATTR
See Also:
Constant Field Values

CITATION_FIRST_ATTR

protected static final java.lang.String CITATION_FIRST_ATTR
See Also:
Constant Field Values

CITATION_LAST_ATTR

protected static final java.lang.String CITATION_LAST_ATTR
See Also:
Constant Field Values

CITATION_PUB_ATTR

protected static final java.lang.String CITATION_PUB_ATTR
See Also:
Constant Field Values

CITATION_PATENT_ATTR

protected static final java.lang.String CITATION_PATENT_ATTR
See Also:
Constant Field Values

CITATION_INSTITUTE_ATTR

protected static final java.lang.String CITATION_INSTITUTE_ATTR
See Also:
Constant Field Values

CITATION_YEAR_ATTR

protected static final java.lang.String CITATION_YEAR_ATTR
See Also:
Constant Field Values

DBREFERENCE_TAG

protected static final java.lang.String DBREFERENCE_TAG
See Also:
Constant Field Values

DBREF_DB_ATTR

protected static final java.lang.String DBREF_DB_ATTR
See Also:
Constant Field Values

DBREF_PRIMARY_ATTR

protected static final java.lang.String DBREF_PRIMARY_ATTR
See Also:
Constant Field Values

DBREF_SEC_ATTR

protected static final java.lang.String DBREF_SEC_ATTR
See Also:
Constant Field Values

CONSORTIUM_TAG

protected static final java.lang.String CONSORTIUM_TAG
See Also:
Constant Field Values

TITLE_TAG

protected static final java.lang.String TITLE_TAG
See Also:
Constant Field Values

EDITOR_TAG

protected static final java.lang.String EDITOR_TAG
See Also:
Constant Field Values

AUTHOR_TAG

protected static final java.lang.String AUTHOR_TAG
See Also:
Constant Field Values

PATENT_TAG

protected static final java.lang.String PATENT_TAG
See Also:
Constant Field Values

LOCATOR_TAG

protected static final java.lang.String LOCATOR_TAG
See Also:
Constant Field Values

REFERENCE_POSITION_TAG

protected static final java.lang.String REFERENCE_POSITION_TAG
See Also:
Constant Field Values

REF_POS_BEGIN_ATTR

protected static final java.lang.String REF_POS_BEGIN_ATTR
See Also:
Constant Field Values

REF_POS_END_ATTR

protected static final java.lang.String REF_POS_END_ATTR
See Also:
Constant Field Values

COMMENT_TAG

protected static final java.lang.String COMMENT_TAG
See Also:
Constant Field Values

FEATURE_TAG

protected static final java.lang.String FEATURE_TAG
See Also:
Constant Field Values

FEATURE_NAME_ATTR

protected static final java.lang.String FEATURE_NAME_ATTR
See Also:
Constant Field Values

ORGANISM_TAG

protected static final java.lang.String ORGANISM_TAG
See Also:
Constant Field Values

NAMESET_TAG

protected static final java.lang.String NAMESET_TAG
See Also:
Constant Field Values

SCINAME_TAG

protected static final java.lang.String SCINAME_TAG
See Also:
Constant Field Values

COMNAME_TAG

protected static final java.lang.String COMNAME_TAG
See Also:
Constant Field Values

TAXID_TAG

protected static final java.lang.String TAXID_TAG
See Also:
Constant Field Values

LINEAGE_TAG

protected static final java.lang.String LINEAGE_TAG
See Also:
Constant Field Values

TAXON_TAG

protected static final java.lang.String TAXON_TAG
See Also:
Constant Field Values

QUALIFIER_TAG

protected static final java.lang.String QUALIFIER_TAG
See Also:
Constant Field Values

QUALIFIER_NAME_ATTR

protected static final java.lang.String QUALIFIER_NAME_ATTR
See Also:
Constant Field Values

LOCATION_TAG

protected static final java.lang.String LOCATION_TAG
See Also:
Constant Field Values

LOCATION_TYPE_ATTR

protected static final java.lang.String LOCATION_TYPE_ATTR
See Also:
Constant Field Values

LOCATION_COMPL_ATTR

protected static final java.lang.String LOCATION_COMPL_ATTR
See Also:
Constant Field Values

LOCATION_ELEMENT_TAG

protected static final java.lang.String LOCATION_ELEMENT_TAG
See Also:
Constant Field Values

LOC_ELEMENT_TYPE_ATTR

protected static final java.lang.String LOC_ELEMENT_TYPE_ATTR
See Also:
Constant Field Values

LOC_ELEMENT_ACC_ATTR

protected static final java.lang.String LOC_ELEMENT_ACC_ATTR
See Also:
Constant Field Values

LOC_ELEMENT_VER_ATTR

protected static final java.lang.String LOC_ELEMENT_VER_ATTR
See Also:
Constant Field Values

LOC_ELEMENT_COMPL_ATTR

protected static final java.lang.String LOC_ELEMENT_COMPL_ATTR
See Also:
Constant Field Values

BASEPOSITION_TAG

protected static final java.lang.String BASEPOSITION_TAG
See Also:
Constant Field Values

BASEPOSITION_TYPE_ATTR

protected static final java.lang.String BASEPOSITION_TYPE_ATTR
See Also:
Constant Field Values

BASEPOSITION_EXTENT_ATTR

protected static final java.lang.String BASEPOSITION_EXTENT_ATTR
See Also:
Constant Field Values

SEQUENCE_TAG

protected static final java.lang.String SEQUENCE_TAG
See Also:
Constant Field Values

SEQUENCE_TYPE_ATTR

protected static final java.lang.String SEQUENCE_TYPE_ATTR
See Also:
Constant Field Values

SEQUENCE_LENGTH_ATTR

protected static final java.lang.String SEQUENCE_LENGTH_ATTR
See Also:
Constant Field Values

SEQUENCE_TOPOLOGY_ATTR

protected static final java.lang.String SEQUENCE_TOPOLOGY_ATTR
See Also:
Constant Field Values

SEQUENCE_VER_ATTR

protected static final java.lang.String SEQUENCE_VER_ATTR
See Also:
Constant Field Values

xmlSchema

protected static final java.util.regex.Pattern xmlSchema
Constructor Detail

EMBLxmlFormat

public EMBLxmlFormat()
Method Detail

canRead

public boolean canRead(java.io.File file)
                throws java.io.IOException
Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in EMBLxml format if the second XML line contains the phrase "http://www.ebi.ac.uk/schema/EMBL_schema.xsd".

Specified by:
canRead in interface RichSequenceFormat
Overrides:
canRead in class RichSequenceFormat.BasicFormat
Parameters:
file - the File to check.
Returns:
true if the file is readable by this format, false if not.
Throws:
java.io.IOException - in case the file is inaccessible.
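
The check described above can be sketched with the standard library alone. This is a minimal illustration, not the class's actual implementation: the class name SchemaLineSketch, the method secondLineMatches and the exact regular expression are assumptions made for the example. It reads two lines and tests the second against a pattern built from the schema URL quoted above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BioJava.
final class SchemaLineSketch {
    // Assumption: any line that mentions the schema URL counts as a match.
    static final Pattern SCHEMA =
            Pattern.compile(".*http://www\\.ebi\\.ac\\.uk/schema/EMBL_schema\\.xsd.*");

    // Returns true if the second line of the input mentions the schema URL.
    static boolean secondLineMatches(Reader in) {
        try {
            BufferedReader br = new BufferedReader(in);
            br.readLine();                  // skip the first line (usually the XML declaration)
            String second = br.readLine();  // null if the input has fewer than two lines
            return second != null && SCHEMA.matcher(second).matches();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a File, the same helper could be called with a java.io.FileReader wrapped around it.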

guessSymbolTokenization

public SymbolTokenization guessSymbolTokenization(java.io.File file)
                                           throws java.io.IOException
On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the file. For formats that accept multiple tokenizations, it's up to you how you do it. Always returns a DNA tokenizer.

Specified by:
guessSymbolTokenization in interface RichSequenceFormat
Overrides:
guessSymbolTokenization in class RichSequenceFormat.BasicFormat
Parameters:
file - the File object to guess the format of.
Returns:
a SymbolTokenization to read the file with.
Throws:
java.io.IOException - if the file is unrecognisable or inaccessible.

canRead

public boolean canRead(java.io.BufferedInputStream stream)
                throws java.io.IOException
Check to see if a given stream is in our format. A stream is in EMBLxml format if the second XML line contains the phrase "http://www.ebi.ac.uk/schema/EMBL_schema.xsd".

Parameters:
stream - the BufferedInputStream to check.
Returns:
true if the stream is readable by this format, false if not.
Throws:
java.io.IOException - in case the stream is inaccessible.
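
When the input is a stream, a caller may want to test the format without consuming data. One way to do that is with mark() and reset(), sketched below. Whether this class restores the stream position is not stated here, so treat the behaviour as an assumption of the example. StreamPeekSketch, peekSecondLine and the 2000-byte limit are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of BioJava.
final class StreamPeekSketch {
    static final String SCHEMA_URL = "http://www.ebi.ac.uk/schema/EMBL_schema.xsd";
    // Assumed upper bound, in bytes, on the combined length of the first two lines.
    static final int PEEK_LIMIT = 2000;

    // Looks at the second line of the stream, then rewinds to where it started.
    static boolean peekSecondLine(BufferedInputStream stream) {
        try {
            stream.mark(PEEK_LIMIT);
            byte[] buf = new byte[PEEK_LIMIT];
            int total = 0;
            int n;
            // Read at most PEEK_LIMIT bytes so that reset() stays valid.
            while (total < PEEK_LIMIT
                    && (n = stream.read(buf, total, PEEK_LIMIT - total)) > 0) {
                total += n;
            }
            stream.reset();
            String head = new String(buf, 0, total, StandardCharsets.UTF_8);
            String[] lines = head.split("\r?\n", 3);
            return lines.length >= 2 && lines[1].contains(SCHEMA_URL);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the stream is rewound, calling peekSecondLine twice on the same stream gives the same answer, and a parser can then read from the beginning.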

guessSymbolTokenization

public SymbolTokenization guessSymbolTokenization(java.io.BufferedInputStream stream)
                                           throws java.io.IOException
On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the stream. For formats that accept multiple tokenizations, it's up to you how you do it. Always returns a DNA tokenizer.

Parameters:
stream - the BufferedInputStream object to guess the format of.
Returns:
a SymbolTokenization to read the stream with.
Throws:
java.io.IOException - if the stream is unrecognisable or inaccessible.

readSequence

public boolean readSequence(java.io.BufferedReader reader,
                            SymbolTokenization symParser,
                            SeqIOListener listener)
                     throws IllegalSymbolException,
                            java.io.IOException,
                            ParseException
Read a sequence and pass data on to a SeqIOListener.

Parameters:
reader - The stream of data to parse.
symParser - A SymbolTokenization defining a mapping from character data to Symbols.
listener - A listener to notify when data is extracted from the stream.
Returns:
a boolean indicating whether or not the stream contains any more sequences.
Throws:
IllegalSymbolException - if it is not possible to translate character data from the stream into valid BioJava symbols.
java.io.IOException - if an error occurs while reading from the stream.
ParseException

readRichSequence

public boolean readRichSequence(java.io.BufferedReader reader,
                                SymbolTokenization symParser,
                                RichSeqIOListener rlistener,
                                Namespace ns)
                         throws IllegalSymbolException,
                                java.io.IOException,
                                ParseException
Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols. Events are passed to the listener, and the namespace used for sequences read is the one given. If the namespace is null, then the default namespace for the parser is used, which may depend on individual implementations of this interface.

Parameters:
reader - the input source
symParser - the tokenizer which understands the sequence being read
rlistener - the listener to send sequence events to
ns - the namespace to read sequences into.
Returns:
true if there is more to read after this, false otherwise.
Throws:
IllegalSymbolException - if the tokenizer couldn't understand one of the sequence symbols in the file.
java.io.IOException - if there was a read error.
ParseException
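
The boolean return value lends itself to a do/while loop: read one record, then continue only while the reader reports that more input follows. The sketch below shows that calling pattern with a toy stand-in, where each record is a single line of text. ReadLoopSketch, readOne and readAll are hypothetical names and do not parse EMBLxml.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a sequence format; not part of BioJava.
final class ReadLoopSketch {
    // Consumes one record (here: one line) and reports whether more input follows.
    static boolean readOne(BufferedReader reader, List<String> sink) {
        try {
            String line = reader.readLine();
            if (line != null) {
                sink.add(line);
            }
            reader.mark(1);                     // peek one character ahead
            boolean more = reader.read() != -1;
            if (more) {
                reader.reset();                 // put the peeked character back
            }
            return more;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Calls readOne until it returns false, mirroring the "more to read" contract.
    static List<String> readAll(BufferedReader reader) {
        List<String> records = new ArrayList<>();
        boolean more;
        do {
            more = readOne(reader, records);
        } while (more);
        return records;
    }
}
```

The do/while form matters because the first call must be made before anything is known about the input.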

beginWriting

public void beginWriting()
                  throws java.io.IOException
Informs the writer that we want to start writing. This will do any initialisation required, such as writing the opening tags of an XML file that groups sequences together.

Throws:
java.io.IOException - if writing fails.

finishWriting

public void finishWriting()
                   throws java.io.IOException
Informs the writer that we are done writing. This will do any finalisation required, such as writing the closing tags of an XML file that groups sequences together.

Throws:
java.io.IOException - if writing fails.
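
The begin/write/finish lifecycle can be illustrated with a toy writer that emits an opening group tag, one element per entry, and a closing group tag. This is a sketch under stated assumptions: GroupWriterSketch, writeEntry, render and the <group> and <entry> element names are invented for the example and are not the tags this class writes.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.List;

// Hypothetical writer showing the begin/write/finish call order; not part of BioJava.
final class GroupWriterSketch {
    private final PrintStream out;

    GroupWriterSketch(PrintStream out) {
        this.out = out;
    }

    // Opens the element that groups all entries together.
    void beginWriting() {
        out.print("<group>\n");
    }

    // Writes one entry. No XML escaping is done; names are assumed to be plain text.
    void writeEntry(String name) {
        out.print("  <entry name=\"" + name + "\"/>\n");
    }

    // Closes the grouping element and flushes the stream.
    void finishWriting() {
        out.print("</group>\n");
        out.flush();
    }

    // Runs the full lifecycle and returns the document as a string.
    static String render(List<String> names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        GroupWriterSketch writer = new GroupWriterSketch(new PrintStream(bytes));
        writer.beginWriting();
        for (String name : names) {
            writer.writeEntry(name);
        }
        writer.finishWriting();
        return bytes.toString();
    }
}
```

Skipping beginWriting() or finishWriting() in this sketch would leave the output without its enclosing element, which is why the two calls bracket the per-sequence writes.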

writeSequence

public void writeSequence(Sequence seq,
                          java.io.PrintStream os)
                   throws java.io.IOException
writeSequence writes a sequence to the specified PrintStream, using the default format.

Parameters:
seq - the sequence to write out.
os - the printstream to write to.
Throws:
java.io.IOException

writeSequence

public void writeSequence(Sequence seq,
                          java.lang.String format,
                          java.io.PrintStream os)
                   throws java.io.IOException
writeSequence writes a sequence to the specified PrintStream, using the specified format.

Parameters:
seq - a Sequence to write out.
format - a String indicating which sub-format of those available from a particular SequenceFormat implementation to use when writing.
os - a PrintStream object.
Throws:
java.io.IOException - if an error occurs.

writeSequence

public void writeSequence(Sequence seq,
                          Namespace ns)
                   throws java.io.IOException
Writes a sequence out to the outputstream given by beginWriting() using the default format of the implementing class. If namespace is given, sequences will be written with that namespace, otherwise they will be written with the default namespace of the implementing class (which is usually the namespace of the sequence itself). If you pass this method a sequence which is not a RichSequence, it will attempt to convert it using RichSequence.Tools.enrich(). This does not guarantee a perfect conversion, so it is better to use RichSequences from the start. For this class, the namespace is ignored, as EMBLxml has no concept of it.

Parameters:
seq - the sequence to write
ns - the namespace to write it with
Throws:
java.io.IOException - in case it couldn't write something

getDefaultFormat

public java.lang.String getDefaultFormat()
getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.

Returns:
a String.