frename

Safe HaskellSafe-Infered

Bio.Sequence.SeqData

Contents

Description

Data structures for manipulating (biological) sequences.

Generally supports both nucleotide and protein sequences, some functions, like revcompl, only makes sense for nucleotides.

Synopsis

Data structure

A sequence is a header, sequence data itself, and optional quality data. Sequences are type-tagged to identify them as nucleotide, amino acids, or unknown type. All items are lazy bytestrings. The Offset type can be used for indexing.

If you use overloaded strings (e.g., ghc -XOverloadedString), you can easily construct sequences from string literals.

data Sequence t

A sequence consists of a header, the sequence data itself, and optional quality data. The type parameter is a phantom type to separate nucleotide and amino acid sequences

Constructors

Seq !SeqData !SeqData !(Maybe QualData)

header and actual sequence

Instances

Eq (Sequence t) 
Show (Sequence a)

A more arranged show instance for Sequences reassembling the display of the fasta-format

type Offset = Int64

An offset, index, or length of a SeqData

type SeqData = ByteString

The basic data type used in Sequences

Quality data is normally associated with nucleotide sequences

type Qual = Word8

Basic type for quality data. Range 0..255. Typical Phred output is in the range 6..50, with 20 as the line in the sand separating good from bad.

type QualData = ByteString

Quality data is a Qual vector, currently implemented as a ByteString.

Accessor functions

(!) :: Sequence a -> Offset -> Char

Read the character at the specified position in the sequence.

seqlength :: Sequence a -> Offset

Return sequence length.

seqlabel :: Sequence a -> SeqData

Return sequence label (first word of header)

seqheader :: Sequence a -> SeqData

Return full header.

seqdata :: Sequence a -> SeqData

Return the sequence data.

(?) :: Sequence a -> Offset -> Qual

hasqual :: Sequence a -> Bool

Check whether the sequence has associated quality data.

seqqual :: Sequence a -> QualData

Return the quality data, or error if none exist. Use hasqual if in doubt.

Adding information to header

appendHeader, setHeader :: Sequence a -> String -> Sequence a

Modify the header by appending text, or by replacing all but the sequence label (i.e. first word).

Converting to and from [Char]

It is probably better to use the IsString class from String for this.

fromStr :: String -> SeqData

Convert a String to SeqData

toStr :: SeqData -> String

Convert a SeqData to a String

Sequence utilities

defragSeq :: Sequence t -> Sequence t

Returns a sequence with all internal storage freshly copied and with sequence and quality data present as a single chunk.

By freshly copying internal storage, defragSeq allows garbage collection of the original data source whence the sequence was read; otherwise, use of just a short sequence name can cause an entire sequence file buffer to be retained.

By compacting sequence data into a single chunk, defragSeq avoids linear-time traversal of sequence chunks during random access into sequence data.

seqmap :: ((Char, Qual) -> (Char, Qual)) -> Sequence t -> Sequence t

map over sequences, treating them as a sequence of (char,word8) pairs. This will work on sequences without quality, as long as the function doesn't try to examine it. The current implementation is not very efficient.

castSeq :: Sequence a -> Sequence b

Phantom type functionality, unchecked conversion between sequence types

Nucleotide functionality

Nucleotide sequences contain the alphabet [A,C,G,T]. IUPAC specifies an extended nucleotide alphabet with wildcards, but it is not supported at this point.

compl :: Char -> Char

Complement a single character. I.e. identify the nucleotide it can hybridize with. Note that for multiple nucleotides, you usually want the reverse complement (see revcompl for that).

revcompl :: Sequence Nuc -> Sequence Nuc

Calculate the reverse complement. This is only relevant for the nucleotide alphabet, and it leaves other characters unmodified.

revcompl' :: SeqData -> SeqData

Calculate the reverse complent for SeqData only.

data Nuc

For type tagging sequences (protein sequences use Amino below)

Protein functionality

Proteins are chains of amino acids, represented by the IUPAC alphabet.

data Amino

Constructors

Ala 
Arg 
Asn 
Asp 
Cys 
Gln 
Glu 
Gly 
His 
Ile 
Leu 
Lys 
Met 
Phe 
Pro 
Ser 
Thr 
Tyr 
Trp 
Val 
STP 
Asx 
Glx 
Xle 
Xaa 

Instances

translate :: Sequence Nuc -> Offset -> [Amino]

Translate a nucleotide sequence into the corresponding protein sequence. This works rather blindly, with no attempt to identify ORFs or otherwise QA the result.

fromIUPAC :: SeqData -> [Amino]

Convert a sequence in IUPAC format to a list of amino acids.

toIUPAC :: [Amino] -> SeqData

Convert a list of amino acids to a sequence in IUPAC format.

Display a nicely formated sequence.

putSeqLn :: Sequence a -> Int -> Int -> [(Int, Int)] -> IO ()

A simple function to display a sequence: we generate the sequence string and | call putStrLn

seqToStr :: Sequence a -> Int -> Int -> [(Int, Int)] -> [Char]

Returns a properly formatted and probably highlighted string | representation of a sequence. Highlighting is done using ANSI-Escape | sequences.

Default type for sequences

data Unknown