EMBOSS: newcpgseek


Program newcpgseek

Function

Reports CpG rich regions

Description

newcpgseek reports CpG rich regions of a sequence as candidate CpG islands.

CpG refers to a C nucleotide immediately followed by a G. The 'p' in 'CpG' refers to the phosphate group linking the two bases.

Detection of regions of genomic sequences that are rich in the CpG pattern is important because such regions are resistant to methylation and tend to be associated with genes which are frequently switched on. Regions rich in the CpG pattern are known as CpG islands.

It has been estimated that about half of all mammalian genes have a CpG-rich region around their 5' end. It is often said that all mammalian housekeeping genes have a CpG island.

Non-mammalian vertebrates have some CpG islands that are associated with genes, but the association becomes less clear in more distantly related taxonomic groups.

Finding a CpG island upstream of predicted exons or genes is good contributory evidence in support of the prediction.

By default, this program defines a CpG island as a region where, over an average of 10 windows, the calculated % composition is over 50% and the calculated Obs/Exp ratio is over 0.6 and the conditions hold for a minimum of 200 bases. These conditions can be modified by setting the values of the appropriate parameters.

The Expected number of CpG patterns in a window is calculated as the number of 'C's in the window multiplied by the number of 'G's in the window, divided by the window length.
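The calculation described above can be sketched for a single window as follows. This is a minimal illustration, not the program's own source code: the function names are hypothetical, "% composition" is assumed to mean the combined C and G content of the window, and the averaging over 10 windows and the 200-base minimum length are not implemented.

```python
def cpg_window_stats(window: str) -> dict:
    """Return CpG statistics for one window of nucleotide sequence."""
    seq = window.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("window must not be empty")

    c_count = seq.count("C")
    g_count = seq.count("G")
    observed = seq.count("CG")  # a C immediately followed by a G

    # Expected = (number of C's * number of G's) / window length
    expected = c_count * g_count / length
    ratio = observed / expected if expected > 0 else 0.0

    # Assumption: "% composition" is taken to be the C+G percentage.
    percent_cg = 100.0 * (c_count + g_count) / length

    return {
        "observed": observed,
        "expected": expected,
        "obs_exp": ratio,
        "percent_cg": percent_cg,
    }


def looks_cpg_rich(window: str, min_percent: float = 50.0,
                   min_ratio: float = 0.6) -> bool:
    """Apply the default thresholds (over 50% and over 0.6) to one window."""
    stats = cpg_window_stats(window)
    return stats["percent_cg"] > min_percent and stats["obs_exp"] > min_ratio
```

For example, the 10-base window ACGTCGCGAA contains three C's, three G's and three CpG patterns, so the expected count is 3 * 3 / 10 = 0.9 and the Obs/Exp ratio is 3 / 0.9, or about 3.33.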

This program reads in one or more sequences and finds regions where there is a high absolute frequency of CpG dimers as well as a high proportion of CpG compared to GpC.

Usage

Here is a sample session with newcpgseek.

% newcpgseek
Input sequence: embl:rnu68037
CpG score [17]: 
Output file [rnu68037.newcpgseek]: 
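
The same run can also be started from a script instead of the interactive prompts. The following is a minimal sketch that uses only the qualifiers listed under "Command line arguments"; the helper names build_newcpgseek_command and run_newcpgseek are hypothetical, and the sketch assumes that the newcpgseek executable is on the PATH.

```python
import shutil
import subprocess


def build_newcpgseek_command(sequence: str, outfile: str, score: int = 17) -> list:
    """Build the argument list for a non-interactive newcpgseek run."""
    # The documented range for -score is 1 to 200.
    if not 1 <= score <= 200:
        raise ValueError("score must be between 1 and 200")
    return [
        "newcpgseek",
        "-sequence", sequence,
        "-score", str(score),
        "-outfile", outfile,
    ]


def run_newcpgseek(sequence: str, outfile: str, score: int = 17) -> None:
    """Run newcpgseek; assumes EMBOSS is installed and on the PATH."""
    if shutil.which("newcpgseek") is None:
        raise FileNotFoundError("newcpgseek executable not found on PATH")
    subprocess.run(build_newcpgseek_command(sequence, outfile, score), check=True)
```

Calling run_newcpgseek("embl:rnu68037", "rnu68037.newcpgseek") would correspond to the sample session above with the default score of 17.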

Command line arguments

   Mandatory qualifiers:
  [-sequence]          seqall     Sequence database USA
   -score              integer    CpG score
  [-outfile]           outfile    Output file name

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Qualifier        Description             Allowed values          Default

Mandatory qualifiers
[-sequence]      Sequence database USA   Readable sequence(s)    Required
(Parameter 1)
-score           CpG score               Integer from 1 to 200   17
[-outfile]       Output file name        Output file             <sequence>.newcpgseek
(Parameter 2)

Optional qualifiers
(none)

Advanced qualifiers
(none)

Input file format

A nucleic acid sequence.

Output file format

Here is the output from the example run:


NEWCPGSEEK of RNU68037 from 1 to 1218
with score > 17 

 Begin    End  Score        CpG  %CG  CG/GC
*    96   1032   630         87  66.1   0.65
  1072   1100    26          3  62.1   0.00
  1183   1193    26          2  72.7   2.00
-------------------------------------------
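
A report in this layout can be read back into a script with a small parser. The sketch below is based only on the example output shown above, so the layout it assumes (six whitespace-separated columns, an optional leading '*') is inferred rather than specified; the names CpgRegion and parse_newcpgseek are hypothetical. The meaning of the '*' marker is not documented here, so it is simply recorded as a flag.

```python
from typing import List, NamedTuple


class CpgRegion(NamedTuple):
    begin: int
    end: int
    score: int
    cpg: int
    percent_cg: float
    cg_gc: float
    starred: bool  # True if the row began with '*'


def parse_newcpgseek(text: str) -> List[CpgRegion]:
    """Extract the region rows from the text of a newcpgseek report."""
    regions = []
    for line in text.splitlines():
        stripped = line.strip()
        starred = stripped.startswith("*")
        fields = stripped.lstrip("*").split()
        if len(fields) != 6:
            continue  # title lines, separator line, blank lines
        try:
            begin, end, score, cpg = (int(f) for f in fields[:4])
            percent_cg = float(fields[4])
            cg_gc = float(fields[5])
        except ValueError:
            continue  # the column-header line has six non-numeric fields
        regions.append(
            CpgRegion(begin, end, score, cpg, percent_cg, cg_gc, starred)
        )
    return regions
```

Applied to the example output, this yields three regions, the first of which spans positions 96 to 1032 and carries the '*' flag.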

Data files

None.

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with a status of 0.

Known bugs

None.

See also

Program name    Description
cpgplot         Plot CpG rich areas
cpgreport       Reports all CpG rich regions
geecee          Calculates the fractional GC content of nucleic acid sequences
newcpgreport    Report CpG rich areas

Author(s)

This application was written by Rodrigo Lopez (rls@ebi.ac.uk) European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

History

Written (1999) - Rodrigo Lopez

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments