EMBOSS: diffseq


Program diffseq

Function

Find differences (SNPs) between nearly identical sequences

Description

diffseq takes two overlapping, nearly identical sequences and reports the differences between them, together with any features that overlap with these regions. GFF files of the differences in each sequence are also produced.

diffseq should be of value when looking for SNPs, differences between strains of an organism and anything else that requires the differences between sequences to be highlighted.

The sequences can be very long. The program does a match of all sequence words of size 10 (by default). It then reduces this to the minimum set of overlapping matches by sorting the matches in order of size (largest size first) and then for each such match it removes any smaller matches that overlap. The result is a set of the longest ungapped alignments between the two sequences that do not overlap with each other. The mismatched regions between these matches are reported.
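
The matching and reduction steps described above can be sketched in Python. This is only an illustration of the description, not the EMBOSS implementation. The function names word_matches, reduce_matches and mismatched_regions are hypothetical, positions are 0-based, two matches are assumed to overlap if they share positions in either sequence, and the kept matches are assumed to lie in the same order in both sequences.

```python
from collections import defaultdict


def word_matches(a, b, wordsize=10):
    """Return maximal ungapped exact matches as (length, a_start, b_start)."""
    # Index every word of length `wordsize` in b by its start position.
    index = defaultdict(list)
    for j in range(len(b) - wordsize + 1):
        index[b[j:j + wordsize]].append(j)

    matches = []
    for i in range(len(a) - wordsize + 1):
        for j in index.get(a[i:i + wordsize], ()):
            # Skip seeds that only continue a match that started one base earlier.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend the seed to the right for as long as the bases agree.
            length = wordsize
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((length, i, j))
    return matches


def reduce_matches(matches):
    """Take matches largest first and drop any match overlapping a kept one."""
    kept = []
    for length, i, j in sorted(matches, reverse=True):
        overlaps = any(
            (i < ki + kl and ki < i + length) or (j < kj + kl and kj < j + length)
            for kl, ki, kj in kept
        )
        if not overlaps:
            kept.append((length, i, j))
    # Return the kept matches in order of their position in the first sequence.
    return sorted(kept, key=lambda m: m[1])


def mismatched_regions(a, b, kept):
    """Return the (a_text, b_text) pairs lying between consecutive kept matches."""
    regions = []
    for (l1, i1, j1), (_, i2, j2) in zip(kept, kept[1:]):
        regions.append((a[i1 + l1:i2], b[j1 + l1:j2]))
    return regions


# Two toy sequences that differ by a single base, compared with a word size of 5.
seq_a = "ACGGTCATTG" + "A" + "CTAGGATCCA"
seq_b = "ACGGTCATTG" + "G" + "CTAGGATCCA"
kept = reduce_matches(word_matches(seq_a, seq_b, wordsize=5))
print(mismatched_regions(seq_a, seq_b, kept))
```

For the toy sequences the two flanking 10-base matches are kept and the single differing base between them is returned as the only mismatched region.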

It should be possible to find differences between sequences that are megabytes long.

Usage

Here is a sample session with diffseq:

% diffseq tembl:ap000504 tembl:af129756
Find differences (SNPs) between nearly identical sequences
Word size [10]: 
Output file [ap000504.diffseq]:
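
The same run can also be specified without prompts by giving every mandatory qualifier listed under "Command line arguments" on the command line. The following Python sketch only builds and prints such a command line; the helper name diffseq_command is hypothetical, and actually running the command assumes that EMBOSS is installed and that the tembl database is configured.

```python
import shlex


def diffseq_command(asequence, bsequence, outfile, wordsize=10):
    """Build an argument list that supplies all four mandatory qualifiers."""
    return [
        "diffseq",
        "-asequence", asequence,
        "-bsequence", bsequence,
        "-wordsize", str(wordsize),
        "-outfile", outfile,
    ]


cmd = diffseq_command("tembl:ap000504", "tembl:af129756", "ap000504.diffseq")
print(shlex.join(cmd))

# To run it (assumption: the diffseq executable is on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```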

Command line arguments

   Mandatory qualifiers:
  [-asequence]         sequence   Sequence USA
  [-bsequence]         sequence   Sequence USA
   -wordsize           integer    Word size
   -outfile            report     Output report file

   Optional qualifiers:
   -afeatout           featout    File for output of first sequence's normal
                                  tab delimited gff's
   -bfeatout           featout    File for output of second sequence's normal
                                  tab delimited gff's
   -columns            bool       The default format for the output report
                                  file is to have several lines per difference
                                  giving the sequence positions, sequences
                                  and features.
                                  If this option is set true then the output
                                  report file's format is changed to a set of
                                  columns and no feature information is given.

   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers
Qualifier                    Description          Allowed values      Default
[-asequence] (Parameter 1)   Sequence USA         Readable sequence   Required
[-bsequence] (Parameter 2)   Sequence USA         Readable sequence   Required
-wordsize                    Word size            Integer 2 or more   10
-outfile                     Output report file   Report file

Optional qualifiers
Qualifier   Description                                   Allowed values            Default
-afeatout   File for output of first sequence's normal    Writeable feature table   $(asequence.name).diffgff
            tab delimited gff's
-bfeatout   File for output of second sequence's normal   Writeable feature table   $(bsequence.name).diffgff
            tab delimited gff's
-columns    The default format for the output report      Yes/No                    No
            file is to have several lines per difference
            giving the sequence positions, sequences
            and features. If this option is set true
            then the output report file's format is
            changed to a set of columns and no feature
            information is given.

Advanced qualifiers
(none)

Input file format

This program reads in two nucleic acid sequence USAs or two protein sequence USAs.

Output file format

A report of the differences between the two sequences is produced, together with any features that overlap with these differing regions.

The output is a standard EMBOSS report file.

The results can be output in one of several styles by using the command-line qualifier -rformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: embl, genbank, gff, pir, swiss, trace, listfile, dbmotif, diffseq, excel, feattable, motif, regions, seqtable, simple, srs, table, tagseq

See: http://www.uk.embnet.org/Software/EMBOSS/Themes/ReportFormats.html for further information on report formats.

By default diffseq writes a 'diffseq' report file.


########################################
# Program: diffseq
# Rundate: Mon Feb 11 13:16:56 2002
# Report_file: ap000504.diffseq
# Additional_files: 2
# 1: AP000504.diffgff (Feature file for first sequence)
# 2: AF129756.diffgff (Feature file for second sequence)
########################################

#=======================================
#
# Sequence: AP000504 from: 1 to: 100000
# HitCount: 119
#
# Compare: AF129756 from: 1 to: 184666
#
# AP000504 overlap starts at 1
# AF129756 overlap starts at 6036
#
# (AP000504) start end length sequence
# (AF129756) start end length sequence
#
#
#
#=======================================

AP000504 847-847 Length: 1
Sequence: a
Sequence: t
AF129756 6882-6882 Length: 1

AP000504 1795-1795 Length: 1
Sequence: g
Sequence: a
AF129756 7830-7830 Length: 1

AP000504 2273-2273 Length: 1
Sequence: t
Sequence:
Feature: repeat_region 7920-8351 rpt_family='MSTB'
AF129756 8307 Length: 0

AP000504 2466-2466 Length: 1
Sequence: g
Sequence: a
Feature: repeat_region 8391-8686 rpt_family='AluSg'
AF129756 8500-8500 Length: 1

AP000504 2655-2658 Length: 4
Sequence: tgtg
Sequence:
Feature: repeat_region 8687-8731 rpt_family='(CA)n'
AF129756 8688 Length: 0

AP000504 4914 Length: 0
Sequence:
Sequence: gtgtgtgtgtgtgtgtgt
Feature: repeat_region 10910-10972 rpt_family='(CA)n'
AF129756 10945-10962 Length: 18

AP000504 4951-4953 Length: 3
Sequence: aaa
Sequence: tat
Feature: repeat_region 10991-11020 rpt_family='AT_rich'
AF129756 10999-11001 Length: 3

AP000504 6600-6600 Length: 1
Sequence: t
Sequence:
Feature: repeat_region 12628-12930 rpt_family='AluSq'
AF129756 12647 Length: 0

etc.

AP000504 97273-97274 Length: 2
Sequence: aa
Sequence:
Feature: repeat_region 103299-103402 rpt_family='AluSq'
AF129756 103302 Length: 0

AP000504 97716-97716 Length: 1
Sequence: a
Sequence: g
AF129756 103744-103744 Length: 1

AP000504 97827-97827 Length: 1
Sequence: c
Sequence: t
Feature: repeat_region 103784-104083 rpt_family='AluSx'
AF129756 103855-103855 Length: 1

#---------------------------------------
#
# Overlap_end: 100000 in AP000504
# Overlap_end: 106028 in AF129756
#
# SNP_count: 86
# Transitions: 58
# Transversions: 28
#
#
#---------------------------------------

The first line is the title giving the names of the sequences used.

The next two non-blank lines state the positions in each sequence where the detected overlap between them starts.

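There then follows a set of reports of the mismatches between the sequences.
Each report consists of 4 or more lines.

A report starts with the information for the first sequence: a line giving the sequence name, the position of the mismatch and its length, then any 'Feature:' lines, then a 'Sequence:' line giving the mismatched bases. When the mismatched region has length 0 in a sequence, a single position is given instead of a range.

This is followed by the equivalent information for the second sequence, but in the reverse order, namely a 'Sequence:' line, any 'Feature:' lines and a line giving the position of the mismatch in the second sequence.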

At the end of the report are two non-blank lines giving the positions in each sequence where the detected overlap between them ends.

The last three lines of the report give the count of SNPs (defined as a change of one nucleotide to one other nucleotide; deletions, insertions and multi-base changes are not counted).

The counts of transitions (pyrimidine to pyrimidine or purine to purine) and transversions (pyrimidine to purine or purine to pyrimidine) are also given.
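
The SNP, transition and transversion definitions above can be illustrated with a short Python sketch. The function name classify_snp is hypothetical and the code is not taken from diffseq; it only applies the definitions to the pair of 'Sequence:' strings of one report.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def classify_snp(first, second):
    """Return 'transition', 'transversion', or None if the pair is not a SNP."""
    x, y = first.lower(), second.lower()
    # A SNP is one nucleotide changed to one other nucleotide, so insertions,
    # deletions and multi-base changes are not classified.
    if len(x) != 1 or len(y) != 1 or x == y:
        return None
    bases = PURINES | PYRIMIDINES
    if x not in bases or y not in bases:
        return None
    # Same chemical class on both sides means transition, otherwise transversion.
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"


print(classify_snp("a", "t"))   # first difference in the sample report
print(classify_snp("g", "a"))   # second difference in the sample report
print(classify_snp("tgtg", ""))  # a deletion, so not counted as a SNP
```

Applied to the sample report, the a/t difference is a transversion, the g/a difference is a transition, and the tgtg deletion is not counted.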

It should be noted that not all features are reported.

The 'source' feature found in all EMBL/GenBank feature table entries is not reported: it covers the whole sequence, so it overlaps with every difference found in that sequence and is uninformative. It has therefore been removed from the output report.

The translation information of CDS features is often extremely long and does not add useful information to the report. It has therefore been removed from the output report.

If no regions of alignment are found, the following output is given:


########################################
# Program: diffseq
# Rundate: Mon Feb 11 13:21:20 2002
# Report_file: ap000504.diffseq
# Additional_files: 2
# 1: AP000504.diffgff (Feature file for first sequence)
# 2: fred.diffgff (Feature file for second sequence)
########################################

#=======================================
#
# Sequence: AP000504 from: 1 to: 100000
# HitCount: 0
#=======================================

#---------------------------------------
#
# No regions of alignment found.
#
#
#---------------------------------------

If the -rformat table qualifier is given then the output is given in a columnar format.

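The columns are separated by one or more spaces or TAB characters, in the order given by the header line of the report: USA, Start, End, Score, start, end, length, name, sequence, first_feature, second_feature.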

For example:


########################################
# Program: diffseq
# Rundate: Mon Feb 11 13:28:25 2002
# Report_file: ap000504.diffseq
# Additional_files: 2
# 1: AP000504.diffgff (Feature file for first sequence)
# 2: AF129756.diffgff (Feature file for second sequence)
########################################

#=======================================
#
# Sequence: AP000504 from: 1 to: 100000
# HitCount: 119
#
# Compare: AF129756 from: 1 to: 184666
#
# AP000504 overlap starts at 1
# AF129756 overlap starts at 6036
#
# (AP000504) start end length sequence
# (AF129756) start end length sequence
#
#
#
#=======================================

USA Start End Score start end length name sequence first_feature second_feature
AP000504 847 847 0.000 6882 6882 1 AF129756 t . .
AP000504 1795 1795 0.000 7830 7830 1 AF129756 a . .
AP000504 2273 2273 0.000 8307 8306 . AF129756 . . repeat_region 7920-8351 rpt_family='MSTB'
AP000504 2466 2466 0.000 8500 8500 1 AF129756 a . repeat_region 8391-8686 rpt_family='AluSg'
AP000504 2655 2658 0.000 8688 8687 . AF129756 . . repeat_region 8687-8731 rpt_family='(CA)n'
AP000504 4914 4913 0.000 10945 10962 18 AF129756 gtgtgtgtgtgtgtgtgt . repeat_region 10910-10972 rpt_family='(CA)n'
etc.
AP000504 93860 93860 0.000 99890 99890 1 AF129756 g . .
AP000504 95451 95451 0.000 101481 101481 1 AF129756 t . .
AP000504 96650 96650 0.000 102680 102680 1 AF129756 t . .
AP000504 97273 97274 0.000 103302 103301 . AF129756 . . repeat_region 103299-103402 rpt_family='AluSq'
AP000504 97716 97716 0.000 103744 103744 1 AF129756 g . .
AP000504 97827 97827 0.000 103855 103855 1 AF129756 t . repeat_region 103784-104083 rpt_family='AluSx'

#---------------------------------------
#
# Overlap_end: 100000 in AP000504
# Overlap_end: 106028 in AF129756
#
# SNP_count: 86
# Transitions: 58
# Transversions: 28
#
#
#---------------------------------------
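
A row of the columnar report can be split into fields with a few lines of Python. This is a minimal sketch that assumes each row occupies one line of the report file and that a '.' marks an empty value, as in the example above. The function name parse_table_row and the dictionary keys are hypothetical. Because feature descriptions themselves contain spaces, the two feature columns are left together as one unsplit string rather than guessed apart.

```python
def parse_table_row(line):
    """Split one data row of the columnar report into named fields."""
    # Split on runs of spaces or TABs, keeping everything after the
    # ninth column (the two feature columns) as a single string.
    fields = line.split(None, 9)
    if len(fields) < 10:
        raise ValueError("expected at least 10 whitespace-separated fields")
    (usa, a_start, a_end, score,
     b_start, b_end, length, name, sequence, features) = fields
    return {
        "usa": usa,
        "a_start": int(a_start),
        "a_end": int(a_end),
        "score": float(score),
        "b_start": int(b_start),
        "b_end": int(b_end),
        "length": None if length == "." else int(length),
        "name": name,
        "sequence": None if sequence == "." else sequence,
        "features": features.strip(),
    }


row = ("AP000504 2466 2466 0.000 8500 8500 1 AF129756 a . "
       "repeat_region 8391-8686 rpt_family='AluSg'")
print(parse_table_row(row))
```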

If no regions of alignment are found, the following output is given:


########################################
# Program: diffseq
# Rundate: Mon Feb 11 13:34:34 2002
# Report_file: ap000504.diffseq
# Additional_files: 2
# 1: AP000504.diffgff (Feature file for first sequence)
# 2: fred.diffgff (Feature file for second sequence)
########################################

#=======================================
#
# Sequence: AP000504 from: 1 to: 100000
# HitCount: 0
#=======================================

USA Start End Score start end length name sequence first_feature second_feature

#---------------------------------------
#
# No regions of alignment found.
#
#
#---------------------------------------

Data files

Notes

If you run out of memory, use a larger word size.

Using a larger word size increases the distance over which neighbouring mismatches are merged and reported as one event. Thus a word size of 50 will report two SNPs that are within 50 bases of each other as one mismatch.
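
The effect of the word size on merging can be sketched as follows. The reasoning is an assumption drawn from the description of the algorithm: the run of identical bases between two SNPs can only produce a word match if it is at least one word long, so shorter runs are absorbed into a single mismatched region. The function name merge_events is hypothetical, and sequence ends and repeated words are ignored.

```python
def merge_events(snp_positions, wordsize):
    """Group SNP positions into (start, end) events for a given word size."""
    events = []
    for pos in sorted(snp_positions):
        # Number of identical bases between this SNP and the previous event.
        if events and pos - events[-1][1] - 1 < wordsize:
            # Too short to hold a matching word: extend the previous event.
            events[-1][1] = pos
        else:
            events.append([pos, pos])
    return [tuple(event) for event in events]


positions = [100, 130, 400]
print(merge_events(positions, wordsize=10))
print(merge_events(positions, wordsize=50))
```

With a word size of 10 the three SNPs are reported separately; with a word size of 50 the SNPs at 100 and 130 are reported as one event spanning 100 to 130.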

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

A graphical dotplot of the matches used in this program can be displayed using the program dotpath.

Author(s)

This application was written by Gary Williams (gwilliam@hgmp.mrc.ac.uk)

History

Written 15th Aug 2000 - Gary Williams.
18th Aug 2000 - Added writing out GFF files of the mismatched regions

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments