EMBOSS: wordmatch


Program wordmatch

Function

Finds all exact matches of a given size between 2 sequences

Description

Finds all exact matches of a given minimum size between two sequences, displaying the start point in each sequence and the match length.

This program takes two sequences and finds regions where they are identical. These regions are reported in the output file and, optionally, in GFF (General Feature Format) files.

It will not find identical regions smaller than the specified wordsize.
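The behaviour described above can be illustrated with a minimal Python sketch. It is not the EMBOSS implementation: the function name `find_exact_matches` is hypothetical, and the way wordmatch orders results or handles overlapping regions may differ. The sketch indexes every word of the second sequence, looks up each word of the first, and extends each hit to the right; hits that merely continue an earlier match are skipped so that each identical region is reported once.

```python
def find_exact_matches(seq_a, seq_b, wordsize=4):
    """Return (start_a, start_b, length) tuples for identical regions.

    Positions are 1-based, as in the wordmatch report. This is an
    illustrative sketch, not the algorithm used by EMBOSS itself.
    """
    if wordsize < 2:
        raise ValueError("wordsize must be 2 or more")

    # Index every word of seq_b by its 0-based start position.
    index = {}
    for j in range(len(seq_b) - wordsize + 1):
        index.setdefault(seq_b[j:j + wordsize], []).append(j)

    matches = []
    for i in range(len(seq_a) - wordsize + 1):
        for j in index.get(seq_a[i:i + wordsize], ()):
            # If the preceding residues also match, this hit lies inside
            # a region that started earlier and was already reported.
            if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
                continue
            # Extend the match to the right as far as the sequences agree.
            length = wordsize
            while (i + length < len(seq_a)
                   and j + length < len(seq_b)
                   and seq_a[i + length] == seq_b[j + length]):
                length += 1
            matches.append((i + 1, j + 1, length))
    return matches


if __name__ == "__main__":
    for start_a, start_b, length in find_exact_matches("ACGTACGT", "TTACGTAA"):
        print(start_a, start_b, length)
```

Regions shorter than `wordsize` never appear in the result because only whole words are looked up in the index.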

Usage

Here is a sample session with wordmatch.

% wordmatch sw:hba_human sw:hbb_human
Output file [hba_human.wordmatch]: 
Word size [4]: 

Command line arguments

   Mandatory qualifiers:
  [-asequence]         sequence   Sequence USA
  [-bsequence]         sequence   Sequence USA
   -wordsize           integer    Word size
  [-outfile]           align      (no help text) align value

   Optional qualifiers: (none)
   Advanced qualifiers:
   -afeatout           featout    File for output of normal tab delimited GFF
                                  features
   -bfeatout           featout    File for output of normal tab delimited GFF
                                  features

   General qualifiers:
   -help               bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Qualifier                   Description                                           Allowed values           Default
Mandatory qualifiers
[-asequence] (Parameter 1)  Sequence USA                                          Readable sequence        Required
[-bsequence] (Parameter 2)  Sequence USA                                          Readable sequence        Required
-wordsize                   Word size                                             Integer 2 or more        4
[-outfile] (Parameter 3)    (no help text) align value                            Alignment file
Optional qualifiers
(none)
Advanced qualifiers
-afeatout                   File for output of normal tab delimited GFF features  Writeable feature table  unknown.gff
-bfeatout                   File for output of normal tab delimited GFF features  Writeable feature table  unknown.gff
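When wordmatch is called from a script, the qualifiers listed above can be supplied explicitly rather than answered at the prompts. The following Python sketch assembles such a command line. The helper name `build_wordmatch_command` is hypothetical; only qualifiers documented on this page are used, and the example assumes the `wordmatch` executable is on the PATH.

```python
def build_wordmatch_command(asequence, bsequence, outfile, wordsize=4,
                            afeatout=None, bfeatout=None):
    """Build an argument list for wordmatch from the documented qualifiers."""
    if wordsize < 2:
        # The qualifier table gives "Integer 2 or more" for -wordsize.
        raise ValueError("wordsize must be 2 or more")
    cmd = [
        "wordmatch",
        "-asequence", asequence,
        "-bsequence", bsequence,
        "-wordsize", str(wordsize),
        "-outfile", outfile,
    ]
    # The GFF feature outputs are advanced qualifiers, so add them only on request.
    if afeatout is not None:
        cmd += ["-afeatout", afeatout]
    if bfeatout is not None:
        cmd += ["-bfeatout", bfeatout]
    return cmd


if __name__ == "__main__":
    cmd = build_wordmatch_command("sw:hba_human", "sw:hbb_human",
                                  "hba_human.wordmatch")
    print(" ".join(cmd))
    # To run it (requires an EMBOSS installation):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with USAs that contain characters such as colons or brackets.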

Input file format

Any two sequence USAs (Uniform Sequence Addresses) of the same type (DNA or protein).

Output file format

The file produced in the above example is:


FINALLY length = 3
 HBA_HUMAN  HBB_HUMAN Length
        58          63          5
        14          15          4
       116         121          4

The first line ('FINALLY...') gives the number of regions found.

The next line gives the headers for the subsequent columns of data. It consists of the names of the two sequences and the word 'Length'.

Subsequent lines consist of three columns of numbers separated by spaces or TAB characters. Each line describes one identical region. The first number is the start position of the region in the first sequence. The second number is the start position in the second sequence. The third number is the length of the identical region.

If no regions are found, the output file is blank.
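The layout described above is simple to read back into a program. The Python sketch below parses a report in exactly the form shown in this section; the function name `parse_wordmatch` and the returned dictionary keys are hypothetical, and other alignment output formats are not handled.

```python
def parse_wordmatch(text):
    """Parse a wordmatch report laid out as in the example above.

    Returns a dict with the region count from the first line, the two
    sequence names from the header line, and a list of
    (start_a, start_b, length) tuples.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        # A blank file means no regions were found.
        return {"count": 0, "names": (), "matches": []}

    # First line, e.g. "FINALLY length = 3": the number of regions.
    count = int(lines[0].split("=")[1])

    # Second line: the two sequence names followed by the word "Length".
    names = tuple(lines[1].split()[:2])

    # Remaining lines: three whitespace- or TAB-separated integers each.
    matches = [tuple(int(field) for field in line.split())
               for line in lines[2:]]
    return {"count": count, "names": names, "matches": matches}


if __name__ == "__main__":
    sample = """FINALLY length = 3
 HBA_HUMAN  HBB_HUMAN Length
        58          63          5
        14          15          4
       116         121          4
"""
    print(parse_wordmatch(sample))
```

Comparing `count` with `len(matches)` gives a quick check that a report was not truncated.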

Data files

None.

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

0 if successful.

Known bugs

None.

See also

Program name  Description
matcher       Finds the best local alignments between two sequences
seqmatchall   Does an all-against-all comparison of a set of sequences
supermatcher  Finds a match of a large sequence against one or more sequences
water         Smith-Waterman local alignment

Author(s)

This application was written by Ian Longden (il@sanger.ac.uk), Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

History

Completed 27th November 1998.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments