A wide variety of sequence alignment formats are currently in use, leading to file-interconversion difficulties where diverse software packages are used. EMBOSS simplifies things by supporting most of the common alignment formats for input and output. This makes the interoperation with other sequence analysis packages easy. If your alignment is not in a recognised standard format you will first need to convert it into one.
An alignment format defines the permitted layout and content of text in a file. This includes required text tokens and formatting conventions. Typically, the aligned sequences, sequence identifier codes and sequence position numbers are given in some form. Non-printable control characters do not occur in any of the common alignment formats, making most of the formats suitable for viewing on screen or for printing out.
"clustal" is probably the most commonly used alignment format. It is generated by the Clustalw multiple sequence alignment program. Many variants of this format are in common use. The first line must begin with the text CLUSTAL
. The alignment is given in blocks of 50 residues with the aligned sequences appearing under each other. There are no sequence position numbers. Each sequence line begins with the sequence identifier code. For example:
    CLUSTAL W(1.83) multiple sequence alignment

    IXI_234         TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
    IXI_235         TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
    IXI_236         TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
    IXI_237         TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT

    IXI_234         GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
    IXI_235         GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAG
    IXI_236         GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR--G
    IXI_237         GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR--G

    IXI_234         SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
    IXI_235         SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
    IXI_236         SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
    IXI_237         SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE
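The layout rules described above (a first line beginning with CLUSTAL, then blocks in which each line starts with the sequence identifier) can be illustrated with a small reader. This is a minimal sketch, not the parser EMBOSS uses: the function name `parse_clustal` is hypothetical, and it assumes that identifiers contain no spaces and that any line starting with whitespace (such as a conservation line in some variants) can be ignored.

```python
def parse_clustal(text):
    """Return {identifier: aligned_sequence} from Clustal-format text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("first line must begin with the text CLUSTAL")
    sequences = {}
    for line in lines[1:]:
        # Skip blank separator lines and lines that start with whitespace
        # (assumed here to be conservation/annotation lines).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        # Each block contributes the next stretch of the same sequence.
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


example = """CLUSTAL W(1.83) multiple sequence alignment

IXI_234         TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235         TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT

IXI_234         GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
IXI_235         GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAG
"""

alignment = parse_clustal(example)
for name, seq in alignment.items():
    print(name, len(seq))
```

Because the blocks are concatenated per identifier, every sequence in the returned dictionary should have the same length if the file is well formed.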
Beyond "clustal", the most widespread alignment formats are those generated by other popular programs:
The markx formats (markx0, markx1, markx2, markx3 and markx10) are used by Bill Pearson's suite of FASTA programs. See http://rcr-www.med.nyu.edu/rcr/fastaman.html for more information.
MSF is the format used for multiple sequences by the Accelrys GCG suite, formerly known as the GCG Wisconsin Package. GCG is a commercial software package of programs and utilities for gene and protein analysis. For further information see http://www.accelrys.com/products/gcg/
In all the alignment formats except MSF, gaps inserted into the sequence during the alignment are indicated by the '-' character. In contrast, MSF format uses '.' (full stop) for gaps inside a sequence and '~' (tilde) for gaps at the ends of an alignment.
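The gap conventions above matter when aligned sequences are moved between MSF and the other formats by hand. The following sketch shows one way to normalise them. The function names are hypothetical, and it assumes the input is a bare aligned sequence string containing no other use of '.' or '~'.

```python
def msf_gaps_to_dashes(aligned):
    """Replace the MSF gap characters '.' and '~' with '-'."""
    return aligned.replace(".", "-").replace("~", "-")


def strip_gaps(aligned):
    """Remove every gap character to recover the unaligned sequence."""
    return "".join(c for c in aligned if c not in "-.~")


msf_sequence = "~~TSPASIRPP..AGPSSR~"
print(msf_gaps_to_dashes(msf_sequence))
print(strip_gaps(msf_sequence))
```

Converting in the other direction would need to distinguish internal gaps from end gaps, which this sketch does not attempt.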
When editing alignments it is possible to use any text editor that is capable of writing files in plain text format. The drawback is that it is very easy to unwittingly break some rules of the format. Therefore, if you intend to manipulate or edit alignments substantially, investigate using a full-blown alignment editor such as mse. Such editors should have an option to save the alignment to a file in one or more of the standard formats.
Some alignment formats can hold only a pair of sequences (pairwise alignment) whereas others can hold multiple sequences (multiple sequence alignment). You should never use a pairwise alignment format to hold a multiple sequence alignment as the file would be unparsable by EMBOSS and other systems.
An alignment can be read from a file in any of the standard sequence formats that are suitable for alignments. For descriptions and examples of the supported formats see Section A.3, “Supported Alignment Formats”.
The supported alignment formats are summarised in the table below. The columns are as follows: Output Format (format name), Nuc ("Yes" indicates nucleotide sequence data may be represented), Pro ("Yes" indicates protein sequence data may be represented), Header (whether the standard EMBOSS alignment header is included), Minseq (minimum sequences in alignment), Maxseq (maximum sequences in alignment) and Description (short description of the format).
Output Format | Nuc | Pro | Header | Minseq | Maxseq | Description |
---|---|---|---|---|---|---|
clustal | Yes | Yes | No | 0 | 0 | clustalw format sequence |
debug | Yes | Yes | Yes | 0 | 0 | Debugging trace of full internal data content |
fasta | Yes | Yes | No | 0 | 0 | Fasta format sequence |
markx0 | Yes | Yes | No | 2 | 2 | Pearson MARKX0 format |
markx1 | Yes | Yes | No | 2 | 2 | Pearson MARKX1 format |
markx10 | Yes | Yes | No | 2 | 2 | Pearson MARKX10 format |
markx2 | Yes | Yes | No | 2 | 2 | Pearson MARKX2 format |
markx3 | Yes | Yes | No | 2 | 2 | Pearson MARKX3 format |
match | Yes | Yes | Yes | 2 | 2 | Start and end of matches between sequence pairs |
mega | Yes | Yes | No | 0 | 0 | Mega format sequence |
meganon | Yes | Yes | No | 0 | 0 | Mega non-interleaved format sequence |
msf | Yes | Yes | No | 0 | 0 | MSF format sequence |
nexus | Yes | Yes | No | 0 | 0 | nexus/paup format sequence |
nexusnon | Yes | Yes | No | 0 | 0 | nexus/paup non-interleaved format sequence |
pair | Yes | Yes | Yes | 2 | 2 | Simple pairwise alignment |
phylip | Yes | Yes | No | 0 | 0 | phylip format sequence |
phylipnon | Yes | Yes | No | 0 | 0 | phylip non-interleaved format sequence |
score | Yes | Yes | Yes | 2 | 2 | Score values for pairs of sequences |
selex | Yes | Yes | No | 0 | 0 | SELEX format sequence |
simple | Yes | Yes | Yes | 0 | 0 | Simple multiple alignment |
srs | Yes | Yes | Yes | 0 | 0 | Simple multiple sequence format for SRS |
srspair | Yes | Yes | Yes | 2 | 2 | Simple pairwise sequence format for SRS |
tcoffee | Yes | Yes | No | 0 | 0 | TCOFFEE program format |
treecon | Yes | Yes | No | 0 | 0 | Treecon format sequence |
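The Minseq and Maxseq columns can be used to check whether a format is able to hold a given alignment before writing it. The sketch below copies the limits for a handful of formats from the table. The names `LIMITS` and `can_hold` are hypothetical, and the code assumes that a value of 0 means "no limit", which the table itself does not state.

```python
# (Minseq, Maxseq) pairs copied from the table above for a few formats.
# ASSUMPTION: 0 is treated as "no limit".
LIMITS = {
    "clustal": (0, 0),
    "msf": (0, 0),
    "simple": (0, 0),
    "pair": (2, 2),
    "markx0": (2, 2),
    "srspair": (2, 2),
}


def can_hold(fmt, nseq):
    """Return True if format `fmt` can hold an alignment of `nseq` sequences."""
    minseq, maxseq = LIMITS[fmt]
    if minseq and nseq < minseq:
        return False
    if maxseq and nseq > maxseq:
        return False
    return True


for fmt in ("pair", "msf"):
    print(fmt, can_hold(fmt, 4))
```

With these values the pairwise formats reject an alignment of four sequences, in line with the advice never to store a multiple sequence alignment in a pairwise format.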
All alignment formats, except those (FASTA, MSF) that are also standard sequence formats, have a block of information (comments) at the start of the alignment giving the program, date, output filename, ID names of the sequences and some of the parameters and statistics of the alignment. For example:
    ########################################
    # Program: demoalign
    # Rundate: Thu Jan 17 09:30:08 2002
    # Report_file: stdout
    ########################################
    #=======================================
    #
    # Aligned_sequences: 4
    # 1: IXI_234
    # 2: IXI_235
    # 3: IXI_236
    # 4: IXI_237
    # Matrix: EBLOSUM62
    # Gap_penalty: 9
    # Extend_penalty: -1
    #
    # Length: 131
    # Identity: 95/131 (72.5%)
    # Similarity: 127/131 (96.9%)
    # Gaps: 25/131 (19.1%)
    # Score: 100.0
    #
    #=======================================
Length.
# Length: 131
This line gives the length of the alignment, including any gaps that have been introduced to construct the alignment.
Identity.
# Identity: 95/131 (72.5%)
This line gives the number of positions (95) over the length of the alignment where all of the residues or bases at that position are identical. The alignment length (131) and the percentage of positions (72.5%) in the alignment where there are such identities are shown.
Similarity.
# Similarity: 127/131 (96.9%)
This line gives the number of positions over the length of the alignment where over 50% of the residues or bases at that position are similar to one another. Two residues or bases are defined as similar when their comparison score, as defined by the comparison matrix used by the alignment program, is positive. Again, the alignment length (131) and the percentage (96.9%) of positions in the alignment where there are similarities are shown.
Gaps.
# Gaps: 25/131 (19.1%)
This line gives the number of positions over the length of the alignment containing one or more gap characters. The alignment length (131) and the percentage (19.1%) of positions in the alignment where there are gaps are shown.
Score.
# Score: 100.0
This line gives the score for the alignment. The value depends on the alignment program, particularly the comparison matrix and gap penalties, and of course the sequences that were aligned. For an explanation of the score schemes refer to the relevant application documentation.
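The Length, Identity and Gaps figures described above can be reproduced from the aligned sequences themselves. The following is a minimal sketch, not the EMBOSS implementation: the function names are hypothetical, comparison is case-insensitive by choice, and it assumes that a column containing a gap character is never counted as an identity. Similarity and Score are omitted because they depend on a comparison matrix and gap penalties.

```python
def alignment_stats(seqs, gap_chars="-.~"):
    """Count alignment length, identical columns and gapped columns."""
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    length = lengths.pop()
    identity = 0
    gaps = 0
    for column in zip(*(s.upper() for s in seqs)):
        if any(c in gap_chars for c in column):
            # One or more gap characters at this position.
            gaps += 1
        elif len(set(column)) == 1:
            # All residues or bases at this position are identical.
            identity += 1
    return {"length": length, "identity": identity, "gaps": gaps}


def header_value(count, length):
    """Format a count the way the header lines show it, e.g. 95/131 (72.5%)."""
    return f"{count}/{length} ({100 * count / length:.1f}%)"


stats = alignment_stats(["ACGT-A", "ACCTTA", "AC-TTA"])
print("# Length:", stats["length"])
print("# Identity:", header_value(stats["identity"], stats["length"]))
print("# Gaps:", header_value(stats["gaps"], stats["length"]))
```

For the three toy sequences, four of the six columns are identical and two contain a gap.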
Alignment inputs are referred to on the EMBOSS command line by their USA (Section 6.6, “The Uniform Sequence Address (USA)”). This is a standard sequence naming scheme used by all EMBOSS applications. In the case of alignments, a USA specifies two or more sequences that should be read from a file. Other sequence sources such as an application or a web server can also be specified.
Alignment outputs, in contrast, are not specified by a USA. If you want to write an alignment to a file in one of the standard alignment formats, you must specify a simple name for the file as you would for a standard output file.
You may also write aligned sequences to a file in one of the standard sequence formats (Section A.1, “Supported Sequence Formats”). These are the formats to choose when the sequences are to be used by another EMBOSS application. In such cases, the sequences given will still be in their aligned condition (i.e. may include gap characters) but the alignment will not be obvious as the sequences will not be lined up and sequence positions may not be indicated.
There is also a set of command line qualifiers (Section 6.4, “Datatype-specific Command Line Qualifiers”) that are used for alignment input and output. These allow you to set such things as the file name and format and the records to print to the standard alignment header (see above).
For example, here the water application is called on two input sequences (Seq1.seq and Seq2.seq) to generate an alignment output file (SeqOut.seq) in MSF format:

    water Seq1.seq Seq2.seq SeqOut.seq -aformat msf
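When the same call is issued from a script, the command can be assembled as an argument list. The sketch below only builds and prints the command shown above and does not run it. The function name `water_command` is hypothetical, and only the arguments that appear in the example are used.

```python
import shlex


def water_command(seq_a, seq_b, outfile, aformat="msf"):
    """Build the argument list for the water call shown in the text."""
    return ["water", seq_a, seq_b, outfile, "-aformat", aformat]


cmd = water_command("Seq1.seq", "Seq2.seq", "SeqOut.seq")
print(shlex.join(cmd))
```

The list could then be passed to `subprocess.run(cmd, check=True)` on a system where EMBOSS is installed. Note that the application may still prompt interactively for any values not supplied on the command line.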
Various programs for handling alignments are provided and are organised into groups of related functionality. These are either part of the main package or are included under the EMBASSY grouping. The tables below list the application groups, the applications in each group and the relevant EMBASSY packages.
Application Group | Description |
---|---|
Consensus | Merging sequences to make a consensus |
Differences | Finding differences between sequences |
Dot_plots | Dot plot sequence comparisons |
Global | Global sequence alignment |
Local | Local sequence alignment |
Multiple | Multiple sequence alignment |
Group | Application | Description |
---|---|---|
Consensus | cons | Creates a consensus from multiple alignments |
Consensus | megamerger | Merge two large overlapping nucleic acid sequences |
Consensus | merger | Merge two overlapping sequences |
Differences | diffseq | Find differences between nearly identical sequences |
Dot_plots | dotmatcher | Displays a thresholded dotplot of two sequences |
Dot_plots | dotpath | Non-overlapping wordmatch dotplot of two sequences |
Dot_plots | dottup | Displays a wordmatch dotplot of two sequences |
Dot_plots | polydot | Displays all-against-all dotplots of a set of sequences |
Global | est2genome | Align EST and genomic DNA sequences |
Global | needle | Needleman-Wunsch global alignment |
Global | stretcher | Finds the best global alignment between two sequences |
Global | esim4 | Align an mRNA to a genomic DNA sequence |
Local | matcher | Finds the best local alignments between two sequences |
Local | seqmatchall | All-against-all comparison of a set of sequences |
Local | supermatcher | Match large sequences against one or more other sequences |
Local | water | Smith-Waterman local alignment |
Local | wordmatch | Finds all exact matches of a given size between 2 sequences |
Multiple | emma | Multiple alignment program - interface to ClustalW program |
Multiple | infoalign | Information on a multiple sequence alignment |
Multiple | plotcon | Plot quality of conservation of a sequence alignment |
Multiple | prettyplot | Displays aligned sequences, with colouring and boxing |
Multiple | showalign | Displays a multiple sequence alignment |
Multiple | tranalign | Align nucleic coding regions given the aligned proteins |
Multiple | mse | Multiple Sequence Editor |
EMBASSY package | Description |
---|---|
PHYLIPNEW | The PHYLIPNEW programs are EMBOSS conversions of the programs in Joe Felsenstein's PHYLIP package, version 3.68. The package is used for phylogenetic analysis. |
MSE | The MSE package is a multiple sequence editor. The program was contributed to the EMBOSS package by the author, Will Gilbert, as one of the first EMBASSY programs. Users of the GCG package may find this program familiar - GCG converted an earlier (fortran) version of the same program to be their sequence assembly editor. |
ESIM4 | The ESIM4 package is an EMBOSS conversion of the SIM4 package from Liliana Florea. The esim4 application aligns an mRNA to a genomic DNA sequence. |
HMMERNEW | EMBASSY HMMERNEW is a suite of application wrappers to the original hmmer v2.3.2 applications written by Sean Eddy. The ehmmalign application aligns sequences to an HMM profile and the ehmmbuild application builds a profile HMM from an alignment. |