5.4. Introduction to Alignment Formats

A wide variety of sequence alignment formats are currently in use, which makes it difficult to exchange files between different software packages. EMBOSS simplifies things by supporting most of the common alignment formats for input and output, which makes it easy to interoperate with other sequence analysis packages. If your alignment is not in a recognised standard format you will first need to convert it into one.

5.4.1. What is an Alignment Format?

An alignment format defines the permitted layout and content of text in a file. This includes required text tokens and formatting conventions. Typically, the aligned sequences, sequence identifier codes and sequence position numbers are given in some form. Non-printable control characters do not occur in any of the common alignment formats, making most of the formats suitable for viewing on screen or for printing out.

"clustal" is probably the most commonly used alignment format. It is generated by the Clustalw multiple sequence alignment program. Many variants of this format are in common use. The first line must begin with the text CLUSTAL. The alignment is given in blocks of 50 residues with the aligned sequences appearing under each other. There are no sequence position numbers. Each sequence line begins with the sequence identifier code. For example:

CLUSTAL W(1.83) multiple sequence alignment


IXI_234         TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235         TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236         TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237         TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT
                                                                  

IXI_234         GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
IXI_235         GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAG
IXI_236         GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR--G
IXI_237         GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR--G
                                                                  

IXI_234         SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_235         SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_236         SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
IXI_237         SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE
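The layout rules described above can be sketched as a small reader. This is a minimal illustration in Python, not the parser EMBOSS itself uses; the function name parse_clustal and the two-sequence sample are hypothetical, and the sketch assumes that lines starting with whitespace carry only conservation marks.

```python
def parse_clustal(text):
    """Parse CLUSTAL-format text into a dict of identifier -> aligned sequence."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("first line must begin with CLUSTAL")
    sequences = {}  # dicts keep insertion order, so sequence order is preserved
    for line in lines[1:]:
        # Blank lines separate blocks; lines that start with whitespace
        # are assumed to hold conservation marks, not sequence data.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        identifier, fragment = fields[0], fields[1]
        # Append this block's fragment to what was read in earlier blocks.
        sequences[identifier] = sequences.get(identifier, "") + fragment
    return sequences


# Hypothetical two-sequence alignment, used only for illustration.
example = """CLUSTAL W(1.83) multiple sequence alignment

seqA            ACDEFG--HI
seqB            ACD-FGKLHI
                *** **  **

seqA            KLMN
seqB            KLMN
                ****
"""

alignment = parse_clustal(example)
for name, seq in alignment.items():
    print(name, seq)
```

Joining the fragments block by block recovers each full aligned sequence, gaps included.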

Beyond "clustal", the most widespread alignment formats are those generated by other popular programs:

markx

Variants (markx0, markx1, markx2, markx3, markx10) of this format are used by Bill Pearson's suite of FASTA programs. See http://rcr-www.med.nyu.edu/rcr/fastaman.html for more information.

MSF

MSF is the format used for multiple sequences by the Accelrys GCG suite, formerly known as the GCG Wisconsin Package. GCG is a commercial software package of programs and utilities for gene and protein analysis. For further information see http://www.accelrys.com/products/gcg/

In all the alignment formats except MSF, gaps inserted into the sequence during the alignment are indicated by the '-' character. In contrast, MSF format uses '.' (full stop) for gaps inside a sequence and '~' (tilde) for gaps at the ends of an alignment.
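The difference in gap characters matters when aligned sequences are moved between MSF and other formats by hand. The following Python sketch shows one way the two conventions could be mapped onto each other; the function names are hypothetical and this is not how EMBOSS performs the conversion internally.

```python
def msf_gaps_to_dash(sequence):
    """Replace the MSF gap characters '.' and '~' with '-'."""
    return sequence.translate(str.maketrans({".": "-", "~": "-"}))


def dash_gaps_to_msf(sequence):
    """Convert '-' gaps to MSF style: '~' at the ends, '.' inside."""
    core = sequence.strip("-")
    if not core:
        # The sequence consists of gaps only, so every gap is an end gap.
        return "~" * len(sequence)
    lead = len(sequence) - len(sequence.lstrip("-"))
    trail = len(sequence) - len(sequence.rstrip("-"))
    return "~" * lead + core.replace("-", ".") + "~" * trail


print(msf_gaps_to_dash("~~AC..GT~"))
print(dash_gaps_to_msf("--AC--GT-"))
```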

When editing alignments it is possible to use any text editor that is capable of writing files in plain text format. The drawback is that it is very easy to unwittingly break some rules of the format. Therefore, if you intend to manipulate or edit alignments substantially, investigate using a full-blown alignment editor such as mse. Such editors should have an option to save the alignment to a file in one or more of the standard formats.

5.4.2. Supported Alignment Formats

Some alignment formats can hold only a pair of sequences (pairwise alignment) whereas others can hold multiple sequences (multiple sequence alignment). You should never use a pairwise alignment format to hold a multiple sequence alignment as the file would be unparsable by EMBOSS and other systems.

An alignment can be read from a file in any of the standard sequence formats that are suitable for alignments. For descriptions and examples of the supported formats see Section A.3, “Supported Alignment Formats”.

The supported alignment formats are summarised in the table below. The columns are as follows: Output format (format name), Nuc ("Yes" indicates nucleotide sequence data may be represented), Pro ("Yes" indicates protein sequence data may be represented), Header (whether the standard EMBOSS alignment header is included), Minseq (minimum sequences in alignment), Maxseq (maximum sequences in alignment) and Description (short description of the format).

Table 5.5. Alignment formats

Output Format  Nuc  Pro  Header  Minseq  Maxseq  Description
clustal        Yes  Yes  No      0       0       clustalw format sequence
debug          Yes  Yes  Yes     0       0       Debugging trace of full internal data content
fasta          Yes  Yes  No      0       0       Fasta format sequence
markx0         Yes  Yes  No      2       2       Pearson MARKX0 format
markx1         Yes  Yes  No      2       2       Pearson MARKX1 format
markx10        Yes  Yes  No      2       2       Pearson MARKX10 format
markx2         Yes  Yes  No      2       2       Pearson MARKX2 format
markx3         Yes  Yes  No      2       2       Pearson MARKX3 format
match          Yes  Yes  Yes     2       2       Start and end of matches between sequence pairs
mega           Yes  Yes  No      0       0       Mega format sequence
meganon        Yes  Yes  No      0       0       Mega non-interleaved format sequence
msf            Yes  Yes  No      0       0       MSF format sequence
nexus          Yes  Yes  No      0       0       nexus/paup format sequence
nexusnon       Yes  Yes  No      0       0       nexus/paup non-interleaved format sequence
pair           Yes  Yes  Yes     2       2       Simple pairwise alignment
phylip         Yes  Yes  No      0       0       phylip format sequence
phylipnon      Yes  Yes  No      0       0       phylip non-interleaved format sequence
score          Yes  Yes  Yes     2       2       Score values for pairs of sequences
selex          Yes  Yes  No      0       0       SELEX format sequence
simple         Yes  Yes  Yes     0       0       Simple multiple alignment
srs            Yes  Yes  Yes     0       0       Simple multiple sequence format for SRS
srspair        Yes  Yes  Yes     2       2       Simple pairwise sequence format for SRS
tcoffee        Yes  Yes  No      0       0       TCOFFEE program format
treecon        Yes  Yes  No      0       0       Treecon format sequence
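
The Minseq and Maxseq columns can be used to check, before writing a file, whether a format is able to hold a given number of sequences. The Python sketch below copies the limits for a few formats from the table and assumes that a value of 0 means "no limit"; that reading of 0, and the names FORMAT_LIMITS and can_hold, are assumptions made for illustration.

```python
# (minseq, maxseq) pairs copied from the alignment formats table
# for a handful of formats.
FORMAT_LIMITS = {
    "clustal": (0, 0),
    "fasta": (0, 0),
    "msf": (0, 0),
    "simple": (0, 0),
    "markx0": (2, 2),
    "pair": (2, 2),
    "srspair": (2, 2),
}


def can_hold(format_name, n_sequences):
    """Return True if the format can hold n_sequences aligned sequences.

    Assumption: a limit of 0 means the format imposes no limit.
    """
    minseq, maxseq = FORMAT_LIMITS[format_name]
    if minseq and n_sequences < minseq:
        return False
    if maxseq and n_sequences > maxseq:
        return False
    return True


print(can_hold("pair", 2))  # a pairwise format holding two sequences
print(can_hold("pair", 4))  # a pairwise format asked to hold four
print(can_hold("msf", 4))   # a multiple alignment format holding four
```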

5.4.3. Contents of an Alignment File

All alignment formats, excluding those (FASTA, MSF) that are also standard sequence formats, have a block of information (comments) at the start of the alignment describing the program, date, output filename, ID names of the sequences and some of the parameters and statistics of the alignment. For example:

########################################
# Program:  demoalign
# Rundate:  Thu Jan 17 09:30:08 2002
# Report_file: stdout
########################################
#=======================================
#
# Aligned_sequences: 4
# 1: IXI_234
# 2: IXI_235
# 3: IXI_236
# 4: IXI_237
# Matrix: EBLOSUM62
# Gap_penalty: 9
# Extend_penalty: -1
#
# Length: 131
# Identity:      95/131 (72.5%)
# Similarity:   127/131 (96.9%)
# Gaps:          25/131 (19.1%)
# Score:  100.0
#
#=======================================   
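
A header of this shape is straightforward to read programmatically. The Python sketch below collects the "# Key: value" lines into a dictionary and the numbered lines into a list of sequence names. The function names and the shortened sample header are hypothetical, and the sketch assumes every record fits on one line.

```python
import re


def parse_alignment_header(text):
    """Return (fields, names) read from '#'-prefixed header lines."""
    fields, names = {}, []
    for line in text.splitlines():
        # Numbered lines such as '# 1: IXI_234' name the aligned sequences.
        numbered = re.match(r"#\s*\d+:\s*(\S+)", line)
        if numbered:
            names.append(numbered.group(1))
            continue
        # Other records look like '# Key: value'; rule lines do not match.
        keyed = re.match(r"#\s*([A-Za-z_]+):\s*(.*\S)", line)
        if keyed:
            fields[keyed.group(1)] = keyed.group(2)
    return fields, names


def split_ratio(value):
    """Split a value such as '95/131 (72.5%)' into the integers (95, 131)."""
    count, total = re.match(r"(\d+)/(\d+)", value).groups()
    return int(count), int(total)


# Hypothetical shortened header, used only for illustration.
header = """\
########################################
# Program:  demoalign
########################################
# Aligned_sequences: 2
# 1: seqA
# 2: seqB
# Length: 14
# Identity:      11/14 (78.6%)
# Gaps:           3/14 (21.4%)
"""

fields, names = parse_alignment_header(header)
print(names)
print(fields["Length"], split_ratio(fields["Identity"]))
```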

Length. 

# Length: 131

This line gives the length of the alignment, including any gaps that have been introduced to construct the alignment.

Identity. 

# Identity:      95/131 (72.5%)

This line gives the number of positions (95) over the length of the alignment where all of the residues or bases at that position are identical. The alignment length (131) and the percentage of positions (72.5%) in the alignment where there are such identities are shown.
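
The identity count can be sketched as a column-by-column comparison. The Python example below is an illustration under stated assumptions rather than the EMBOSS implementation: the function name is hypothetical, the sequences are a made-up pair, and a column made up entirely of gap characters is assumed not to count as an identity.

```python
def count_identical_columns(sequences, gap="-"):
    """Return (identical, length) for a list of equal-length aligned sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must all have the same length")
    identical = sum(
        1
        for column in zip(*sequences)
        # Identical means every sequence has the same residue at this position.
        # Assumption: an all-gap column is not counted.
        if len(set(column)) == 1 and column[0] != gap
    )
    return identical, length


# Hypothetical pairwise alignment, used only for illustration.
aligned = ["ACDEFG--HIKLMN", "ACD-FGKLHIKLMN"]
identical, length = count_identical_columns(aligned)
print(f"# Identity: {identical}/{length} ({100 * identical / length:.1f}%)")
```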

Similarity. 

# Similarity:   127/131 (96.9%)

This line gives the number of positions over the length of the alignment where over 50% of the residues or bases at that position are similar to one another. Two residues or bases are defined as similar when their comparison score, as defined by the comparison matrix used by the alignment program, is positive. Again the alignment length (131) and the percentage (96.9%) of positions in the alignment where there are similarities are indicated.

Gaps. 

# Gaps:          25/131 (19.1%)

This line gives the number of positions over the length of the alignment containing one or more gap characters. The alignment length (131) and percentage (19.1%) of positions in the alignment where there are gaps are given.
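
The gap count can be sketched in the same column-wise way: a position is counted once if any sequence has a gap character there. The Python example below uses a hypothetical function name and a made-up three-sequence alignment, and assumes '-' is the only gap character.

```python
def count_gap_columns(sequences, gap_chars="-"):
    """Return (gapped, length): columns containing one or more gap characters."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must all have the same length")
    gapped = sum(
        1
        for column in zip(*sequences)
        if any(residue in gap_chars for residue in column)
    )
    return gapped, length


# Hypothetical three-sequence alignment, used only for illustration.
aligned = ["ACDEFG--HIKLMN", "ACD-FGKLHIKLMN", "ACDEFGKLHIKLM-"]
gapped, length = count_gap_columns(aligned)
print(f"# Gaps: {gapped}/{length} ({100 * gapped / length:.1f}%)")
```

For MSF-style data the same function could be called with gap_chars=".~".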

Score. 

# Score: 100.0

This line gives the score for the alignment. The value depends on the alignment program, particularly the comparison matrix and gap penalties, and of course the sequences that were aligned. For an explanation of the score schemes refer to the relevant application documentation.

5.4.4. Specifying Alignments on the Command Line

Alignment inputs are referred to on the EMBOSS command line by their USA (Section 6.6, “The Uniform Sequence Address (USA)”). This is a standard sequence naming scheme used by all EMBOSS applications. In the case of alignments, a USA specifies two or more sequences that should be read from a file. Other sequence sources such as an application or a web server can also be specified.

Alignment outputs, in contrast, are not specified by a USA. If you want to write an alignment to a file in one of the standard alignment formats, you must specify a simple name for the file as you would for a standard output file.

You may also write aligned sequences to a file in one of the standard sequence formats (Section A.1, “Supported Sequence Formats”). These are the formats to choose when the sequences are to be used by another EMBOSS application. In such cases, the sequences given will still be in their aligned condition (i.e. may include gap characters) but the alignment will not be obvious as the sequences will not be lined up and sequence positions may not be indicated.
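
To illustrate what "aligned but not lined up" means, the Python sketch below writes a small alignment as FASTA-style text. The function name, the line width and the sample data are hypothetical, and this is a simplified stand-in for what an EMBOSS application would write.

```python
def alignment_to_fasta(alignment, width=60):
    """Format a dict of name -> aligned sequence as FASTA-style text."""
    lines = []
    for name, sequence in alignment.items():
        lines.append(f">{name}")
        # Gap characters are kept, so each record is still in its aligned
        # condition, but records follow one another instead of being stacked.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical alignment, used only for illustration.
fasta_text = alignment_to_fasta({"seqA": "AC--GT", "seqB": "ACTTGT"}, width=4)
print(fasta_text, end="")
```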

There is also a set of command line qualifiers (Section 6.4, “Datatype-specific Command Line Qualifiers”) that are used for alignment input and output. These allow you to set such things as the file name and format and the records to print to the standard alignment header (see above).

For example, here the water application is called on two input sequences (Seq1.seq and Seq2.seq) to generate an alignment output file (SeqOut.seq) in MSF format:

water Seq1.seq Seq2.seq SeqOut.seq -aformat msf
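
When the same command has to be issued from a script, it can be assembled as an argument list. The Python sketch below only builds and prints the command shown above; the helper name build_water_command is hypothetical, and actually running the command assumes EMBOSS is installed and water is on the PATH.

```python
import shlex


def build_water_command(seq_a, seq_b, outfile, aformat="msf"):
    """Return the water command line as a list of arguments."""
    return ["water", seq_a, seq_b, outfile, "-aformat", aformat]


command = build_water_command("Seq1.seq", "Seq2.seq", "SeqOut.seq")
print(shlex.join(command))

# To execute it (assuming EMBOSS is installed), the list could be passed to
# subprocess.run(command, check=True). water may then prompt for any
# parameters that were not given on the command line.
```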

5.4.5. Applications for Sequence Alignment

Various programs for handling alignments (see "Alignment Applications" below) are provided and are organised into groups of related functionality (see "Application Groups for Alignments" below). These are either part of the main package or are included under the EMBASSY grouping (see "EMBASSY Packages for Alignments" below).

Application Groups for Alignments

Application Group  Description
Consensus          Merging sequences to make a consensus
Differences        Finding differences between sequences
Dot_plots          Dot plot sequence comparisons
Global             Global sequence alignment
Local              Local sequence alignment
Multiple           Multiple sequence alignment

Alignment Applications

Group        Application   Description
Consensus    cons          Creates a consensus from multiple alignments
             megamerger    Merge two large overlapping nucleic acid sequences
             merger        Merge two overlapping sequences
Differences  diffseq       Find differences between nearly identical sequences
Dot_plots    dotmatcher    Displays a thresholded dotplot of two sequences
             dotpath       Non-overlapping wordmatch dotplot of two sequences
             dottup        Displays a wordmatch dotplot of two sequences
             polydot       Displays all-against-all dotplots of a set of sequences
Global       est2genome    Align EST and genomic DNA sequences
             needle        Needleman-Wunsch global alignment
             stretcher     Finds the best global alignment between two sequences
             esim4         Align an mRNA to a genomic DNA sequence
Local        matcher       Finds the best local alignments between two sequences
             seqmatchall   All-against-all comparison of a set of sequences
             supermatcher  Match large sequences against one or more other sequences
             water         Smith-Waterman local alignment
             wordmatch     Finds all exact matches of a given size between 2 sequences
Multiple     emma          Multiple alignment program - interface to ClustalW program
             infoalign     Information on a multiple sequence alignment
             plotcon       Plot quality of conservation of a sequence alignment
             prettyplot    Displays aligned sequences, with colouring and boxing
             showalign     Displays a multiple sequence alignment
             tranalign     Align nucleic coding regions given the aligned proteins
             mse           Multiple Sequence Editor

EMBASSY Packages for Alignments

EMBASSY package  Description
PHYLIPNEW        The PHYLIPNEW programs are EMBOSS conversions of the programs in Joe Felsenstein's PHYLIP package, version 3.68. The package is used for phylogenetic analysis.
MSE              The MSE package is a multiple sequence editor. The program was contributed to the EMBOSS package by the author, Will Gilbert, as one of the first EMBASSY programs. Users of the GCG package may find this program familiar - GCG converted an earlier (Fortran) version of the same program to be their sequence assembly editor.
ESIM4            The ESIM4 package is an EMBOSS conversion of the SIM4 package from Liliana Florea. The esim4 application aligns an mRNA to a genomic DNA sequence.
HMMERNEW         EMBASSY HMMERNEW is a suite of application wrappers to the original hmmer v2.3.2 applications written by Sean Eddy. The ehmmalign application aligns sequences to an HMM profile and the ehmmbuild application builds a profile HMM from an alignment.