5.4. Introduction to Alignment Formats

A wide variety of sequence alignment formats are currently in use, which makes converting files between formats difficult where diverse software packages are used. EMBOSS simplifies things by supporting most of the common alignment formats for input and output, which makes interoperation with other sequence analysis packages easy. If your alignment is not in a recognised standard format you will first need to convert it into one.

5.4.1. What is an Alignment Format?

An alignment format defines the permitted layout and content of text in a file. This includes required text tokens and formatting conventions. Typically, the aligned sequences, sequence identifier codes and sequence position numbers are given in some form. Non-printable control characters do not occur in any of the common alignment formats, making most of the formats suitable for viewing on screen or for printing out.

"clustal" is probably the most commonly used alignment format. It is generated by the ClustalW multiple sequence alignment program. Many variants of this format are in common use. The first line must begin with the text CLUSTAL. The alignment is given in blocks of 50 residues with the aligned sequences appearing under each other. There are no sequence position numbers. Each sequence line begins with the sequence identifier code. For example:

CLUSTAL W(1.83) multiple sequence alignment


IXI_234         TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235         TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236         TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237         TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT
                                                                  

IXI_234         GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
IXI_235         GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAG
IXI_236         GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR--G
IXI_237         GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR--G
                                                                  

IXI_234         SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_235         SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_236         SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
IXI_237         SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE
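
The layout rules above (a first line beginning with CLUSTAL, then blocks of lines that each start with an identifier) can be sketched as a small reader. This is a minimal illustration in Python, not EMBOSS code: the function name parse_clustal is hypothetical, and it assumes that any line starting with whitespace is a conservation line that can be skipped.

```python
def parse_clustal(text):
    """Parse CLUSTAL-style text into {identifier: aligned_sequence}.

    Assumption: lines that are blank or start with whitespace carry no
    sequence data (they are block separators or conservation lines).
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("first line must begin with the text CLUSTAL")
    sequences = {}
    for line in lines[1:]:
        if not line.strip() or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        identifier, chunk = parts[0], parts[1]
        # Blocks for the same identifier are concatenated in file order.
        sequences[identifier] = sequences.get(identifier, "") + chunk
    return sequences


example = """CLUSTAL W(1.83) multiple sequence alignment

IXI_234         TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235         TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT

IXI_234         GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
IXI_235         GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAG
"""
alignment = parse_clustal(example)
```

Here alignment maps each identifier to its full gapped sequence, so both entries have the same length.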

Beyond "clustal", the most widespread alignment formats are those generated by other popular programs:

markx: Variants (markx0, markx1, markx2, markx3, markx10) of this format are used by Bill Pearson's suite of FASTA programs. See http://rcr-www.med.nyu.edu/rcr/fastaman.html for more information.
MSF: MSF is the format used for multiple sequences by the Accelrys GCG suite, formerly known as the GCG Wisconsin Package. GCG is a commercial software package of programs and utilities for gene and protein analysis. For further information see http://www.accelrys.com/products/gcg/

In all the alignment formats except MSF, gaps inserted into the sequence during the alignment are indicated by the '-' character. In contrast, MSF format uses '.' (full stop) for gaps inside a sequence and '~' (tilde) for gaps at the ends of an alignment.
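
As a sketch of what this difference means in practice, the following Python function maps the two MSF gap characters onto the '-' used elsewhere. The function name normalise_msf_gaps is hypothetical, and the sketch assumes that '.' and '~' never occur in the sequence data with any other meaning.

```python
def normalise_msf_gaps(sequence):
    """Replace the MSF gap characters '.' and '~' with '-'."""
    return sequence.translate(str.maketrans({".": "-", "~": "-"}))


converted = normalise_msf_gaps("~~ACG..T~")
```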

When editing alignments it is possible to use any text editor that is capable of writing files in plain text format. The drawback is that it is very easy to unwittingly break some rules of the format. Therefore, if you intend to manipulate or edit alignments substantially, investigate using a full-blown alignment editor such as mse. Such editors should have an option to save the alignment to a file in one or more of the standard formats.

5.4.2. Supported Alignment Formats

Some alignment formats can hold only a pair of sequences (pairwise alignment) whereas others can hold multiple sequences (multiple sequence alignment). You should never use a pairwise alignment format to hold a multiple sequence alignment as the file would be unparsable by EMBOSS and other systems.

An alignment can be read from a file in any of the standard sequence formats that are suitable for alignments. For descriptions and examples of the supported formats see Section A.3, “Supported Alignment Formats”.

The supported alignment formats are summarised in the table below. The columns are as follows: Output format (format name), Nuc ("Yes" indicates nucleotide sequence data may be represented), Pro ("Yes" indicates protein sequence data may be represented), Header (whether the standard EMBOSS alignment header is included), Minseq (minimum sequences in alignment), Maxseq (maximum sequences in alignment) and Description (short description of the format).

Table 5.5. Alignment formats
Output Format	Nuc	Pro	Header	Minseq	Maxseq	Description
clustal	Yes	Yes	No	0	0	clustalw format sequence
debug	Yes	Yes	Yes	0	0	Debugging trace of full internal data content
fasta	Yes	Yes	No	0	0	Fasta format sequence
markx0	Yes	Yes	No	2	2	Pearson MARKX0 format
markx1	Yes	Yes	No	2	2	Pearson MARKX1 format
markx10	Yes	Yes	No	2	2	Pearson MARKX10 format
markx2	Yes	Yes	No	2	2	Pearson MARKX2 format
markx3	Yes	Yes	No	2	2	Pearson MARKX3 format
match	Yes	Yes	Yes	2	2	Start and end of matches between sequence pairs
mega	Yes	Yes	No	0	0	Mega format sequence
meganon	Yes	Yes	No	0	0	Mega non-interleaved format sequence
msf	Yes	Yes	No	0	0	MSF format sequence
nexus	Yes	Yes	No	0	0	nexus/paup format sequence
nexusnon	Yes	Yes	No	0	0	nexus/paup non-interleaved format sequence
pair	Yes	Yes	Yes	2	2	Simple pairwise alignment
phylip	Yes	Yes	No	0	0	phylip format sequence
phylipnon	Yes	Yes	No	0	0	phylip non-interleaved format sequence
score	Yes	Yes	Yes	2	2	Score values for pairs of sequences
selex	Yes	Yes	No	0	0	SELEX format sequence
simple	Yes	Yes	Yes	0	0	Simple multiple alignment
srs	Yes	Yes	Yes	0	0	Simple multiple sequence format for SRS
srspair	Yes	Yes	Yes	2	2	Simple pairwise sequence format for SRS
tcoffee	Yes	Yes	No	0	0	TCOFFEE program format
treecon	Yes	Yes	No	0	0	Treecon format sequence
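
The pairwise restriction can be checked before writing a file. The Python sketch below lists the formats shown in the table with Minseq and Maxseq both equal to 2; the name can_hold is hypothetical, and treating a Minseq/Maxseq of 0 as "no limit" is an assumption, not something the table states.

```python
# Formats listed in Table 5.5 with Minseq = Maxseq = 2.
PAIRWISE_FORMATS = {
    "markx0", "markx1", "markx10", "markx2", "markx3",
    "match", "pair", "score", "srspair",
}


def can_hold(format_name, n_sequences):
    """Return True if the format can hold an alignment of n_sequences.

    Assumption: formats with Minseq/Maxseq of 0 place no limit on the
    number of sequences.
    """
    if format_name in PAIRWISE_FORMATS:
        return n_sequences == 2
    return True
```

For instance, can_hold("pair", 4) is False, whereas can_hold("clustal", 4) is True under the stated assumption.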

5.4.3. Contents of an Alignment File

All alignment formats, excluding those (FASTA, MSF) that are also standard sequence formats, have a block of information (comments) at the start of the alignment. This block describes the program, date, output filename, ID names of the sequences and some of the parameters and statistics of the alignment. For example:

########################################
# Program:  demoalign
# Rundate:  Thu Jan 17 09:30:08 2002
# Report_file: stdout
########################################
#=======================================
#
# Aligned_sequences: 4
# 1: IXI_234
# 2: IXI_235
# 3: IXI_236
# 4: IXI_237
# Matrix: EBLOSUM62
# Gap_penalty: 9
# Extend_penalty: -1
#
# Length: 131
# Identity:      95/131 (72.5%)
# Similarity:   127/131 (96.9%)
# Gaps:          25/131 (19.1%)
# Score:  100.0
#
#=======================================

Length.

# Length: 131

This line gives the length of the alignment, including any gaps that have been introduced to construct the alignment.

Identity.

# Identity:      95/131 (72.5%)

This line gives the number of positions (95) over the length of the alignment where all of the residues or bases at that position are identical. The alignment length (131) and the percentage of positions (72.5%) in the alignment where there are such identities are shown.

Similarity.

# Similarity:   127/131 (96.9%)

This line gives the number of positions over the length of the alignment where over 50% of the residues or bases at that position are similar to one another. Two residues or bases are defined as similar when their comparison score, as defined by the comparison matrix used by the alignment program, is positive. Again the alignment length (131) and the percentage (96.9%) of positions in the alignment where there are similarities are indicated.

Gaps.

# Gaps:          25/131 (19.1%)

This line gives the number of positions over the length of the alignment containing one or more gap characters. The alignment length (131) and percentage (19.1%) of positions in the alignment where there are gaps are given.
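
The Length, Identity and Gaps definitions above can be sketched as a column-by-column count. This Python example is illustrative only: column_stats is a hypothetical name, the input is a toy alignment, and it assumes that a column containing a gap never counts as an identity. Similarity is omitted because it needs the comparison matrix.

```python
def column_stats(sequences, gap_chars="-"):
    """Return (length, identity, gaps) for equal-length aligned sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must all have the same length")
    identity = 0
    gaps = 0
    for column in zip(*sequences):
        if any(char in gap_chars for char in column):
            gaps += 1          # one or more gap characters at this position
        elif len(set(column)) == 1:
            identity += 1      # every residue or base is the same
    return length, identity, gaps


length, identity, gaps = column_stats(["ACGT-A", "ACGTTA", "AC-TTA"])
```

For this toy alignment the six columns contain four identities and two gap positions.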

Score.

# Score: 100.0

This line gives the score for the alignment. The value depends on the alignment program, particularly the comparison matrix and gap penalties, and of course the sequences that were aligned. For an explanation of the score schemes refer to the relevant application documentation.
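
The header lines described above are regular enough to read with a short script. The following Python sketch extracts the Identity, Similarity and Gaps counts from the example header and recomputes each percentage. The names COUNT_LINE, parse_counts and recomputed_percent are hypothetical, and the regular expression assumes the layout shown in the example.

```python
import re

HEADER = """\
# Length: 131
# Identity:      95/131 (72.5%)
# Similarity:   127/131 (96.9%)
# Gaps:          25/131 (19.1%)
# Score:  100.0
"""

# Matches lines of the form "# Name:  count/length (percent%)".
COUNT_LINE = re.compile(
    r"^#\s*(Identity|Similarity|Gaps):\s*(\d+)/(\d+)\s*\(\s*([\d.]+)%\)"
)


def parse_counts(header_text):
    """Return {name: (count, length, percent)} for the count lines."""
    counts = {}
    for line in header_text.splitlines():
        match = COUNT_LINE.match(line)
        if match:
            name, count, length, percent = match.groups()
            counts[name] = (int(count), int(length), float(percent))
    return counts


def recomputed_percent(count, length):
    """Percentage of alignment positions, rounded to one decimal place."""
    return round(100 * count / length, 1)


counts = parse_counts(HEADER)
```

Each reported percentage is the count divided by the alignment length, so recomputed_percent(95, 131) reproduces the 72.5 shown on the Identity line.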

5.4.4. Specifying Alignments on the Command Line

Alignment inputs are referred to on the EMBOSS command line by their USA (Section 6.6, “The Uniform Sequence Address (USA)”). This is a standard sequence naming scheme used by all EMBOSS applications. In the case of alignments, a USA specifies two or more sequences that should be read from a file. Other sequence sources such as an application or a web server can also be specified.

Alignment outputs, in contrast, are not specified by a USA. If you want to write an alignment to a file in one of the standard alignment formats, you must specify a simple name for the file as you would for a standard output file.

You may also write aligned sequences to a file in one of the standard sequence formats (Section A.1, “Supported Sequence Formats”). These are the formats to choose when the sequences are to be used by another EMBOSS application. In such cases, the sequences given will still be in their aligned condition (i.e. may include gap characters) but the alignment will not be obvious as the sequences will not be lined up and sequence positions may not be indicated.
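
To illustrate, the Python sketch below reads aligned sequences stored in FASTA format and checks that they are still in their aligned condition. The names read_fasta and is_aligned are hypothetical, and the test used (all sequences have equal length once gap characters are included) is an assumption about what "aligned" means here.

```python
def read_fasta(text):
    """Read FASTA text into {identifier: sequence}, keeping gap characters."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def is_aligned(records):
    """True if every sequence, gaps included, has the same length."""
    return len({len(seq) for seq in records.values()}) == 1


fasta_text = """>IXI_234
TSPASIRPPAGPSSRPAMVSSRRT
>IXI_235
TSPASIRPPAGPSSR---------
"""
records = read_fasta(fasta_text)
```

The gap characters survive in the second record, which is why the two sequences still have equal length even though the file does not line them up.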

There is also a set of command line qualifiers (Section 6.4, “Datatype-specific Command Line Qualifiers”) that are used for alignment input and output. These allow you to set such things as the file name and format and the records to print to the standard alignment header (see above).

For example, here the water application is called on two input sequences (Seq1.seq and Seq2.seq) to generate an alignment output file (SeqOut.seq) in MSF format:

water Seq1.seq Seq2.seq SeqOut.seq -aformat msf
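
When such a command is driven from a script, it can be assembled as an argument list. This Python sketch only rebuilds the command shown above; the helper name water_command is hypothetical. Running the result assumes that water is installed and on the PATH, and the application may still prompt for any values not supplied on the command line.

```python
def water_command(seq_a, seq_b, outfile, aformat="msf"):
    """Build the argument list for the water call shown above."""
    return ["water", seq_a, seq_b, outfile, "-aformat", aformat]


command = water_command("Seq1.seq", "Seq2.seq", "SeqOut.seq")
# Where EMBOSS is available, the list can be passed to subprocess.run:
#     import subprocess
#     subprocess.run(command, check=True)
```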

5.4.5. Applications for Sequence Alignment

Various programs for handling alignments (see "Alignment Applications" below) are provided and are organised into groups of related functionality (see "Application Groups for Alignments" below). These are either part of the main package or are included under the EMBASSY grouping (see "EMBASSY Packages for Alignments" below).

Application Groups for Alignments

Application Group	Description
Consensus	Merging sequences to make a consensus
Differences	Finding differences between sequences
Dot_plots	Dot plot sequence comparisons
Global	Global sequence alignment
Local	Local sequence alignment
Multiple	Multiple sequence alignment

Alignment Applications

Group	Application	Description
Consensus	cons	Creates a consensus from multiple alignments
	megamerger	Merge two large overlapping nucleic acid sequences
	merger	Merge two overlapping sequences
Differences	diffseq	Find differences between nearly identical sequences
Dot_plots	dotmatcher	Displays a thresholded dotplot of two sequences
	dotpath	Non-overlapping wordmatch dotplot of two sequences
	dottup	Displays a wordmatch dotplot of two sequences
	polydot	Displays all-against-all dotplots of a set of sequences
Global	est2genome	Align EST and genomic DNA sequences
	needle	Needleman-Wunsch global alignment
	stretcher	Finds the best global alignment between two sequences
	esim4	Align an mRNA to a genomic DNA sequence
Local	matcher	Finds the best local alignments between two sequences
	seqmatchall	All-against-all comparison of a set of sequences
	supermatcher	Match large sequences against one or more other sequences
	water	Smith-Waterman local alignment
	wordmatch	Finds all exact matches of a given size between 2 sequences
Multiple	emma	Multiple alignment program - interface to ClustalW program
	infoalign	Information on a multiple sequence alignment
	plotcon	Plot quality of conservation of a sequence alignment
	prettyplot	Displays aligned sequences, with colouring and boxing
	showalign	Displays a multiple sequence alignment
	tranalign	Align nucleic coding regions given the aligned proteins
	mse	Multiple Sequence Editor

EMBASSY Packages for Alignments

EMBASSY package	Description
PHYLIPNEW	The PHYLIPNEW programs are EMBOSS conversions of the programs in Joe Felsenstein's PHYLIP package, version 3.68. The package is used for phylogenetic analysis.
MSE	The MSE package is a multiple sequence editor. The program was contributed to the EMBOSS package by the author, Will Gilbert, as one of the first EMBASSY programs. Users of the GCG package may find this program familiar: GCG converted an earlier (Fortran) version of the same program to be their sequence assembly editor.
ESIM4	The ESIM4 package is an EMBOSS conversion of the SIM4 package from Liliana Florea. The esim4 application aligns an mRNA to a genomic DNA sequence.
HMMERNEW	EMBASSY HMMERNEW is a suite of application wrappers to the original hmmer v2.3.2 applications written by Sean Eddy. The ehmmalign application aligns sequences to an HMM profile and the ehmmbuild application builds a profile HMM from an alignment.
