4.3. Working with Alignments

EMBOSS includes tools for local and global sequence alignments and associated analyses such as dotplots. You will take a look at these now.

4.3.1. Pairwise Sequence Alignment

There is no unique, precise, or universally applicable notion of sequence similarity. An alignment is an arrangement of two sequences which shows where the two sequences are similar, and where they differ. An optimal alignment, of course, is one that exhibits the most significant similarities, and the least differences. Broadly, there are three categories of methods for sequence comparison:

  • Segment methods compare all windows (overlapping segments of a predetermined length, e.g. 10 amino acids) from one sequence to all segments from the other. This is the approach used in dotplots.

  • Optimal global alignment methods allow the best overall score for the comparison of the two sequences to be obtained, including a consideration of gaps.

  • Optimal local alignment algorithms seek to identify the best local similarities between two sequences but, unlike segment methods, include explicit consideration of gaps.

4.3.2. Dotplots

The most intuitive representation of the comparison between two sequences uses dot plots. One sequence is represented on each axis and significant matching regions are distributed along diagonals in the matrix.

4.3.3. Exercise: Making a Dotplot

% dottup
DNA sequence dot plot
Input sequence: embl:U23808
Second sequence: embl:L07770
Word size [4]: 10
Graph type [x11]:

A window will pop up on your screen that should look something like this:

Making a Dotplot

The diagonal lines represent areas where the two sequences align well. You can see that there are five clear diagonals. You will remember that you are aligning genomic and cDNA - these five diagonals represent the five exons of the gene! If you look at the original EMBL entry for the genomic sequence, you will see that the annotated entry says that there are five exons in this gene. So the results are in agreement.

The settings you have used for this example are those that give the best results. dottup looks for exact matches between sequences. As you expect the exon regions from the genomic sequence to exactly match the cDNA sequence you can use longer word lengths as you should still get exact matches. This gives a very clean plot. If you were to match the cDNA sequence against that of a related sequence, e.g. the rhodopsin from mouse (embl:m55171) then you wouldn't expect long exact matches so should use a shorter word length.

4.3.4. Exercise: Examining Dotplot Parameters

Repeat the previous example but compare the frog rhodopsin cDNA against the mouse genomic DNA:

% dottup embl:m55171 embl:L07770
DNA sequence dot plot
Word size [4]: 10
Graph type [x11]:

Repeat this for both comparisons using different word sizes. What do you notice? Which word sizes give the clearest plot? Why are the diagonals in the first and last exons not so clear? (hint: look back at the results for showfeat).

The dotplot doesn't give us any detailed sequence information. For this, you need to use different programs. The alignment algorithms you will be using are more rigorous than those used for searching databases; so even if you have retrieved a sequence from a database using something like BLAST, it will be well worth your while performing a careful pairwise alignment afterwards. The basic idea behind the sequence alignment programs is to align the two sequences in such a way as to produce the highest score. A scoring matrix is used to add points to the score for each match and subtract them for each mismatch. The matrices used for nucleic acid alignments tend to involve fairly simple match/mismatch scoring schemes, while the matrices commonly used for scoring protein alignments are more complex, with scores designed to reflect similarity between the different amino acids rather than simply scoring identities. Over time various mutations occur in sequences. The scoring matrices attempt to cope with mutations, but insertions and deletions require some extra parameters to allow the introduction of gaps in the alignment. There are penalties both for the creation of gaps and for the extension of existing ones; the default gap parameters given in alignment programs have been found to be empirically correct with test sequences but you should experiment with different gap penalties.

4.3.5. Global Alignment

A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length. The alignment maximises regions of similarity and minimises gaps using the scoring matrices and gap parameters provided to the program. The EMBOSS program needle is an implementation of the Needleman-Wunsch algorithm for global alignment; the computation is rigorous and needle can be time consuming to run if the sequences are long.

4.3.6. Exercise: needle

% needle
Needleman-Wunsch global alignment.
Input sequence:embl:L07770
Second sequence(s):embl:U23808
Gap opening penalty [10.0]:
Gap extension penalty [0.5]:
Output file [l07770.needle]:


% more l07770.needle
########################################
# Program: needle
# Rundate: Wed 18 Feb 2009 11:44:20
# Commandline: needle
#    -asequence embl:L07770
#    -bsequence embl:U23808
# Align_format: srspair
# Report_file: l07770.needle
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: L07770
# 2: U23808
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 8914
# Identity:    1683/8914 (18.9%)
# Similarity:  1683/8914 (18.9%)
# Gaps:        7230/8914 (81.1%)
# Score: 7471.0
# 
#
#=======================================

L07770             0 --------------------------------------------------      0
                                                                       
U23808             1 ggatccatgttagtagccttcaaatcgagtttagtctaatttgaatcaaa     50
.
. Output truncated for brevity
.

Note that as this is a global alignment, the entire genomic sequence is given in the output, even in regions where it does not line up with the cDNA. Scroll down the output until you reach an area of alignment:

L07770           470 ---------------------------ggtgaagtggccctctggtcact    492
                                                |||||||||||||||||||||||
U23808          6051 tgccttttcctgttctttttactaacaggtgaagtggccctctggtcact   6100

L07770           493 ggtagtattggccgttgaaagatatatggtggtctgcaagcccatggcca    542
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
U23808          6101 ggtagtattggccgttgaaagatatatggtggtctgcaagcccatggcca   6150

L07770           543 acttccgattcggggagaaccatgctattatgggtgtagccttcacatgg    592
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
U23808          6151 acttccgattcggggagaaccatgctattatgggtgtagccttcacatgg   6200

L07770           593 atcatggctttgtcttgtgctgctcctcctctcttcggatggtc------    636
                     ||||||||||||||||||||||||||||||||||||||||||||      
U23808          6201 atcatggctttgtcttgtgctgctcctcctctcttcggatggtccaggta   6250

Only part of the output is shown as it is very long. You should look at the whole output and note that there are five aligned regions that represent the five exons as predicted from the dotplot. Look carefully at the boundaries of the aligned regions. It is known that intron/exon boundaries have a conserved gt..ag pair of dinucleotides delimiting the splice sites. It is most unlikely that needle has correctly aligned these boundaries as it has no understanding of models of gene structure. The scoring method it uses does not specifically treat splice sites. The program est2genome has an extra scoring factor that allows it to do a better job of aligning intron/exon boundaries.

4.3.7. Local Alignment

As mentioned above, global sequence alignment algorithms align sequences over their entire lengths. You do need to think about whether that type of alignment makes sense for your sequences. For the example, where you expect each exon to be represented in the sequences and in the same order, it has worked well - however, how well do you think this approach would work with, for example, multidomain proteins that share one domain but not others, or sequences where there have been regions of duplication?

A second comparison method, local alignment, searches for regions of local similarity and need not include the entire length of the sequences. Local alignment methods are very useful for scanning databases or when you do not know that the sequences are similar over their entire lengths. The EMBOSS program water is a rigorous implementation of the Smith Waterman algorithm for local alignments.

4.3.8. Exercise: water

% water
Smith-Waterman local alignment.
Input sequence: embl:L07770
Second sequence(s): embl:U23808
Gap opening penalty [10.0]:
Gap extension penalty [0.5]:
Output file [l07770.water]:

% more l07770.water

########################################
# Program: water
# Rundate: Wed 18 Feb 2009 11:55:05
# Commandline: water
#    -asequence embl:L07770
#    -bsequence embl:U23808
# Align_format: srspair
# Report_file: l07770.water
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: L07770
# 2: U23808
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 3487
# Identity:    1683/3487 (48.3%)
# Similarity:  1683/3487 (48.3%)
# Gaps:        1804/3487 (51.7%)
# Score: 7475.0
# 
#
#=======================================

L07770             2 gtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaa     51
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
U23808          5362 gtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaa   5411

L07770            52 aaaagaaacacagaaggcattctttctatacaagaaaggactttatagag    101
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
U23808          5412 aaaagaaacacagaaggcattctttctatacaagaaaggactttatagag   5461

L07770           102 ctgctaccatgaacggaacagaaggtccaaatttttatgtccccatgtcc    151
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
U23808          5462 ctgctaccatgaacggaacagaaggtccaaatttttatgtccccatgtcc   5511

L07770           152 aacaaaactggggtggtacgaagcccattcgattaccctcagtattactt    201
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
U23808          5512 aacaaaactggggtggtacgaagcccattcgattaccctcagtattactt   5561

L07770           202 agcagagccatggcaatattcagcactggctgcttacatgttcctgctca    251
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
U23808          5562 agcagagccatggcaatattcagcactggctgcttacatgttcctgctca   5611

L07770           252 tcctgcttgggttaccaatcaacttcatgaccttgtttgttaccatccag    301
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
U23808          5612 tcctgcttgggttaccaatcaacttcatgaccttgtttgttaccatccag   5661

L07770           302 cacaagaaactcagaacacccctaaactacatcctgctgaacctggtatt    351
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
U23808          5662 cacaagaaactcagaacacccctaaactacatcctgctgaacctggtatt   5711

L07770           352 tgccaatcacttcatggtcctgtgtgggttcacggtgacaatgtacacct    401
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
U23808          5712 tgccaatcacttcatggtcctgtgtgggttcacggtgacaatgtacacct   5761

L07770           402 caatgcacggctacttcatctttggccaaactggttgctacattgaaggc    451
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
U23808          5762 caatgcacggctacttcatctttggccaaactggttgctacattgaaggc   5811

L07770           452 ttctttgctacacttggt--------------------------------    469
                     ||||||||||||||||||                                
U23808          5812 ttctttgctacacttggtggtaagttccaatgggctttcgtcactgatat   5861
.
. Output truncated for brevity
.

Scroll down the entire output and again, note that five exons have been found. Such results need not always be the case as local alignment programs may only report the best matching section even when there are other sections that have reasonable similarity.

In these cases you have not had to adjust the gap parameters from the defaults used in these programs. You should be aware that you may need to do so with your own sequences.

EMBOSS contains other pairwise alignment programs - stretcher and matcher are global and local alignment programs respectively that are less rigorous than needle and water and therefore run more quickly; they should be used for sequences too large for water and needle. supermatcher is designed for local alignments of very large sequences and is even less rigorous in its implementation. The documentation pages for all these programs can be found at http://emboss.open-bio.org/rel/dev/apps/