EMBOSS includes tools for local and global sequence alignments and associated analyses such as dotplots. You will take a look at these now.
There is no unique, precise, or universally applicable notion of sequence similarity. An alignment is an arrangement of two sequences which shows where the two sequences are similar, and where they differ. An optimal alignment, of course, is one that exhibits the most significant similarities, and the least differences. Broadly, there are three categories of methods for sequence comparison:
Segment methods compare all windows (overlapping segments of a predetermined length, e.g. 10 amino acids) from one sequence to all segments from the other. This is the approach used in dotplots.
Optimal global alignment methods allow the best overall score for the comparison of the two sequences to be obtained, including a consideration of gaps.
Optimal local alignment algorithms seek to identify the best local similarities between two sequences but, unlike segment methods, include explicit consideration of gaps.
The most intuitive representation of the comparison between two sequences uses dot plots. One sequence is represented on each axis and significant matching regions are distributed along diagonals in the matrix.
%
dottup
DNA sequence dot plot Input sequence:embl:U23808
Second sequence:embl:L07770
Word size [4]:10
Graph type [x11]:
A window will pop up on your screen that should look something like this:
The diagonal lines represent areas where the two sequences align well. You can see that there are five clear diagonals. You will remember that you are aligning genomic and cDNA - these five diagonals represent the five exons of the gene! If you look at the original EMBL entry for the genomic sequence, you will see that the annotated entry says that there are five exons in this gene. So the results are in agreement.
The settings you have used for this example are those that give the best results. dottup looks for exact matches between sequences. As you expect the exon regions from the genomic sequence to exactly match the cDNA sequence you can use longer word lengths as you should still get exact matches. This gives a very clean plot. If you were to match the cDNA sequence against that of a related sequence, e.g. the rhodopsin from mouse (embl:m55171
) then you wouldn't expect long exact matches so should use a shorter word length.
Repeat the previous example but compare the frog rhodopsin cDNA against the mouse genomic DNA:
%
dottup embl:m55171 embl:L07770
DNA sequence dot plot Word size [4]:10
Graph type [x11]:
Repeat this for both comparisons using different word sizes. What do you notice? Which word sizes give the clearest plot? Why are the diagonals in the first and last exons not so clear? (hint: look back at the results for showfeat).
The dotplot doesn't give us any detailed sequence information. For this, you need to use different programs. The alignment algorithms you will be using are more rigorous than those used for searching databases; so even if you have retrieved a sequence from a database using something like BLAST, it will be well worth your while performing a careful pairwise alignment afterwards. The basic idea behind the sequence alignment programs is to align the two sequences in such a way as to produce the highest score. A scoring matrix is used to add points to the score for each match and subtract them for each mismatch. The matrices used for nucleic acid alignments tend to involve fairly simple match/mismatch scoring schemes, while the matrices commonly used for scoring protein alignments are more complex, with scores designed to reflect similarity between the different amino acids rather than simply scoring identities. Over time various mutations occur in sequences. The scoring matrices attempt to cope with mutations, but insertions and deletions require some extra parameters to allow the introduction of gaps in the alignment. There are penalties both for the creation of gaps and for the extension of existing ones; the default gap parameters given in alignment programs have been found to be empirically correct with test sequences but you should experiment with different gap penalties.
A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length. The alignment maximises regions of similarity and minimises gaps using the scoring matrices and gap parameters provided to the program. The EMBOSS program needle is an implementation of the Needleman-Wunsch algorithm for global alignment; the computation is rigorous and needle can be time consuming to run if the sequences are long.
%
needle
Needleman-Wunsch global alignment. Input sequence:embl:L07770
Second sequence(s):embl:U23808
Gap opening penalty [10.0]: Gap extension penalty [0.5]: Output file [l07770.needle]:%
more l07770.needle
######################################## # Program: needle # Rundate: Wed 18 Feb 2009 11:44:20 # Commandline: needle # -asequence embl:L07770 # -bsequence embl:U23808 # Align_format: srspair # Report_file: l07770.needle ######################################## #======================================= # # Aligned_sequences: 2 # 1: L07770 # 2: U23808 # Matrix: EDNAFULL # Gap_penalty: 10.0 # Extend_penalty: 0.5 # # Length: 8914 # Identity: 1683/8914 (18.9%) # Similarity: 1683/8914 (18.9%) # Gaps: 7230/8914 (81.1%) # Score: 7471.0 # # #======================================= L07770 0 -------------------------------------------------- 0 U23808 1 ggatccatgttagtagccttcaaatcgagtttagtctaatttgaatcaaa 50 . . Output truncated for brevity .
Note that as this is a global alignment, the entire genomic sequence is given in the output, even in regions where it does not line up with the cDNA. Scroll down the output until you reach an area of alignment:
L07770 470 ---------------------------ggtgaagtggccctctggtcact 492 ||||||||||||||||||||||| U23808 6051 tgccttttcctgttctttttactaacaggtgaagtggccctctggtcact 6100 L07770 493 ggtagtattggccgttgaaagatatatggtggtctgcaagcccatggcca 542 |||||||||||||||||||||||||||||||||||||||||||||||||| U23808 6101 ggtagtattggccgttgaaagatatatggtggtctgcaagcccatggcca 6150 L07770 543 acttccgattcggggagaaccatgctattatgggtgtagccttcacatgg 592 |||||||||||||||||||||||||||||||||||||||||||||||||| U23808 6151 acttccgattcggggagaaccatgctattatgggtgtagccttcacatgg 6200 L07770 593 atcatggctttgtcttgtgctgctcctcctctcttcggatggtc------ 636 |||||||||||||||||||||||||||||||||||||||||||| U23808 6201 atcatggctttgtcttgtgctgctcctcctctcttcggatggtccaggta 6250
Only part of the output is shown as it is very long. You should look at the whole output and note that there are five aligned regions that represent the five exons as predicted from the dotplot. Look carefully at the boundaries of the aligned regions. It is known that intron/exon boundaries have a conserved gt..ag
pair of dinucleotides delimiting the splice sites. It is most unlikely that needle has correctly aligned these boundaries as it has no understanding of models of gene structure. The scoring method it uses does not specifically treat splice sites. The program est2genome has an extra scoring factor that allows it to do a better job of aligning intron/exon boundaries.
As mentioned above, global sequence alignment algorithms align sequences over their entire lengths. You do need to think about whether that type of alignment makes sense for your sequences. For the example, where you expect each exon to be represented in the sequences and in the same order, it has worked well - however, how well do you think this approach would work with, for example, multidomain proteins that share one domain but not others, or sequences where there have been regions of duplication?
A second comparison method, local alignment, searches for regions of local similarity and need not include the entire length of the sequences. Local alignment methods are very useful for scanning databases or when you do not know that the sequences are similar over their entire lengths. The EMBOSS program water is a rigorous implementation of the Smith Waterman algorithm for local alignments.
%
water
Smith-Waterman local alignment. Input sequence:embl:L07770
Second sequence(s):embl:U23808
Gap opening penalty [10.0]: Gap extension penalty [0.5]: Output file [l07770.water]:%
more l07770.water
######################################## # Program: water # Rundate: Wed 18 Feb 2009 11:55:05 # Commandline: water # -asequence embl:L07770 # -bsequence embl:U23808 # Align_format: srspair # Report_file: l07770.water ######################################## #======================================= # # Aligned_sequences: 2 # 1: L07770 # 2: U23808 # Matrix: EDNAFULL # Gap_penalty: 10.0 # Extend_penalty: 0.5 # # Length: 3487 # Identity: 1683/3487 (48.3%) # Similarity: 1683/3487 (48.3%) # Gaps: 1804/3487 (51.7%) # Score: 7475.0 # # #======================================= L07770 2 gtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaa 51 |||||||||||||||||||||||||||||||||||||||||||||||||| U23808 5362 gtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaa 5411 L07770 52 aaaagaaacacagaaggcattctttctatacaagaaaggactttatagag 101 |||||||||||||||||||||||||||||||||||||||||||||||||| U23808 5412 aaaagaaacacagaaggcattctttctatacaagaaaggactttatagag 5461 L07770 102 ctgctaccatgaacggaacagaaggtccaaatttttatgtccccatgtcc 151 |||||||||||||||||||||||||||||||||||||||||||||||||| U23808 5462 ctgctaccatgaacggaacagaaggtccaaatttttatgtccccatgtcc 5511 L07770 152 aacaaaactggggtggtacgaagcccattcgattaccctcagtattactt 201 |||||||||||||||||||||||||||||||||||||||||||||||||| U23808 5512 aacaaaactggggtggtacgaagcccattcgattaccctcagtattactt 5561 L07770 202 agcagagccatggcaatattcagcactggctgcttacatgttcctgctca 251 |||||||||||||||||||||||||||||||||||||||||||||||||| U23808 5562 agcagagccatggcaatattcagcactggctgcttacatgttcctgctca 5611 L07770 252 tcctgcttgggttaccaatcaacttcatgaccttgtttgttaccatccag 301 |||||||||||||||||||||||||||||||||||||||||||||||||| U23808 5612 tcctgcttgggttaccaatcaacttcatgaccttgtttgttaccatccag 5661 L07770 302 cacaagaaactcagaacacccctaaactacatcctgctgaacctggtatt 351 |||||||||||||||||||||||||||||||||||||||||||||||||| U23808 5662 cacaagaaactcagaacacccctaaactacatcctgctgaacctggtatt 5711 L07770 352 tgccaatcacttcatggtcctgtgtgggttcacggtgacaatgtacacct 401 |||||||||||||||||||||||||||||||||||||||||||||||||| U23808 5712 tgccaatcacttcatggtcctgtgtgggttcacggtgacaatgtacacct 5761 L07770 402 caatgcacggctacttcatctttggccaaactggttgctacattgaaggc 451 |||||||||||||||||||||||||||||||||||||||||||||||||| U23808 5762 caatgcacggctacttcatctttggccaaactggttgctacattgaaggc 5811 L07770 452 ttctttgctacacttggt-------------------------------- 469 |||||||||||||||||| U23808 5812 ttctttgctacacttggtggtaagttccaatgggctttcgtcactgatat 5861 . . Output truncated for brevity .
Scroll down the entire output and again, note that five exons have been found. Such results need not always be the case as local alignment programs may only report the best matching section even when there are other sections that have reasonable similarity.
In these cases you have not had to adjust the gap parameters from the defaults used in these programs. You should be aware that you may need to do so with your own sequences.
EMBOSS contains other pairwise alignment programs - stretcher and matcher are global and local alignment programs respectively that are less rigorous than needle and water and therefore run more quickly; they should be used for sequences too large for water and needle. supermatcher is designed for local alignments of very large sequences and is even less rigorous in its implementation. The documentation pages for all these programs can be found at http://emboss.open-bio.org/rel/dev/apps/