A few of the EMBOSS applications that can be used to analyse protein sequences are now introduced. Obviously, the pairwise sequence comparison methods illustrated before with nucleic acid sequences can also be used with protein sequences.
In this section you'll see some simple EMBOSS applications for translating your cDNA sequence into protein. You should be aware that gene structure prediction is a tough problem, and recognising exon/intron boundaries in genomic sequences is not easy; for now, rather than deal with that aspect of prediction, you'll use the cDNA sequence introduced above. First, you need to identify the open reading frame. You can get a rapid visual overview of the distribution of ORFs in the six frames of the sequence using the EMBOSS program plotorf.
% plotorf
Plot potential open reading frames
Input sequence: embl:L07770
Graph type [x11]:
You'll see graphical output that shows the potential open reading frames (ORFs) in all six frames.
The longest ORF is in frame 2 from around position 100 to 1200. You will now identify the exact start and end points for the translation. To do this you can use the EMBOSS program getorf.
% getorf -options
Finds and extracts open reading frames (ORFs)
Input nucleotide sequence(s): embl:L07770
protein output sequence(s): [L07770.orf]:
Genetic codes
         0 : Standard
         1 : Standard (with alternative initiation codons)
         2 : Vertebrate Mitochondrial
         3 : Yeast Mitochondrial
         4 : Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
         5 : Invertebrate Mitochondrial
         6 : Ciliate Macronuclear and Dasycladacean
         9 : Echinoderm Mitochondrial
        10 : Euplotid Nuclear
        11 : Bacterial
        12 : Alternative Yeast Nuclear
        13 : Ascidian Mitochondrial
        14 : Flatworm Mitochondrial
        15 : Blepharisma Macronuclear
        16 : Chlorophycean Mitochondrial
        21 : Trematode Mitochondrial
        22 : Scenedesmus obliquus
        23 : Thraustochytrium Mitochondrial
Code to use [0]:
Minimum nucleotide size of ORF to report [30]:
Maximum nucleotide size of ORF to report [1000000]:
Type of sequence to output
         0 : Translation of regions between STOP codons
         1 : Translation of regions between START and STOP codons
         2 : Nucleic sequences between STOP codons
         3 : Nucleic sequences between START and STOP codons
         4 : Nucleotides flanking START codons
         5 : Nucleotides flanking initial STOP codons
         6 : Nucleotides flanking ending STOP codons
Type of output [0]: 3
protein output sequence(s) [l07770.orf]:
Notice that you can specify the genetic code that is most appropriate for the organism your sequence comes from, and you can also choose the type of information that is reported to you. In this case, you are simply interested in the positions of the start and stop codons for this sequence, so output type 3 (nucleic sequences between START and STOP codons) was chosen.
plotorf is just a graphical representation of the textual information produced by getorf. Since you asked for all ORFs above the minimum size to be reported, getorf reports a number of potential ORFs. You know from plotorf that the ORF of interest lies in the region from about 100 to 1200, so scroll through the output file, l07770.orf, until you find it. What are the actual start and end positions?
% more l07770.orf
.
.
.
>L07770_7 [110 - 1171] Xenopus laevis rhodopsin mRNA, complete cds.
atgaacggaacagaaggtccaaatttttatgtccccatgtccaacaaaactggggtggta
cgaagcccattcgattaccctcagtattacttagcagagccatggcaatattcagcactg
.
.
Output truncated for brevity
.
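To make the search concrete, the following Python sketch scans a sequence for regions that run from a START codon to the next in-frame STOP codon, which is roughly what output type 3 asks getorf for. It is only an illustration under stated assumptions, not getorf's implementation: the function name `find_orfs` and the toy sequence are invented here, only the three forward frames are checked, ATG is taken as the sole START codon of the standard code, and regions still open at the end of the sequence are ignored. Positions are 1-based and run from the first base of the START codon to the last base before the STOP codon, a convention consistent with the 110 - 1171 region above being 1062 bases (354 codons) long.

```python
STOP_CODONS = {"taa", "tag", "tga"}  # standard genetic code
START_CODON = "atg"                  # assumption: ATG is the only START codon


def find_orfs(seq, min_size=30):
    """Return sorted (start, end, frame) tuples for START..STOP regions.

    Hypothetical helper, forward frames only. start and end are 1-based and
    inclusive; end is the last base before the STOP codon.
    """
    seq = seq.lower()
    orfs = []
    for frame in range(3):
        start = None  # 0-based index of the currently open START codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i - start >= min_size:  # size in nucleotides, STOP excluded
                    orfs.append((start + 1, i, frame + 1))
                start = None
    return sorted(orfs)


# Toy sequence (made up): two flanking bases, ATG, ten GCC codons, TAA, one base.
toy = "cc" + "atg" + "gcc" * 10 + "taa" + "g"
print(find_orfs(toy))  # [(3, 35, 3)]
```

Raising `min_size` above 33 makes the toy ORF disappear from the list, which mirrors the effect of the "Minimum nucleotide size of ORF to report" prompt.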
From the previous exercise you should have found that the region to be translated is from 110 to 1171 in the cDNA sequence. Now you can use transeq to translate that region and use the translated peptide for some further analyses.
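What translating a numbered subregion involves can be sketched in a few lines of Python. This is a rough illustration rather than a description of how transeq works internally: the names `CODON_TABLE` and `translate_region` are invented for this example, only the standard genetic code and frame 1 are handled, and the four bases placed in front of the ORF fragment are made up so that the region does not start at position 1. The fragment itself is the first line of the ORF shown in l07770.orf.

```python
from itertools import product

# Standard genetic code, codons enumerated in the conventional t, c, a, g order.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_region(seq, begin, end):
    """Translate seq[begin..end] (1-based, inclusive) in frame 1.

    Hypothetical helper: unknown codons become 'X', and trailing bases that
    do not fill a codon are dropped.
    """
    region = seq[begin - 1:end].lower()
    usable = len(region) - len(region) % 3
    return "".join(
        CODON_TABLE.get(region[i:i + 3], "X") for i in range(0, usable, 3)
    )


# First 60 bases of the ORF reported by getorf, preceded by four invented bases.
fragment = "atgaacggaacagaaggtccaaatttttatgtccccatgtccaacaaaactggggtggta"
demo = "ggcc" + fragment
print(translate_region(demo, 5, 64))  # MNGTEGPNFYVPMSNKTGVV
```

With the full L07770 sequence in place of `demo`, the same call with 110 and 1171 would cover 1062 bases, that is 354 codons.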
Let's practise using command line flags (qualifiers) again. The new ones here are -sbegin and -send. These allow you to specify a subregion of your sequence; in this case you will ask transeq to translate only the part of embl:L07770 that you have identified as the coding region. You should remember -outseq from before:
% transeq embl:L07770 -sbegin 110 -send 1171 -outseq L07770.pep
Translate nucleic acid sequences
% more L07770.pep
>L07770+1 Xenopus laevis rhodopsin mRNA, complete cds.
MNGTEGPNFYVPMSNKTGVVRSPFDYPQYYLAEPWQYSALAAYMFLLILLGLPINFMTLF
VTIQHKKLRTPLNYILLNLVFANHFMVLCGFTVTMYTSMHGYFIFGQTGCYIEGFFATLG
GEVALWSLVVLAVERYMVVCKPMANFRFGENHAIMGVAFTWIMALSCAAPPLFGWSRYIP
EGMQCSCGVDYYTLKPEVNNESFVIYMFIVHFTIPLIVIFFCYGRLLCTVKEAAAQQQES
ATTQKAEKEVTRMVVIMVVFFLICWVPYAYVAFYIFTHQGSNFGPVFMTVPAFFAKSSAI
YNPVIYIVLNKQFRNCLITTLCCGKNPFGDEDGSSAATSKTEASSVSSSQVSPA
We saw earlier that the SwissProt entry for this protein has the identifier opsd_xenla; test your understanding of EMBOSS so far by using needle to compare your translated product with the database sequence. Compare your findings with the SwissProt entry.
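As a reminder of what a global comparison does, here is a minimal Needleman-Wunsch sketch in Python. It is deliberately simplified and is not a stand-in for needle: it assumes a flat match/mismatch score and a linear gap penalty instead of a substitution matrix with separate gap opening and extension penalties, and the function name `global_align`, the scoring values and the short peptides are all invented for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a simple global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a aligned against leading gaps
    for j in range(1, cols):
        score[0][j] = j * gap  # b aligned against leading gaps

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy peptides (made up): the second lacks one residue of the first.
print(global_align("MNGTEG", "MNTEG"))  # (3, 'MNGTEG', 'MN-TEG')
```

If your translation is correct, the real needle comparison against sw:opsd_xenla should show far fewer differences than this toy example suggests.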
As an alternative to -sbegin and -send you can specify the start, the end, and whether to reverse complement as part of the sequence USA. The format to use is db:sequence[start:end] (or db:sequence[start:end:r] to reverse complement). Start must be smaller than end. To refer to the actual start or end of the sequence, use the value 0 instead of a position. To count from the end of the sequence rather than from the beginning, use negative numbers. The following examples illustrate the syntax:
Residues 10-20 | sw:opsd_xenla[10:20] |
The last ten residues | sw:opsd_xenla[-10:0] |
The last twenty residues bar 5 | sw:opsd_xenla[-20:-6] |
Bases 134-458, reverse complemented | embl:L07770[134:458:r] |
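The range rules can be checked with a short Python sketch. This is an illustration of the conventions described above, not EMBOSS code: `parse_usa`, `usa_subsequence` and the test strings are invented names and data, a USA without a bracketed range is treated as the whole sequence, and a range whose start equals its end is accepted here as a single residue, which is an assumption.

```python
import re

# name[start:end] or name[start:end:r]; start and end may be negative or 0.
USA_RANGE = re.compile(
    r"^(?P<name>[^\[\]]+)\[(?P<start>-?\d+):(?P<end>-?\d+)(?P<rev>:r)?\]$"
)
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def parse_usa(usa):
    """Split a USA into (name, start, end, reverse); no range means 0, 0."""
    m = USA_RANGE.match(usa)
    if m is None:
        return usa, 0, 0, False
    return m["name"], int(m["start"]), int(m["end"]), m["rev"] is not None


def _resolve(pos, length, default):
    if pos == 0:
        return default           # 0 stands for the actual start or end
    if pos < 0:
        return length + pos + 1  # -1 is the last position
    return pos


def usa_subsequence(seq, start=0, end=0, reverse=False):
    """Apply a 1-based, inclusive start:end range, optionally reverse complemented."""
    first = _resolve(start, len(seq), 1)
    last = _resolve(end, len(seq), len(seq))
    if not 1 <= first <= last <= len(seq):
        raise ValueError(f"invalid range {start}:{end} for length {len(seq)}")
    region = seq[first - 1:last]
    if reverse:
        region = region.translate(COMPLEMENT)[::-1]
    return region


letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # stand-in for a 26-residue sequence
print(usa_subsequence(letters, 10, 20))    # JKLMNOPQRST
print(usa_subsequence(letters, -10, 0))    # QRSTUVWXYZ
print(usa_subsequence(letters, -20, -6))   # GHIJKLMNOPQRSTU
print(usa_subsequence("aaccggtt", 1, 4, reverse=True))  # ggtt
print(parse_usa("embl:L07770[134:458:r]"))  # ('embl:L07770', 134, 458, True)
```

The third call shows why [-20:-6] means "the last twenty residues bar 5": position -20 is the twentieth from the end and -6 is the sixth from the end, so the final five residues are left out.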