4.4. Protein Analysis

This section introduces a few of the EMBOSS applications that can be used to analyse protein sequences. The pairwise sequence comparison methods illustrated earlier with nucleic acid sequences can also be used with protein sequences.

4.4.1. Identifying the ORF

In this section you'll see some simple EMBOSS applications for translating your cDNA sequence into protein. You should be aware that gene structure prediction is a tough problem, and recognising exon/intron boundaries in genomic sequences is not easy; for now, rather than deal with that aspect of prediction, you'll use the cDNA sequence introduced above. First, you need to identify the open reading frame. You can get a rapid visual overview of the distribution of ORFs in the six frames of the sequence using the EMBOSS program plotorf.

4.4.2. Exercise: plotorf

% plotorf
Plot potential open reading frames
Input sequence: embl:L07770
Graph type [x11]:

You'll see graphical output that shows the potential open reading frames (ORFs) in all six frames:


The longest ORF is in frame 2 from around position 100 to 1200. You will now identify the exact start and end points for the translation. To do this you can use the EMBOSS program getorf.

4.4.3. Exercise: getorf

% getorf -options
Finds and extracts open reading frames (ORFs)
Input nucleotide sequence(s): embl:L07770
protein output sequence(s): [L07770.orf]:
Genetic codes
         0 : Standard
         1 : Standard (with alternative initiation codons)
         2 : Vertebrate Mitochondrial
         3 : Yeast Mitochondrial
         4 : Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
         5 : Invertebrate Mitochondrial
         6 : Ciliate Macronuclear and Dasycladacean
         9 : Echinoderm Mitochondrial
        10 : Euplotid Nuclear
        11 : Bacterial
        12 : Alternative Yeast Nuclear
        13 : Ascidian Mitochondrial
        14 : Flatworm Mitochondrial
        15 : Blepharisma Macronuclear
        16 : Chlorophycean Mitochondrial
        21 : Trematode Mitochondrial
        22 : Scenedesmus obliquus
        23 : Thraustochytrium Mitochondrial
Code to use [0]:
Minimum nucleotide size of ORF to report [30]:
Maximum nucleotide size of ORF to report [1000000]:
Type of sequence to output
0 : Translation of regions between STOP codons
1 : Translation of regions between START and STOP codons
2 : Nucleic sequences between STOP codons
3 : Nucleic sequences between START and STOP codons
4 : Nucleotides flanking START codons
5 : Nucleotides flanking initial STOP codons
6 : Nucleotides flanking ending STOP codons
Type of output [0]: 3
protein output sequence(s) [l07770.orf]:

Notice that you can specify the genetic code that is most appropriate for the organism your sequence comes from, and you can also choose the type of information that is reported to you. In this case, you are simply interested in the positions of the start and stop codons for this sequence.

plotorf is just a graphical representation of the textual information produced by getorf. Since you asked for all ORFs above a minimum size to be reported, getorf reports a number of potential ORFs. You know from plotorf that the ORF of interest lies in the region from around 100 to 1200, so scroll through the output file, l07770.orf, until you identify it. What are the actual start and end positions?

% more l07770.orf
>L07770_7 [110 - 1171] Xenopus laevis  rhodopsin mRNA, complete cds.
. Output truncated for brevity
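
To make the idea of "regions between START and STOP codons" concrete, here is a minimal Python sketch of an ORF scan. It is not how getorf is implemented: it assumes the standard genetic code, treats only ATG as a start codon, scans the three forward frames only, and the function name find_orfs is made up for this illustration. Positions are reported 1-based and inclusive, with the stop codon left out of the range.

```python
# Illustrative only: a simplified ORF scan, not the getorf algorithm.
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(seq, min_len=30):
    """Return (start, end, frame) tuples for ATG..STOP regions.

    Only the three forward frames are scanned. Coordinates are 1-based
    and inclusive, and the stop codon is not part of the reported range.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # 0-based index of the current ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i - start >= min_len:
                    orfs.append((start + 1, i, frame + 1))
                start = None
    return orfs


# A toy sequence: CC + ATG AAA TTT TAG + GG
toy = "CCATGAAATTTTAGGG"
print(find_orfs(toy, min_len=9))  # [(3, 11, 3)]
```

The min_len argument plays the same role as the "Minimum nucleotide size of ORF to report" prompt above: short ORFs are found but not reported.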

4.4.4. Translating the Sequence

From the previous exercise you should have found that the region to be translated is from 110 to 1171 in the cDNA sequence. Now you can use transeq to translate that region and use the translated peptide for some further analyses.

4.4.5. Exercise: transeq

Let's practice using command line flags (qualifiers) again. The new ones here are -sbegin and -send. These allow you to specify a subregion of your sequence; in this case you will ask transeq to translate only the part of embl:L07770 that you have identified as the coding region. You should remember -outseq from before:

% transeq embl:L07770 -sbegin 110 -send 1171 -outseq L07770.pep

Translate nucleic acid sequences

% more L07770.pep

>L07770+1 Xenopus laevis rhodopsin mRNA, complete cds.
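
The effect of translating only a subregion can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than transeq itself: it uses the standard genetic code only, the begin and end arguments mimic -sbegin and -send as 1-based inclusive positions, and the name translate_region is invented for the example.

```python
# Illustrative only: standard genetic code, forward strand, frame 1 of the region.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the 64-entry codon table from the conventional TCAG ordering.
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_region(seq, begin, end):
    """Translate seq[begin..end] (1-based, inclusive) into protein.

    Codons containing anything other than A, C, G, T are shown as 'X';
    a trailing partial codon is ignored.
    """
    region = seq.upper().replace("U", "T")[begin - 1:end]
    protein = []
    for i in range(0, len(region) - 2, 3):
        protein.append(CODON_TABLE.get(region[i:i + 3], "X"))
    return "".join(protein)


toy = "CCATGAAATTTTAGGG"
print(translate_region(toy, 3, 11))  # MKF
```

Shifting begin by one or two bases changes the reading frame, which is why the exact start position reported by getorf matters.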

You saw earlier that the SwissProt entry for this protein has the identifier opsd_xenla; test your understanding of EMBOSS so far by using needle to compare your translated product with the database sequence. Compare your findings with the SwissProt entry.

4.4.6. USA for Partial Sequences

As an alternative to -sbegin and -send you can specify the start, the end and whether to reverse complement as part of the sequence USA. The format to use is db:sequence[start:end] (or db:sequence[start:end:r] to reverse complement). Start must be smaller than end. To use the actual start or end of the sequence, give the value 0 in place of a position. To count from the end of the sequence rather than from the beginning, use negative numbers. See the examples below.


Residues 10-20                      sw:opsd_xenla[10:20]
The last ten residues               sw:opsd_xenla[-10:0]
The last twenty residues bar 5      sw:opsd_xenla[-20:-6]
Bases 134-458 reverse complement    embl:L07770[134:458:r]
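
The position rules above can be sketched in Python. This is only an illustration of the rules as described in this section, not EMBOSS's own USA parser: it handles just the db:sequence[start:end] and db:sequence[start:end:r] forms, it assumes that a negative position -n means the n-th residue from the end (so -1 is the last residue), and the names parse_usa and subsequence are invented for the example.

```python
import re

# Illustrative only: handles db:entry, db:entry[start:end] and db:entry[start:end:r].
USA_RE = re.compile(
    r"^(?P<db>[^:\[\]]+):(?P<entry>[^:\[\]]+)"
    r"(?:\[(?P<start>-?\d+):(?P<end>-?\d+)(?::(?P<rev>r))?\])?$"
)

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_usa(usa):
    """Split a USA into (db, entry, start, end, reverse)."""
    m = USA_RE.match(usa)
    if m is None:
        raise ValueError("unsupported USA: %r" % usa)
    start = int(m.group("start") or 0)
    end = int(m.group("end") or 0)
    return m.group("db"), m.group("entry"), start, end, m.group("rev") is not None


def resolve(pos, length, default):
    """Map a USA position to a 1-based position on a sequence of given length."""
    if pos == 0:
        return default           # 0 means "the actual start/end"
    if pos < 0:
        return length + pos + 1  # assumed: -1 is the last residue
    return pos


def subsequence(seq, start, end, reverse=False):
    """Return the requested region, reverse complemented if asked."""
    first = resolve(start, len(seq), 1)
    last = resolve(end, len(seq), len(seq))
    if first > last:
        raise ValueError("start must not be larger than end")
    part = seq[first - 1:last]
    if reverse:
        part = part.translate(COMPLEMENT)[::-1]
    return part


alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(parse_usa("embl:L07770[134:458:r]"))
print(subsequence(alphabet, 10, 20))    # JKLMNOPQRST
print(subsequence(alphabet, -10, 0))    # QRSTUVWXYZ
print(subsequence("AACCGT", 2, 5, True))  # CGGT
```

Under these assumptions, [-20:-6] on a 26-letter sequence selects positions 7 to 21, that is, the last twenty residues without the final five.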