This tutorial will examine members of the rhodopsin family of G-protein coupled receptors. The general principles are, of course, applicable to any sequences you would like to analyse. You will be working with sequences retrieved from EMBL and SwissProt but you can also use EMBOSS with sequences in text files.
We will begin with two EMBL sequences whose identifiers are U23808
and L07770
; these sequences are the genomic and the corresponding cDNA sequence for Xenopus laevis rhodopsin.
You need to tell EMBOSS from where to read the sequence(s) you want to analyse. EMBOSS can read sequences either from text files or directly from a sequence database. The easiest way to see this is with examples.
The EMBOSS programs can read sequences from various sequence databases as long as the sequence is referred to in the form database:entry
. This format is an example of a Uniform Sequence Address or USA (Section 6.6, “The Uniform Sequence Address (USA)”). You can see the databases that are set up using the program showdb.
A few databases available on our test machine are shown below. Your local site will probably have a different selection of databases depending on what the local EMBOSS maintainer has set up:
%
showdb
Displays information on the currently available databases #Name Type ID Qry All Comment #==== ==== == === === ======= nbrf P OK OK OK PIR/NBRF pir P OK OK OK PIR/NBRF embl P OK OK OK EMBL remtrembl P OK OK OK REMTREMBL sequences sptrembl P OK OK OK SPTREMBL sequences sw P OK OK OK SWISSPROT sequences swissprot P OK OK OK SWISSPROT sequences trarc P OK OK OK TREMBL ARC sequences trembl P OK OK OK TREMBL sequences tremblnew P OK OK OK New TREMBL sequences
showdb
writes out a simple table displaying the names, contents and access methods for the databases:
ID
allows programs to extract a single explicitly named entry from the database, for example: embl:x13776
Qry
indicates that programs can extract a set of matching wildcard entry names. For example: swissprot:pax*_human
All
allows programs to analyse all the entries in the database sequentially. For example: embl:*
You can access EMBL by either identifier e.g. L07770
, or accession number e.g. L07770
. Let's try these now.
seqret reads in a sequence and writes it out. It is probably the most commonly used EMBOSS program.
%
seqret
Reads and writes (returns) a sequence> Input sequence:embl:L07770
Output sequence [L07770.fasta]:%
more L07770.fasta
>L07770 L07770 Xenopus laevis rhodopsin mRNA, complete cds. ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac . . Output truncated for brevity .
Now let's retrieve the sequence using its the accession number:
%
seqret
Reads and writes (returns) a sequence Input sequence:embl:L07770
Output sequence [L07770.fasta]:L077702.fasta
%
more L077702.fasta
>L07770 L07770 Xenopus laevis rhodopsin mRNA, complete cds. ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac . . Output truncated for brevity .
You could also run this example entirely from the command line:
seqret embl:L07770 -outseq L07770.fasta |
By default, seqret writes the sequence in FASTA format. You can also tell it to use a different output format:
seqret embl:L07770 -outseq gcg::L07770.gcg |
As an alternative to specifying the format in the output sequence USA you can use the -osformat
qualifier. This command is identical in action to the previous one:
%
seqret embl:L07770 -outseq L07770.gcg -osformat gcg
%
more L07770.gcg
!!NA_SEQUENCE 1.0 Xenopus laevis rhodopsin mRNA, complete cds. L07770 Length: 1684 Type: N Check: 9453 .. 1 ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca 51 aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga . . Output truncated for brevity .
The list of the formats that EMBOSS supports is described elsewhere (Section A.1, “Supported Sequence Formats”).
EMBOSS can also read sequences from files. For example, if you wanted to reformat the FASTA sequence you produced above into GCG format, you could type:
seqret L07770.fasta -outseq gcg::myseq.gcg |
or
seqret L07770.fasta -outseq myseq.gcg -osformat gcg |
infoseq is a small utility to list the sequences' USA, name, accession number, type (nucleic or protein), length, percentage G+C (for nucleic), and/or description. To view this information for the sequence:
%
infoseq
Displays some simple information about sequences Input sequence(s): sw:opsd_* # USA Name Accession Type Length Description sw-id:OPSD_ABYKO OPSD_ABYKO O42294 P 289 RHODOPSIN (FRAGMENT). sw-id:OPSD_ALLMI OPSD_ALLMI P52202 P 352 RHODOPSIN. sw-id:OPSD_AMBTI OPSD_AMBTI Q90245 P 354 RHODOPSIN. sw-id:OPSD_ANGAN OPSD_ANGAN Q90214 P 352 RHODOPSIN, DEEP-SEA sw-id:OPSD_ANOCA OPSD_ANOCA P41591 P 352 RHODOPSIN. sw-id:OPSD_APIME OPSD_APIME Q17053 P 377 RHODOPSIN. sw-id:OPSD_ASTFA OPSD_ASTFA P41590 P 352 RHODOPSIN. sw-id:OPSD_BATMU OPSD_BATMU O42300 P 289 RHODOPSIN (FRAGMENT). sw-id:OPSD_BATNI OPSD_BATNI O42301 P 289 RHODOPSIN (FRAGMENT). sw-id:OPSD_BOVIN OPSD_BOVIN P02699 P 348 RHODOPSIN.
Sequence databases do not just contain sequences, they also contain a great deal of associated information (annotation) about the sequence entries. By default EMBOSS does not return all this information when you run seqret
.
To retrieve the full entry for a sequence in its original database form you can use the utility entret:
%
entret embl:U23808
Reads and writes (returns) flatfile entries Output file [U23808.entret]:%
more U23808.entret
ID U23808 standard; DNA; VRT; 4734 BP. XX AC U23808; XX SV U23808.1 XX DT 23-APR-1995 (Rel. 43, Created) DT 04-MAR-2000 (Rel. 63, Last updated, Version 7) XX DE Xenopus laevis rhodopsin gene, complete cds. XX KW . XX OS Xenopus laevis (African clawed frog) OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Amphibia; OC Batrachia; Anura; Mesobatrachia; Pipoidea; Pipidae; Xenopodinae; Xenopus. XX . . Output truncated for brevity .
There is a lot of information here. Near the bottom, just above the sequence itself, is a list of features associated with the sequence. A feature is any defined region of the sequence that has a particular description associated with it. To view a simple graphical overview of the sequence features use the utility showfeat:
%
showfeat embl:U23808
Show features of a sequence. Output file [U23808.showfeat]:%
more U23808.showfeat
U23808 Xenopus laevis rhodopsin gene, complete cds. |==========================================================| 4734 |----------------------------------------------------------> source |-----> mRNA |---> CDS |-> CDS |-> mRNA |-> CDS |-> mRNA |--> CDS |--> mRNA |> CDS |-------> mRNA
To retrieve a sequence with all its features you can use the program seqret with the option -feature
. This has some properties of which to be aware:
%
seqret -feature embl:U23808
Reads and writes (returns) one or more sequences Output sequence [U23808.fasta]:
This looks pretty much like running seqret
except that a second file has been created, u23808.gff
. Take a look at this file:
%
more u23808.gff
##gff-version 3 ##sequence-region U23808 1 8914 #!Date 2009-02-18 #!Type DNA #!Source-version EMBOSS 6.1.0 U23808 EMBL databank_entry 1 8914 0.000 + . ID="U238 08.1";organism="Xenopus laevis";mol_type="genomic DNA";db_xref="taxon:8355" U23808 EMBL promoter 5128 5360 0.000 + . ID="U238 08.2" U23808 EMBL located_sequence_feature 5128 5158 0.000 + . ID="U23808.3";note="XOP4 cis element" U23808 EMBL located_sequence_feature 5191 5215 0.000 + . ID="U23808.4";note="XOP3 cis element" U23808 EMBL located_sequence_feature 5225 5239 0.000 + . ID="U23808.5";note="Ret1 cis element" . . Output truncated for brevity .
This is a list of the features in the database entry in GFF (General Feature Format). More information on feature formats is available elsewhere (Section 5.3, “Introduction to Feature Formats”). In order to change the feature formats and filename you need to use associated qualifiers when running seqret -feature
. Let's save the features in EMBL format in the file rhodop.features
:
%
seqret -feature embl:U23808 -offormat embl -ofname rhodop.features
Reads and writes (returns) one or more sequences Output sequence [U23808.fasta]:
And the file appears as expected.
We could use a Uniform Feature Object (UFO) instead of the -offormat
and -ofname
qualifiers:
seqret -feature embl:U23808 -oufo embl::rhodop.features |
EMBOSS programs can also deal with multiple sequences. A quick search using SRS will tell you that the SwissProt sequence corresponding to the EMBL sequence we've been looking at has the identifier OPSD_XENLA
. To retrieve the information about all the other OPSD sequences in SwissProt you can use the wildcard character:
%
infoseq
Displays some simple information about sequences Input sequence(s):sw:opsd_*
# USA Name Accession Type Length Description sw-id:OPSD_ABYKO OPSD_ABYKO O42294 P 289 RHODOPSIN (FRAGMENT). sw-id:OPSD_ALLMI OPSD_ALLMI P52202 P 352 RHODOPSIN. sw-id:OPSD_AMBTI OPSD_AMBTI Q90245 P 354 RHODOPSIN. sw-id:OPSD_ANGAN OPSD_ANGAN Q90214 P 352 RHODOPSIN, DEEP-SEA sw-id:OPSD_ANOCA OPSD_ANOCA P41591 P 352 RHODOPSIN. sw-id:OPSD_APIME OPSD_APIME Q17053 P 377 RHODOPSIN. sw-id:OPSD_ASTFA OPSD_ASTFA P41590 P 352 RHODOPSIN. sw-id:OPSD_BATMU OPSD_BATMU O42300 P 289 RHODOPSIN (FRAGMENT). sw-id:OPSD_BATNI OPSD_BATNI O42301 P 289 RHODOPSIN (FRAGMENT). sw-id:OPSD_BOVIN OPSD_BOVIN P02699 P 348 RHODOPSIN.
We can also use the wildcard character on the command line, but here you must enclose the specification in quotation marks to stop the UNIX shell trying to interpret the asterisk. One of:
infoseq "sw:opsd_*" |
infoseq sw:opsd_"*" |
The two commands are equivalent.
You can use seqret to retrieve multiple sequences into a file. For example:
seqret "sw:opsd_a*" -outseq opsd_a.seqs |
That retrieves all the sequences whose identifiers begin with "opsd_a" into a file called opsd_a.seqs
. If you wanted to have each sequence in a separate file, you could type:
seqret "sw:opsd_a*" -ossingle |
Filenames will be automatically generated based on the identifiers of the sequences.
It is also possible to use "listfiles" within EMBOSS. Instead of containing the sequences themselves, a listfile contains "references" to sequences - so, for example, you might include database entries, the names of files containing sequences, or even the names of other listfiles. You'll need to use a text editor to create the appropriate listfiles if you'd like to try this yourself.
Here's an example of a valid listfile, called seq.list
:
opsd_abyko.fasta sw:opsd_xenla sw:opsd_c* @another_list
If you have created this file then you can read it using:
more seq.list |
This may look a bit odd, but it's really very straightforward: The file contains:
opsd_abyko.fasta
- this is the name of a sequence file. The file is read in from the current directory.
sw:opsd_xenla
- this is a reference to a specific sequence in the SwissProt database
sw:opsd_c*
- this represents all the sequences in SwissProt whose identifiers start with "opsd_c"
another_list
- this is the name of a second listfile
Notice the @
in front of the last entry. This is the way you tell EMBOSS that this file is a listfile, not a regular sequence file. As an alternative to using the @
you can write list:filename
instead. Let's demonstrate this by using this file as the input to seqret and get the sequences into a new file, perhaps for use in a multiple sequence alignment.
First of all, make the file opsd_abyko.fasta
using seqret:
seqret sw:opsd_abyko -outseq opsd_abyko.fasta |
Now let's create another_list
. Note that its structure is very similar to that of seq.list
but this time only contains database references:
sw:opsd_anoca sw:opsd_apime sw:opsd_astfa
After you have created this file you will be able to view it using:
more another_list |
Finally, let's run seqret with seq.list
(not forgetting the @ sign) and look at the results:
%
seqret @seq.list -outseq outfile
%
more outfile
>OPSD_ABYKO O42294 RHODOPSIN (FRAGMENT). YLVNPAAYAALGAYMFLLILIGFPINFLTLYVTLEHKKLRTPLNYILLNLAVANLFMVLG GFTTTMYTSMHGYFVLGRLGCNLEAFFATLGGEIALWSLVVLAIERWIVVCKPISNFRFT EDHAIMGLAFTWVMALACAVPPLVGWSRYIPEGMQCSCGVDYYTRAEGFNNESFVIYMFI VHFLIPLSVIFFCYGRLLCAVKEAPAAQQESETTQRAEKEVSRMVVIMVIGFLVCWLPYA SVAWWIFCNQGSDFGPIFMTLPSFFAKSAAIYNPMIYICMNKQFRHCMI >OPSD_XENLA P29403 RHODOPSIN. MNGTEGPNFYVPMSNKTGVVRSPFDYPQYYLAEPWQYSALAAYMFLLILLGLPINFMTLF VTIQHKKLRTPLNYILLNLVFANHFMVLCGFTVTMYTSMHGYFIFGPTGCYIEGFFATLG GEVALWSLVVLAVERYIVVCKPMANFRFGENHAIMGVAFTWIMALSCAAPPLFGWSRYIP EGMQCSCGVDYYTLKPEVNNESFVIYMFIVHFTIPLIVIFFCYGRLLCTVKEAAAQQQES LTTQKAEKEVTRMVVIMVVFFLICWVPYAYVAFYIFTHQGSNFGPVFMTVPAFFAKSSAI YNPVIYIVLNKQFRNCLITTLCCGKNPFGDEDGSSAATSKTEASSVSSSQVSPA >OPSD_CAMAB Q17292 RHODOPSIN. MMSIASGPSHAAYTWASQGGGFGNQTVVDKVPPEMLHMVDAHWYQFPPMNPLWHALLGFV IGVLGVISVIGNGMVIYIFTTTKSLRTPSNLLVVNLAISDFLMMLCMSPAMVINCYYETW VLGPLFCELYGLAGSLFGCASIWTMTMIAFDRYNVIVKGLSAKPMTINGALIRILTIWFF TLAWTIAPMFGWNRYVPEGNMTACGTDYLTKDLFSRSYILIYSIFVYFTPLFLIIYSYFF IIQAVAAHEKNMREQAKKMNVASLRSAENQSTSAECKLAKVALMTISLWFMAWTPYLVIN YSGIFETTKISPLFTIWGSLFAKANAVYNPIVYGISHPKYRAALFQKFPSLACTTEPTGA DTMSTTTTVTEGNEKPAA >OPSD_CAMHU O18312 RHODOPSIN (FRAGMENT). LHMIHLHWYQYPPMNPMMYPLLLIFMLFTGILCLAGNFVTIWVFMNTKSLRTPANLLVVN LAMSDFLMMFTMFPPMMVTCYYHTWTLGPTFCQVYAFLGNLCGCASIWTMVFITFDRYNV IVKGVAGEPLSTKKASLWILSVWVLSTAWCIAPFFGWNHYVPEGNLTGCGTDYLSEDILS RSYLYIYSTWVYFLPLAITIYCYVFIIKAVAAHEKGMRDQAKKMGIKSLRNEEAQKTSAE CRLAKNAMTTVALWFIAWTPCLLINWVGMFARSYLSPVYTIWGYVFAKANAVYNPIVYAI S . . Output truncated for brevity .
Note that the output file contains all the sequences you specified in seq.list
, as you had expected.