4.2. Working with Sequences

This tutorial will examine members of the rhodopsin family of G-protein coupled receptors. The general principles are, of course, applicable to any sequences you would like to analyse. You will be working with sequences retrieved from EMBL and SwissProt but you can also use EMBOSS with sequences in text files.

We will begin with two EMBL sequences whose identifiers are U23808 and L07770; these sequences are the genomic and the corresponding cDNA sequence for Xenopus laevis rhodopsin.

You need to tell EMBOSS from where to read the sequence(s) you want to analyse. EMBOSS can read sequences either from text files or directly from a sequence database. The easiest way to see this is with examples.

4.2.1. Retrieving Sequences from Databases

The EMBOSS programs can read sequences from various sequence databases as long as the sequence is referred to in the form database:entry. This format is an example of a Uniform Sequence Address or USA (Section 6.6, “The Uniform Sequence Address (USA)”). You can see the databases that are set up using the program showdb.

4.2.2. Exercise: showdb

A few databases available on our test machine are shown below. Your local site will probably have a different selection of databases depending on what the local EMBOSS maintainer has set up:

% showdb
Displays information on the currently available databases

#Name     Type   ID   Qry  All   Comment
#====     ====   ==  ===   ===   =======
nbrf      P      OK  OK    OK    PIR/NBRF
pir       P      OK  OK    OK    PIR/NBRF
embl      P      OK  OK    OK    EMBL
remtrembl P      OK  OK    OK    REMTREMBL sequences
sptrembl  P      OK  OK    OK    SPTREMBL sequences
sw        P      OK  OK    OK    SWISSPROT sequences
swissprot P      OK  OK    OK    SWISSPROT sequences
trarc     P      OK  OK    OK    TREMBL ARC sequences
trembl    P      OK  OK    OK    TREMBL sequences
tremblnew P      OK  OK    OK    New TREMBL sequences

showdb writes out a simple table displaying the names, contents and access methods for the databases:

  • ID allows programs to extract a single explicitly named entry from the database, for example: embl:x13776

  • Qry indicates that programs can extract a set of matching wildcard entry names. For example: swissprot:pax*_human

  • All allows programs to analyse all the entries in the database sequentially. For example: embl:*

You can access EMBL by either identifier e.g. L07770, or accession number e.g. L07770. Let's try these now.

4.2.3. seqret

seqret reads in a sequence and writes it out. It is probably the most commonly used EMBOSS program.

4.2.4. Exercise: seqret

% seqret
Reads and writes (returns) a sequence>
Input sequence: embl:L07770
Output sequence [L07770.fasta]:


% more L07770.fasta


>L07770 L07770 Xenopus laevis  rhodopsin mRNA, complete cds.
ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac
acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac
.
. Output truncated for brevity
.

Now let's retrieve the sequence using its the accession number:

% seqret
Reads and writes (returns) a sequence
Input sequence: embl:L07770
Output sequence [L07770.fasta]: L077702.fasta

% more L077702.fasta

>L07770 L07770 Xenopus laevis  rhodopsin mRNA, complete cds.
ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac
acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac
.
. Output truncated for brevity
.

You could also run this example entirely from the command line:

seqret embl:L07770 -outseq L07770.fasta

By default, seqret writes the sequence in FASTA format. You can also tell it to use a different output format:

seqret embl:L07770 -outseq gcg::L07770.gcg

As an alternative to specifying the format in the output sequence USA you can use the -osformat qualifier. This command is identical in action to the previous one:

% seqret embl:L07770 -outseq L07770.gcg -osformat gcg

% more L07770.gcg


!!NA_SEQUENCE 1.0

Xenopus laevis rhodopsin mRNA, complete cds.

L07770  Length: 1684  Type: N  Check: 9453 ..

    1 ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca

   51 aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga
.
. Output truncated for brevity
.

The list of the formats that EMBOSS supports is described elsewhere (Section A.1, “Supported Sequence Formats”).

4.2.5. Reading Sequences from Files

EMBOSS can also read sequences from files. For example, if you wanted to reformat the FASTA sequence you produced above into GCG format, you could type:

seqret L07770.fasta -outseq gcg::myseq.gcg

or

seqret L07770.fasta -outseq myseq.gcg -osformat gcg

4.2.6. infoseq

infoseq is a small utility to list the sequences' USA, name, accession number, type (nucleic or protein), length, percentage G+C (for nucleic), and/or description. To view this information for the sequence:

% infoseq
Displays some simple information about sequences

Input sequence(s): sw:opsd_*
# USA              Name         Accession   Type    Length    Description
sw-id:OPSD_ABYKO   OPSD_ABYKO   O42294      P       289       RHODOPSIN (FRAGMENT).
sw-id:OPSD_ALLMI   OPSD_ALLMI   P52202      P       352       RHODOPSIN.
sw-id:OPSD_AMBTI   OPSD_AMBTI   Q90245      P       354       RHODOPSIN.
sw-id:OPSD_ANGAN   OPSD_ANGAN   Q90214      P       352       RHODOPSIN, DEEP-SEA
sw-id:OPSD_ANOCA   OPSD_ANOCA   P41591      P       352       RHODOPSIN.
sw-id:OPSD_APIME   OPSD_APIME   Q17053      P       377       RHODOPSIN.
sw-id:OPSD_ASTFA   OPSD_ASTFA   P41590      P       352       RHODOPSIN.
sw-id:OPSD_BATMU   OPSD_BATMU   O42300      P       289       RHODOPSIN (FRAGMENT).
sw-id:OPSD_BATNI   OPSD_BATNI   O42301      P       289       RHODOPSIN (FRAGMENT).
sw-id:OPSD_BOVIN   OPSD_BOVIN   P02699      P       348       RHODOPSIN.

4.2.7. Sequence Annotation

Sequence databases do not just contain sequences, they also contain a great deal of associated information (annotation) about the sequence entries. By default EMBOSS does not return all this information when you run seqret.

To retrieve the full entry for a sequence in its original database form you can use the utility entret:

% entret embl:U23808
Reads and writes (returns) flatfile entries
Output file [U23808.entret]:


% more U23808.entret
ID U23808 standard; DNA; VRT; 4734 BP.
XX
AC U23808;
XX
SV U23808.1
XX
DT 23-APR-1995 (Rel. 43, Created)
DT 04-MAR-2000 (Rel. 63, Last updated, Version 7)
XX
DE Xenopus laevis rhodopsin gene, complete cds.
XX
KW .
XX
OS Xenopus laevis (African clawed frog)
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Amphibia;
OC Batrachia; Anura; Mesobatrachia; Pipoidea; Pipidae; Xenopodinae; Xenopus.
XX
.
. Output truncated for brevity
.

There is a lot of information here. Near the bottom, just above the sequence itself, is a list of features associated with the sequence. A feature is any defined region of the sequence that has a particular description associated with it. To view a simple graphical overview of the sequence features use the utility showfeat:

% showfeat embl:U23808
Show features of a sequence.
Output file [U23808.showfeat]:

% more U23808.showfeat



U23808
Xenopus laevis rhodopsin gene, complete cds.
|==========================================================| 4734
|----------------------------------------------------------> source
             |----->                                         mRNA
               |--->                                         CDS
                       |->                                   CDS
                       |->                                   mRNA
                                |->                          CDS
                                |->                          mRNA
                                      |-->                   CDS
                                      |-->                   mRNA
                                                  |>         CDS
                                                  |------->  mRNA

To retrieve a sequence with all its features you can use the program seqret with the option -feature. This has some properties of which to be aware:

% seqret -feature embl:U23808
Reads and writes (returns) one or more sequences
Output sequence [U23808.fasta]:

This looks pretty much like running seqret except that a second file has been created, u23808.gff. Take a look at this file:

% more u23808.gff

##gff-version 3
##sequence-region U23808 1 8914
#!Date 2009-02-18
#!Type DNA
#!Source-version EMBOSS 6.1.0
U23808	EMBL	databank_entry	1	8914	0.000	+	.	ID="U238
08.1";organism="Xenopus laevis";mol_type="genomic DNA";db_xref="taxon:8355"
U23808	EMBL	promoter	5128	5360	0.000	+	.	ID="U238
08.2"
U23808	EMBL	located_sequence_feature	5128	5158	0.000	+	
.	ID="U23808.3";note="XOP4 cis element"
U23808	EMBL	located_sequence_feature	5191	5215	0.000	+	
.	ID="U23808.4";note="XOP3 cis element"
U23808	EMBL	located_sequence_feature	5225	5239	0.000	+	
.	ID="U23808.5";note="Ret1 cis element"

.
. Output truncated for brevity
.

This is a list of the features in the database entry in GFF (General Feature Format). More information on feature formats is available elsewhere (Section 5.3, “Introduction to Feature Formats”). In order to change the feature formats and filename you need to use associated qualifiers when running seqret -feature. Let's save the features in EMBL format in the file rhodop.features:

% seqret -feature embl:U23808 -offormat embl -ofname rhodop.features
Reads and writes (returns) one or more sequences
Output sequence [U23808.fasta]:

And the file appears as expected.

We could use a Uniform Feature Object (UFO) instead of the -offormat and -ofname qualifiers:

seqret -feature embl:U23808 -oufo embl::rhodop.features

4.2.8. Using Multiple Sequences

EMBOSS programs can also deal with multiple sequences. A quick search using SRS will tell you that the SwissProt sequence corresponding to the EMBL sequence we've been looking at has the identifier OPSD_XENLA. To retrieve the information about all the other OPSD sequences in SwissProt you can use the wildcard character:

% infoseq
Displays some simple information about sequences
Input sequence(s): sw:opsd_*

# USA              Name         Accession Type Length  Description
sw-id:OPSD_ABYKO   OPSD_ABYKO   O42294    P    289     RHODOPSIN (FRAGMENT).
sw-id:OPSD_ALLMI   OPSD_ALLMI   P52202    P    352     RHODOPSIN.
sw-id:OPSD_AMBTI   OPSD_AMBTI   Q90245    P    354     RHODOPSIN.
sw-id:OPSD_ANGAN   OPSD_ANGAN   Q90214    P    352     RHODOPSIN, DEEP-SEA
sw-id:OPSD_ANOCA   OPSD_ANOCA   P41591    P    352     RHODOPSIN.
sw-id:OPSD_APIME   OPSD_APIME   Q17053    P    377     RHODOPSIN.
sw-id:OPSD_ASTFA   OPSD_ASTFA   P41590    P    352     RHODOPSIN.
sw-id:OPSD_BATMU   OPSD_BATMU   O42300    P    289     RHODOPSIN (FRAGMENT).
sw-id:OPSD_BATNI   OPSD_BATNI   O42301    P    289     RHODOPSIN (FRAGMENT).
sw-id:OPSD_BOVIN   OPSD_BOVIN   P02699    P    348     RHODOPSIN.

We can also use the wildcard character on the command line, but here you must enclose the specification in quotation marks to stop the UNIX shell trying to interpret the asterisk. One of:

infoseq "sw:opsd_*"
infoseq sw:opsd_"*"

The two commands are equivalent.

You can use seqret to retrieve multiple sequences into a file. For example:

seqret "sw:opsd_a*" -outseq opsd_a.seqs

That retrieves all the sequences whose identifiers begin with "opsd_a" into a file called opsd_a.seqs. If you wanted to have each sequence in a separate file, you could type:

seqret "sw:opsd_a*" -ossingle

Filenames will be automatically generated based on the identifiers of the sequences.

4.2.9. Listfiles

It is also possible to use "listfiles" within EMBOSS. Instead of containing the sequences themselves, a listfile contains "references" to sequences - so, for example, you might include database entries, the names of files containing sequences, or even the names of other listfiles. You'll need to use a text editor to create the appropriate listfiles if you'd like to try this yourself.

Here's an example of a valid listfile, called seq.list:

opsd_abyko.fasta
sw:opsd_xenla
sw:opsd_c*
@another_list

If you have created this file then you can read it using:

more seq.list

This may look a bit odd, but it's really very straightforward: The file contains:

  • opsd_abyko.fasta - this is the name of a sequence file. The file is read in from the current directory.

  • sw:opsd_xenla - this is a reference to a specific sequence in the SwissProt database

  • sw:opsd_c* - this represents all the sequences in SwissProt whose identifiers start with "opsd_c"

  • another_list - this is the name of a second listfile

Notice the @ in front of the last entry. This is the way you tell EMBOSS that this file is a listfile, not a regular sequence file. As an alternative to using the @ you can write list:filename instead. Let's demonstrate this by using this file as the input to seqret and get the sequences into a new file, perhaps for use in a multiple sequence alignment.

First of all, make the file opsd_abyko.fasta using seqret:

seqret sw:opsd_abyko -outseq opsd_abyko.fasta

Now let's create another_list. Note that its structure is very similar to that of seq.list but this time only contains database references:

sw:opsd_anoca
sw:opsd_apime
sw:opsd_astfa

After you have created this file you will be able to view it using:

more another_list

Finally, let's run seqret with seq.list (not forgetting the @ sign) and look at the results:

% seqret @seq.list -outseq outfile

% more outfile


>OPSD_ABYKO O42294 RHODOPSIN (FRAGMENT).
YLVNPAAYAALGAYMFLLILIGFPINFLTLYVTLEHKKLRTPLNYILLNLAVANLFMVLG
GFTTTMYTSMHGYFVLGRLGCNLEAFFATLGGEIALWSLVVLAIERWIVVCKPISNFRFT
EDHAIMGLAFTWVMALACAVPPLVGWSRYIPEGMQCSCGVDYYTRAEGFNNESFVIYMFI
VHFLIPLSVIFFCYGRLLCAVKEAPAAQQESETTQRAEKEVSRMVVIMVIGFLVCWLPYA
SVAWWIFCNQGSDFGPIFMTLPSFFAKSAAIYNPMIYICMNKQFRHCMI
>OPSD_XENLA P29403 RHODOPSIN.
MNGTEGPNFYVPMSNKTGVVRSPFDYPQYYLAEPWQYSALAAYMFLLILLGLPINFMTLF
VTIQHKKLRTPLNYILLNLVFANHFMVLCGFTVTMYTSMHGYFIFGPTGCYIEGFFATLG
GEVALWSLVVLAVERYIVVCKPMANFRFGENHAIMGVAFTWIMALSCAAPPLFGWSRYIP
EGMQCSCGVDYYTLKPEVNNESFVIYMFIVHFTIPLIVIFFCYGRLLCTVKEAAAQQQES
LTTQKAEKEVTRMVVIMVVFFLICWVPYAYVAFYIFTHQGSNFGPVFMTVPAFFAKSSAI
YNPVIYIVLNKQFRNCLITTLCCGKNPFGDEDGSSAATSKTEASSVSSSQVSPA
>OPSD_CAMAB Q17292 RHODOPSIN.
MMSIASGPSHAAYTWASQGGGFGNQTVVDKVPPEMLHMVDAHWYQFPPMNPLWHALLGFV
IGVLGVISVIGNGMVIYIFTTTKSLRTPSNLLVVNLAISDFLMMLCMSPAMVINCYYETW
VLGPLFCELYGLAGSLFGCASIWTMTMIAFDRYNVIVKGLSAKPMTINGALIRILTIWFF
TLAWTIAPMFGWNRYVPEGNMTACGTDYLTKDLFSRSYILIYSIFVYFTPLFLIIYSYFF
IIQAVAAHEKNMREQAKKMNVASLRSAENQSTSAECKLAKVALMTISLWFMAWTPYLVIN
YSGIFETTKISPLFTIWGSLFAKANAVYNPIVYGISHPKYRAALFQKFPSLACTTEPTGA
DTMSTTTTVTEGNEKPAA
>OPSD_CAMHU O18312 RHODOPSIN (FRAGMENT).
LHMIHLHWYQYPPMNPMMYPLLLIFMLFTGILCLAGNFVTIWVFMNTKSLRTPANLLVVN
LAMSDFLMMFTMFPPMMVTCYYHTWTLGPTFCQVYAFLGNLCGCASIWTMVFITFDRYNV
IVKGVAGEPLSTKKASLWILSVWVLSTAWCIAPFFGWNHYVPEGNLTGCGTDYLSEDILS
RSYLYIYSTWVYFLPLAITIYCYVFIIKAVAAHEKGMRDQAKKMGIKSLRNEEAQKTSAE
CRLAKNAMTTVALWFIAWTPCLLINWVGMFARSYLSPVYTIWGYVFAKANAVYNPIVYAI
S
.
. Output truncated for brevity
.

Note that the output file contains all the sequences you specified in seq.list, as you had expected.