The Uniform Sequence Address (USA) is a standard sequence naming scheme used by all EMBOSS applications. Typically, one or more sequences are read from a file or from a larger database. However, other sources such as an application or web server can be specified in a USA (Section 6.6, “The Uniform Sequence Address (USA)”).
A USA specifies:
The sequence format to expect
The file or database to open
The entry or entries to read
The general format of a USA specification is:
    Format::FileName:Entry
    Format::DatabaseName:Entry
where Format is the database format of a file of sequences (FileName) or installed database (DatabaseName) you have provided and Entry is the database entry code.
Only FileName or DatabaseName is strictly necessary. If the expected format is omitted, EMBOSS will try each supported format in a carefully organised order (Section A.1, “Supported Sequence Formats”) until one succeeds. If the database entry code is omitted, then all of the entries in the file or database are read.
Here are some common variants of USAs:
    Format::FileName
    Format::FileName:Entry
    FileName:Entry
    DatabaseName:Entry
    DatabaseName
    @ListFileName
ListFileName is the name of a listfile, which itself contains a list of valid USAs. The :: and : syntax allows, for example, "embl" and "pir" to be both database names and sequence formats.
In the following examples, AccessionNumber is the sequence's accession number in the database, and DatabaseId is its identifier:
    FileName                        myfile.seq
    DatabaseName:AccessionNumber    embl:X65923
    DatabaseName:DatabaseId         swissprot:opsd_xenla
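As an illustration of the basic forms above, the following Python sketch splits a simple USA into its format, source and entry parts. It is not part of EMBOSS: the function name split_usa and the ParsedUSA type are hypothetical, and the sketch deliberately ignores asis::, listfiles, search fields, program pipes and subsequence specifiers.

```python
from typing import NamedTuple, Optional


class ParsedUSA(NamedTuple):
    fmt: Optional[str]     # text before "::", if given
    source: str            # file name or database name
    entry: Optional[str]   # text after the single ":", if given


def split_usa(usa: str) -> ParsedUSA:
    """Split a simple USA of the form [Format::]Source[:Entry]."""
    fmt = None
    rest = usa
    if "::" in usa:
        # The double colon separates the optional format from the rest.
        fmt, rest = usa.split("::", 1)
    # The first single colon separates the source from the entry.
    source, sep, entry = rest.partition(":")
    return ParsedUSA(fmt or None, source, entry if sep else None)
```

For example, split_usa("embl:X65923") gives a source of embl and an entry of X65923, with no format.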
The full command line syntax of the possible USAs is given below. Whitespace has been added for clarity but would not be used on the command line:
    asis :: Sequence [start : end : reverse]
    Format :: @ ListFileName [start : end : reverse]
    Format :: list : ListFileName [start : end : reverse]
    Format :: DatabaseName : Entry [start : end : reverse]
    Format :: DatabaseName - SearchField : Word [start : end : reverse]
    Format :: FileName : Entry [start : end : reverse]
    Format :: FileName : SearchField : Word [start : end : reverse]
    Format :: ProgramName ProgramParameters | [start : end : reverse]
The tokens (Sequence, Format etc.) are described below.
Sequence is an explicit sequence in either upper or lower case, for example:
    atgctgacgatgcg
    TPRPGKNTEARLNCF
Format must be the name of one of the valid sequence formats (Section A.1, “Supported Sequence Formats”).
The sequence format may usually be omitted when reading in a sequence; EMBOSS will try most known sequence formats until it can read the sequence.
ListFileName is the name of a listfile: a file of USAs with one USA per line. Either @ or list: is required before the name of the listfile to indicate that it is a listfile. Listfiles may be nested (a listfile may contain the USA of another listfile).
Where the subsequence specifier [start : end : reverse] is used, all the USAs in the listfile are affected, unless these USAs have their own [start : end : reverse] specifier, in which case the one given on the command line is overridden.
This also holds true where the sequence is specified with -sbegin, -send or any other command line qualifier (Section 6.4, “Datatype-specific Command Line Qualifiers”) which affects the input sequence: all USAs in the listfile are affected unless they have their own sequence specification.
DatabaseName must be a valid database name as defined in the EMBOSS configuration files (Section 2.8, “Maintenance”).
If the name is not a valid database, a file with the same name is looked for instead. Database names may have search field names appended to them (for example embl-des, embl-id) (see below).
Entry specifies the ID name or accession number of one or more sequences in a database or file. If it is omitted, then all the entries in the database or file will be read. Entry may be wildcarded. For example, hs* will match all ID names starting with hs, while * on its own indicates that all entries in the database or file will be read.
There may be restrictions on certain databases preventing access to a single entry, wildcarded entries or reading in all entries. This is a consequence of the way some databases are accessed. The restrictions are given in the database definition (see the EMBOSS Administrators Guide).
A database or file location must be given as part of a USA that has an Entry; you cannot give an entry name on its own, i.e. you cannot give just an accession number or ID name and expect EMBOSS to deduce that it is indeed an accession number or ID name and to which database it might refer.
SearchField is the name of one of the available search fields shown in the table (Table 6.3, “Sequence Retrieval Search Fields”).
Name | Search Field |
---|---|
acc | Accession number |
des | Description |
id | ID name |
key | Keyword |
org | Organism name |
sv | Sequence version/GI number |
Word is the keyword to search for in the search field. Words may be wildcarded.
Words in ORG and KEY fields may contain spaces because the complete key-phrase or organism classification level (the text, including spaces, between the semicolons (;) delimiting sections of these fields) is indexed as one 'word'.
Words in the DES field contain only alphanumeric characters and thus end at spaces or other non-alphanumeric characters.
The words in ID and ACC fields are equivalent to Entry above.
ProgramName is the name of a sequence retrieval application in the current path. ProgramParameters are any parameters it takes in order to specify one or more entries.
Any USA may optionally take a subsequence specifier after the main body of the USA in one of the following forms:
    [start : end]
    [start : end : r]
where start and end are the required start and end positions. Negative positions count from the end of the sequence. Zero values for start and end stand for the default values, i.e. position 1 and the length of the sequence respectively. A trailing r requests the reverse complement of a nucleotide sequence.
Use of the USA subsequence specifier is equivalent to using the -sbegin, -send or -sreverse command line qualifiers. For more information see Section 6.4, “Datatype-specific Command Line Qualifiers”.
The format, if specified, goes right at the start of the USA and is followed by a double colon (::), for example fasta::xxx.seq.
The sequence format can be any of those supported by EMBOSS (Section A.1, “Supported Sequence Formats”).
If the format is omitted from the USA, EMBOSS will check supported formats, in a carefully defined order, until the sequences are read successfully. Therefore it's not usually necessary to specify the format, although the application may run faster if you do as the tests will not need to be performed.
It's never necessary to specify the format of entries in a sequence database. All databases must be defined in the EMBOSS configuration files (Section 2.8, “Maintenance”) and the definitions include the format of the database.
The one case where it is recommended to specify the format is for sequence input in "plain" format, i.e. just the sequence without annotation, title or comments. This is because some variations of "plain" format may not otherwise be recognised by EMBOSS. If a format is not recognised, the application will fail with an informative error message.
The database name is specified in a USA before either an entry to retrieve or a search field:
    DatabaseName:Entry
    DatabaseName-SearchField:Word
The name of any database you've defined in your EMBOSS installation can be used. Databases are defined in your EMBOSS configuration files (Section 2.8, “Maintenance”). To find out what local databases are available run:
    showdb
This will give a table of the database names, whether they are protein or nucleic, and the types of access that are possible (see below). If EMBOSS was set up by your system administrator it's likely that one or more of the following major databases will have been set up:
EMBL - nucleic sequences from the EMBL-EBI
GenBank - nucleic sequences from the NCBI
SwissProt - protein sequences from the EMBL-EBI/ExPASy
PIR - protein sequences from the NBRF
Abbreviations of these names are often used, for example em for databases in EMBL format. There is no standard naming scheme for databases because total control over database setup (including naming) is given to you or your local system administrator (the person who set up EMBOSS at your site). The dot character ('.') is, however, not allowed in database names. EMBOSS interprets a '.' character as being part of a file name.
The simplest way to specify a database entry in a USA is:
    DatabaseName:Entry
where DatabaseName is the name of a database and Entry is either the sequence's accession number or ID in that database. For example:
    embl:x13776
    swissprot:opsd_xenla
EMBOSS will try searching for your specified sequence by both the accession number field and the ID name field. You don't need to specify whether you gave the accession number or ID. The database name and entry are case-insensitive: they can be in either upper or lower case. For example, EM:AF061303 is the same as em:af061303.
You cannot specify a sequence in EMBOSS by giving just the ID name or accession number; the database name must be given. You cannot therefore just give X65923 and expect EMBOSS to know what this is: it will assume that X65923 is the name of a database or a file, which of course is unlikely to exist.
It's common to run an application on all the entries in a database. This can be done by just giving the name of the database. Typically, however, an asterisk is used to indicate that all entries are required. Either of the following therefore refers to all of the entries in the EMBL database:
    embl
    embl:*
Often a set of wildcarded entry names in a database is required. A * matches any run of characters, whereas a ? matches a single character. For example:
    swissprot:*_human
refers to all the human entries in swissprot (strictly, it is all the entries in swissprot whose names end in _human).
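The wildcard behaviour described above can be imitated in Python with the standard fnmatch module. This is only a sketch of the matching rule, not how EMBOSS implements it: the helper name matches_entry and the sample entry names are made up for illustration, and fnmatch additionally treats square brackets as special, which is an assumption that may not hold for EMBOSS.

```python
from fnmatch import fnmatchcase


def matches_entry(pattern: str, name: str) -> bool:
    """Case-insensitive match where * is any text and ? is one character."""
    # Entry names are case-insensitive, so compare lower-cased copies.
    return fnmatchcase(name.lower(), pattern.lower())


# Hypothetical entry names used only to demonstrate the matching.
names = ["OPSD_HUMAN", "OPSD_XENLA", "PAX6_HUMAN"]
human_entries = [n for n in names if matches_entry("*_human", n)]
```

Here human_entries contains OPSD_HUMAN and PAX6_HUMAN, the two names ending in _human.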
The specifications for a complete database or wildcarded entry names both refer to multiple entries in a database, but are implemented in EMBOSS in a very different way. When all entries are read, the application starts at the beginning of the database and reads an entry at a time. In contrast, reading wildcarded entries requires an index file of entry ID names and accession numbers. The index file is queried and gives the positions in the database of those entries whose names match the wildcarded specification. For more information on database indexing see the EMBOSS Administrators Guide.
Not all databases will be searchable by all types of sequence specifications. For example, databases that are set up to access a web site will probably not allow retrieval of wildcarded entry name specifications or complete databases: it would take too long to transfer the files across the Internet!
The application showdb will give a list of the available databases, together with the ways in which they can be accessed. This information is given under the three columns ID, Query and All:
ID
Applications can extract a single explicitly-named entry from the database, e.g. embl:x13776
Query
Applications can extract a set of matching wildcard entry names, e.g. swissprot:pax*_human
All
Applications can read all entries sequentially, e.g. embl:*
Ideally all of the databases available on your site will be available using all three methods, but this may well not be the case, so you should check how you can access the databases by running showdb.
Be aware that using * or ? on the UNIX command line is problematic. The UNIX shell tries to interpret the word containing the * or ? as a wildcarded filename to be matched to existing files. When this fails, the shell may give an error message without running the application. To avoid this, these characters need to be enclosed in quotes or preceded by a backslash on the UNIX command line. For example:
    seqret "embl:*"
or
    seqret embl:\*
Quoting of wildcard characters is only required on the command line. It is not required when replying to an application prompt or when filling in a field on a GUI's form. This, for example, is fine:
    % seqret
    Reads and writes (returns) sequences
    Input sequence(s): embl:*
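When a command line is assembled by a script rather than typed by hand, the same quoting problem applies. The sketch below, which assumes a Python wrapper script (not something EMBOSS provides), uses the standard shlex.quote function to protect a wildcarded USA, and also shows the alternative of passing an argument list, which bypasses the shell entirely. The command is only built here, not run.

```python
import shlex

usa = "embl:*"

# As a single string for a shell: the wildcard must be quoted.
command_line = "seqret " + shlex.quote(usa)

# As an argument list (for example for subprocess.run without shell=True):
# no shell is involved, so the USA is passed through unchanged.
argv = ["seqret", usa]
```

A USA with no special characters, such as embl:X65923, is left unquoted by shlex.quote.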
stdin
There is a system filename (stdin) that you can give whenever an input filename is requested. If you enter this name, then the sequence will be read from standard input (normally the keyboard). This is only useful when you wish to type the sequence immediately, or are 'piping' the results from a previous application into the current application.
You can specify the format to read in by using format::stdin. For example:
    gcg::stdin
A sequence filename is specified in a USA before an entry to retrieve or a search field:
    FileName:Entry
    FileName:SearchField:Word
Any file containing sequences can be used, but the sequences must be in one of the formats that EMBOSS supports (Section A.1, “Supported Sequence Formats”). The filename is case-sensitive: FRED.SEQ is not the same filename as fred.seq.
Most sequence formats allow a file to contain more than one sequence. Some formats, however, such as gcg, plain, raw and staden, do not: they have no indication of where one sequence ends and the next starts.
If just the name of a file containing multiple sequences is specified, then all the sequences in that file will be read. This is the equivalent of specifying filename:*. For example
    myclones.seq
is the same thing as
    myclones.seq:*
The simplest way to specify a single specific sequence in a file containing multiple sequences is:
    FileName:Entry
where FileName is the name of a file and Entry is the sequence's ID name or accession number in that file. For example, the following USA would specify a sequence in the file myfile.fasta whose ID name is xyz_123:
    myfile.fasta:xyz_123
As for database entries, you cannot specify a sequence in EMBOSS by giving just the ID name; the file name must be given.
To help GCG users, an additional syntax is allowed where the entry name is enclosed in curly brackets:
    FileName{Entry}
When given on the command line the brackets must be escaped as follows:
    FileName\{Entry\}
To specify wildcarded sequence names, the wildcard characters '*' and '?' are again used. When used on the command line (but not in response to an EMBOSS prompt) they must be enclosed in quotes or preceded by a backslash. For example:
    myfile.fasta:IXI*      (in response to a prompt)
    "myfile.fasta:IXI*"    (on the command line)
will read in all sequences in the file myfile.fasta whose ID name starts with IXI.
A listfile is specified by giving @ or list: before the name of the listfile as follows:
    @ListFileName
    list:ListFileName
An EMBOSS listfile is a file of USAs with one USA per line. They are essentially the same idea as a "File of Filenames" used in the Staden Package. However, instead of containing the sequences themselves, a listfile contains references (USAs) to sequences. Any valid USA can be given as a reference so, for example, you might include database entries, the names of files containing sequences, or even the names of other listfiles. For example, here's a valid listfile:
    opsd_abyko.fasta
    sw:opsd_xenla
    sw:opsd_c*
    @another_list
The contents are as follows:
    opsd_abyko.fasta is the name of a sequence file
    sw:opsd_xenla is the name of a specific sequence in the swissprot database
    sw:opsd_c* specifies all the sequences in swissprot whose ID names start with opsd_c
    another_list is the name of a second (nested) listfile
Notice the @ in front of the last entry. This indicates the file is a listfile, not a regular sequence file. Alternatively, list: may be used in place of @.
Any blank lines or lines starting with a # character (typically used for informative comments) are ignored.
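The listfile rules above (one USA per line, blank lines and # comments ignored, nested listfiles introduced by @ or list:) can be sketched in Python as follows. This is an illustration only, not EMBOSS code: the function name expand_listfile is hypothetical, nested listfile names are assumed to be resolved relative to the current directory, and a format prefix in front of @ is not handled.

```python
from pathlib import Path
from typing import Iterator


def expand_listfile(path: str, _seen: frozenset = frozenset()) -> Iterator[str]:
    """Yield the USAs named in a listfile, following nested listfiles."""
    resolved = Path(path).resolve()
    if resolved in _seen:
        # Guard against a listfile that (directly or indirectly) names itself.
        raise ValueError(f"listfile refers to itself: {path}")
    seen = _seen | {resolved}
    for line in resolved.read_text().splitlines():
        usa = line.strip()
        if not usa or usa.startswith("#"):
            continue  # blank lines and comment lines are ignored
        if usa.startswith("@"):
            yield from expand_listfile(usa[1:].strip(), seen)
        elif usa.lower().startswith("list:"):
            yield from expand_listfile(usa[5:].lstrip(":").strip(), seen)
        else:
            yield usa
```

Calling list(expand_listfile("mylist")) would return the USAs in mylist with any nested listfiles replaced by their own contents, in order.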
The simplest USA specification uses asis to specify a sequence directly, i.e. as a string and not in a file or database. The syntax is:
    asis::Sequence
For example, asis::atgctagcttagctgac specifies the sequence atgctagcttagctgac.
asis can only specify one sequence at a time. The sequence has no ID name or title.
An unusual way of getting a sequence is to run an application to extract it from some other system. This is done by specifying the application's name and the parameters that identify the sequence, followed by a pipe (|) character:
    ProgramName ProgramParameters |
For example:
    getz -e [embl-id:AF061303] |
will invoke getz (the SRS sequence retrieval application) to extract entry AF061303 from EMBL. Any application or script which writes one or more sequences to screen (stdout) can be used in this way.
So far you have specified individual sequences in files or databases by using their ID names or accession numbers, which are the default search fields. There are, however, other ways to specify sequences using other data fields defined in sequence database entries. An excerpt from a typical sequence entry in EMBL format is shown below:
ID X65923; SV 1; linear; mRNA; STD; HUM; 518 BP.
XX
AC X65923;
XX
DT 13-MAY-1992 (Rel. 31, Created)
DT 18-APR-2005 (Rel. 83, Last updated, Version 11)
XX
DE H.sapiens fau mRNA
XX
KW fau gene.
XX
OS Homo sapiens (human)
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;
OC Homo.
XX
... The rest of entry is not shown
You can see the accession number (AC) and ID name (ID). Sequence retrieval is also possible by sequence version number (SV) and by specifying sequences that contain words occurring in their short description field (the DE line), their "Keyword" field (KW) or the Organism fields (OS and OC lines).
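To make the relationship between the two-letter line codes and the fields concrete, here is a small Python sketch that collects the text of each line type from an EMBL-style excerpt like the one above. It is a simplified illustration, not a full EMBL parser: the function name embl_fields is hypothetical, and it only handles annotation lines of the form "code, space, text".

```python
from collections import defaultdict
from typing import Dict


def embl_fields(entry_text: str) -> Dict[str, str]:
    """Join the text of each two-letter line type (ID, AC, DE, KW, ...)."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code, _, value = line.partition(" ")
        # Skip XX spacer lines and anything that is not a two-letter code.
        if len(code) == 2 and code != "XX" and value.strip():
            fields[code].append(value.strip())
    # Continuation lines (such as several OC lines) are joined with a space.
    return {code: " ".join(values) for code, values in fields.items()}
```

Applied to the excerpt above, the DE value would be "H.sapiens fau mRNA" and the OC value would be the full classification ending in "Homo.".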
A search for ID name, accession number and version number, which are all usually unique to a sequence, will retrieve a single sequence only. In contrast, words in the description or organism name, for example, are not unique and searches against such fields will probably find more than one match. In this case you will get more than one sequence entry returned, as is often the case when you specify a wildcarded ID name.
You must explicitly specify which field type to search by using one of the search field names given in the table below (Table 6.4, “Database Search Fields”), together with the data to search for.
Name | Search Field |
---|---|
acc | Accession number |
des | Description |
id | ID name |
key | Keyword |
org | Organism Name |
sv | Sequence Version/GI Number |
The type of field to search by is specified by adding a field name to the database name, for example:
    embl-des:fau
When specifying a search field in a sequence file (as opposed to a database) the notation is a little different: you use a ':' (colon) instead of a '-' (dash), for example:
    myclones.seq:des:fau
This is because myclones.seq-des could be a valid file name whereas myclones.seq:des is not.
Currently you can only specify one search field at a time.
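The difference between the database and file notations can be summarised in a short Python sketch. The helper field_query and its is_database flag are hypothetical names introduced for illustration; the field names are those listed in the table above.

```python
SEARCH_FIELDS = {"acc", "des", "id", "key", "org", "sv"}


def field_query(source: str, field: str, word: str, is_database: bool) -> str:
    """Build a single-field query USA for a database or a sequence file."""
    if field not in SEARCH_FIELDS:
        raise ValueError(f"unknown search field: {field}")
    # Databases join the field with a dash, files with a colon.
    separator = "-" if is_database else ":"
    return f"{source}{separator}{field}:{word}"
```

For instance, field_query("embl", "des", "fau", True) gives embl-des:fau, whereas field_query("myclones.seq", "des", "fau", False) gives myclones.seq:des:fau.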
Missing description, keyword, organism or sequence version fields cause queries to fail. If the file or database you are searching doesn't contain the field you are searching for then you will get an error message, something like:
    "Error: Unable to read sequence xxx.seq:org:homo"
The id and acc search fields can normally be omitted. If no search field is specified (for example embl:X13776), then the default is to search for a match in both the id and acc fields.
Using database-acc:AccessionNumber or file:acc:AccessionNumber is a way of telling EMBOSS that it need not try to search for the entry by testing both the ID name field and the accession number field; it only needs to test the accession number. This is allowed for ID too, for example database-id:ID. Specifying the acc and id search fields will make accessing the sequences slightly faster, but they are not required. EMBOSS applications report USAs in this style, however, so do not be alarmed when you see it.
The ORG, KEY and DES fields have the following meaning:
ORG
The full organism classification names (OC field in EMBL).
KEY
Words and phrases that classify the entry by form and function, as specified by the database curators (KW field in EMBL).
DES
Brief one-line description of the sequence entry (DE field in EMBL). This field is the title line in simple sequence formats, such as fasta format.
Searches in these fields are by word. For example, embl-des:fau will search for the text "fau" in the description field. If you wish to search for part of a word, use an asterisk to indicate a wildcard, for example embl-des:h*emoglobin. The searches are case-insensitive: 'Human' is the same as 'human'.
The definition of a 'word' in KEY and ORG searches is anything that matches the text field (including spaces) between the semicolons (;) delimiting the sections of these fields, or the entire field if no sections are described, as is the case for the KW field in the EMBL example above.
Therefore, embl-key:"fau gene" would match the entry X65923 displayed above, as would embl-key:fau*, but embl-key:fau would not match it.
Similarly, embl-org:"homo sapiens (human)", embl-org:*human* and embl-org:hominidae would match this entry, but embl-org:human would not match it, as the 'word' that contains "human" is "Homo sapiens (human)". The search embl-org:homo would match, as the word "Homo" occurs in its own field at the end of the OC lines.
The definition of a 'word' is much more intuitive in DES
searches: a 'word' is bounded by spaces and other non-alphanumeric characters. Words start with a letter or number, and end with a letter or number. SRS typically does the same, but allows a single quote at the end. This catches words such as 3' and 5' but is a problem with some quoted text.
Therefore embl-des:fau and embl-des:sapiens match. "H.sapiens" is not a word: it is split into the words 'H' and 'sapiens' because the dot (.) is not an alphanumeric character. Phrases don't work for the DES field; it is word based, so the search embl-des:"fau mRNA" will fail.
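The two definitions of a 'word' can be compared with a short Python sketch. It only imitates the rules described in the text and is not the EMBOSS indexing code: the function names are hypothetical, the trailing full stop of a field is assumed to be dropped, and the OS line is assumed to be indexed as a single ORG word alongside the OC sections.

```python
import re
from fnmatch import fnmatchcase
from typing import Set


def des_words(description: str) -> Set[str]:
    """DES: a word is a run of letters and digits (lower-cased here)."""
    return {word.lower() for word in re.findall(r"[A-Za-z0-9]+", description)}


def phrase_words(field_text: str) -> Set[str]:
    """KEY/ORG: a 'word' is the whole text between semicolons."""
    sections = field_text.rstrip(". \n").split(";")
    return {s.strip().lower() for s in sections if s.strip()}


def matches(query: str, words: Set[str]) -> bool:
    """Case-insensitive match of a (possibly wildcarded) query word."""
    return any(fnmatchcase(word, query.lower()) for word in words)


# Field values taken from the X65923 excerpt shown earlier.
des = des_words("H.sapiens fau mRNA")
key = phrase_words("fau gene.")
org = phrase_words("Homo sapiens (human)") | phrase_words(
    "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; "
    "Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; "
    "Catarrhini; Hominidae; Homo."
)
```

With these sets, matches("fau*", key) succeeds while matches("fau", key) does not, and matches("fau mRNA", des) fails because DES is indexed word by word.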
Sequence versions are formed from the accession number followed by a full stop ('.') and then the number of releases there have been of this sequence (e.g. X65923.1). This makes it possible to find the current version of any sequence and to find the SV of all previous versions. Further, a sequence may be unambiguously identified by the sequence version, for example: embl-sv:X65923.1
Care is needed however. In February 1999, everything in DDBJ/EMBL/GenBank was assigned version 1, even if it was the 1st or 10th version for a given sequence. Consider the entry below:
ID AC000003; SV 1; linear; genomic DNA; STD; HUM; 122228 BP.
XX
AC AC000003;
XX
DT 01-OCT-1996 (Rel. 49, Created)
DT 07-MAR-2000 (Rel. 63, Last updated, Version 6)
XX
DE Homo sapiens chromosome 17, clone 104H12, complete sequence.
XX
KW HTG.
XX
The entry AC000003 shows version 1, but is really the third sequence version (3rd gi) for that record (see http://www.ncbi.nlm.nih.gov:80/entrez/sutils/girevhist.cgi?val=AC000003). Rather confusingly, the version on the DT line has nothing to do with the sequence version (SV).
If, after Feb 1999, the author had updated the sequence of AC000003, then the new one would be version 2 (AC000003.2); it is a lot easier for a human to track sequence version changes when the number increases incrementally. Bear in mind that just because you are looking at SV X00001.1 it doesn't mean you have the first version that was ever in the databases (DDBJ, EMBL, GenBank).
Both sequence version identifiers and GI numbers (see below) share the sv field in USAs.
GI numbers are assigned to entries in GenBank and other sequence databases originating from the NCBI. They are an integer key for identifying the entry version. For example:
VERSION  AF181452.1  GI:6017929
         ^^^^^^^^^^  ^^^^^^^^^^
         Compound    NCBI GI
         Accession   Identifier
         Number
The NCBI GI identifier on the VERSION line serves as a method for identifying the sequence data that has existed for a database entry over time. GI identifiers are numeric values of one or more digits. Since they are integer keys they are less human-friendly than the accession version system described above. If the sequence changes, a new integer GI will be assigned.
A sequence may be unambiguously identified by the GI number, for example: genbank-sv:6017929.
Two methods for identifying the version of the sequence associated with a database entry are used because:
Some data sources processed by NCBI for incorporation into its Entrez sequence retrieval system do not version their own sequences.
GIs provide a uniform integer identifier system for every sequence NCBI has processed. Some products and systems derived from (or reliant upon) NCBI products and services prefer to use these integer identifiers because they can all be processed in the same manner.
Both sequence version identifiers (see above) and GI numbers share the sv field in USAs.
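Because the sv field accepts both kinds of identifier, a script that builds such queries may want to tell them apart. The following Python sketch does so purely from the shape of the text, assuming a GI number is all digits and a sequence version is an accession followed by a full stop and a number; the function name classify_sv and the accession pattern are illustrative assumptions, not EMBOSS rules.

```python
import re


def classify_sv(word: str) -> str:
    """Guess whether an sv query word is a GI number or a sequence version."""
    if re.fullmatch(r"[0-9]+", word):
        return "gi"                    # e.g. 6017929
    if re.fullmatch(r"[A-Za-z0-9_]+\.[0-9]+", word):
        return "sequence version"      # e.g. X65923.1
    return "unknown"
```

For example, classify_sv("6017929") returns "gi" and classify_sv("X65923.1") returns "sequence version".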
The start and end of the sequence are specified by appending [start:end] to the end of the USA. For example:
    myfile.fasta[20:45]
specifies the sequences in the file myfile.fasta starting at position 20 and ending at position 45.
If the 'start' or 'end' position is given as a negative number, then the position is counted from the end of the sequence. For example:
    myfile.fasta[-10:-1]
specifies the last 10 residues.
If [start:end:r] is given at the end of the USA, then nucleotide sequences are reverse-complemented. For example:
    myfile.fasta[1:-1:r]
is the whole sequence reverse-complemented.
Zeros can be used to denote the start and end of the complete sequence. For example, the entire sequence may be specified by:
    myfile.fasta[0:0]
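The position rules above (positive values count from the start, negative values from the end, zero means the default) can be worked through in a short Python sketch. The function names are hypothetical and the code only models the arithmetic described in the text; it accepts r as the reverse flag and does not attempt to handle every USA form.

```python
import re
from typing import Tuple

# Matches a trailing [start:end] or [start:end:r] specifier.
_SPEC = re.compile(r"\[\s*(-?\d+)\s*:\s*(-?\d+)\s*(?::\s*(r)\s*)?\]$", re.IGNORECASE)


def split_subsequence(usa: str) -> Tuple[str, int, int, bool]:
    """Return (usa_body, start, end, reverse); 0 stands for the default."""
    match = _SPEC.search(usa)
    if match is None:
        return usa, 0, 0, False
    body = usa[: match.start()]
    reverse = match.group(3) is not None
    return body, int(match.group(1)), int(match.group(2)), reverse


def resolve_positions(start: int, end: int, length: int) -> Tuple[int, int]:
    """Convert USA positions to 1-based positions in a sequence of `length`."""
    def resolve(position: int, default: int) -> int:
        if position == 0:
            return default                 # zero stands for the default value
        if position < 0:
            return length + position + 1   # negative counts from the end
        return position

    return resolve(start, 1), resolve(end, length)
```

For a sequence of 100 residues, [-10:-1] resolves to positions 91 to 100 (the last 10 residues) and [0:0] resolves to positions 1 to 100.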
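The following are valid USAs for sequences:
    asis::Sequence
    @ListFileName
    list::ListFileName
    FileName
    FileName:Entry
    FileName:SearchField:Word
    DatabaseName
    DatabaseName:Entry
    DatabaseName-SearchField:Word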
Each of the above can have [start : end] or [start : end : reverse] appended to it.
The FileName and DatabaseName forms of USA can have format:: in front of them to specify the format, although this is not normally necessary. Some examples are shown in the table below.
Type | Example | Description |
---|---|---|
FileName | xxx.seq | A sequence file xxx.seq in any format |
Format::FileName | fasta::xxx.seq | A sequence file xxx.seq in FASTA format |
DatabaseName:Entry | embl:X13776 | EMBL entry X13776, using whatever access method is defined locally for the EMBL database and searching by accession number and entry name (X13776 is the accession number in this case) |
DatabaseName-SearchField:Word | embl-acc:X13776 | EMBL entry X13776, using whatever access method is defined locally for the EMBL database and searching by accession number only |
DatabaseName-SearchField:Word | embl-id:X13776 | EMBL entry X13776, using whatever access method is defined locally for the EMBL database and searching by ID only |
DatabaseName-SearchField:Word | embl-des:lectin | EMBL entries containing the word 'lectin' in the 'Description' line |
DatabaseName-SearchField:Word | embl-org:*human* | EMBL entries containing the wildcarded word 'human' in the 'Organism' fields |
DatabaseName:Entry | embl:X1377* | EMBL entries with the prefix X1377, usually in alphabetical order, using whatever access method is defined locally for the EMBL database |
DatabaseName | embl or EMBL:* | All sequences in the EMBL database |
@ListFileName | @mylist | Reads file mylist and uses each line as a separate USA. Listfiles can contain references to other listfiles or any other standard USA. |
list:ListFileName | list:mylist | Same as @mylist above |
ProgramName ProgramParameters \| | 'getz -e [embl-id:X13776] \|' | The trailing pipe character causes EMBOSS to fire up getz (the SRS sequence retrieval program) to extract entry X13776 from EMBL in EMBL format. Any application or script which writes one or more sequences to stdout can be used in this way. |
asis::Sequence | asis::atacgcagttatctgaccat | For specifying literal sequences on the command line. |