EMBOSS provides excellent database support. All the common sequence formats you are likely to come across are supported. See the EMBOSS Users Guide.
A variety of indexing and access methods are supported. For example, EMBL entries can be read from:
A non-indexed EMBL-format flatfile held locally.
Original EMBL flatfiles using the CD-ROM, Staden or EMBOSS indexes
Original EMBL flatfiles using local SRS indexes
A file indexed for use with BLAST version 2 indexes
GCG database format
A query to the EMBL-EBI DBFETCH service
A query to the EMBL-EBI web server
A query to the Entrez web server
A query to any MRS web server
A query to any SRS web server (local or remote)
A relational database such as Sybase or Oracle by calling a local application.
Databases can be held locally, and both indexed and non-indexed local files are supported. Tools for database indexing (Section 4.5, “Database Indexing”) are provided. One is a variation on the emblcd system, the other uses an updatable tree. They provide rapid access to single sequences and rapid queries of flat file databases. The dbi* indexing applications assume that you have one or both of ID and accession number in each record and that they are unique for the whole database index, whereas the dbx* applications can handle non-unique (duplicate) IDs and source files >2Gb in size. Use of the dbx* indexing applications is preferred.
EMBOSS also provides methods for retrieving sequences via the WWW. If sequences on a server are in a format unknown to EMBOSS, it might be possible to specify that they be converted to FASTA format before they are served. There are three methods for interaction with a local SRS installation or SRS on a remote public server. SRS queries can be made not only by ID and accession number, but also (depending on the way a database has been indexed) on words in the description line, sequence version (or GI numbers), keywords or organism names.
Specialised access methods are provided for databases served by MRS, NCBI's entrez and EMBL-EBI's dbfetch servers.
For more general access through web servers, the url access method allows a database to be defined as a URL into which a user-specified ID is inserted.
For other non-flatfile databases or flat file databases in formats not currently supported by EMBOSS, it is possible to configure an external application to retrieve sequences.
There are three basic levels of query:
A single entry specified by database ID or accession number is retrieved.
One or more entries matching a wildcard string in the Uniform Sequence Address (USA, see the EMBOSS Users Guide) are retrieved (this can be slow for some methods).
All entries are read sequentially from a database.
One or more query levels may be specified for each database configuration.
There are many methods (Section 4.3, “Database Access Methods”) for accessing databases. The available methods depend on the query level: i.e. whether a single entry, a wildcard-specified set of entries or all of the database entries are to be retrieved. For example, a web server might be suitable for retrieving a single or few entries but probably, quite sensibly, will not allow an entire database to be retrieved over the Internet. In contrast, a flat file database with no index is often (depending upon its size) only useful for reading all the entries sequentially ('all' retrieval level).
A database can be defined with a single retrieval method using the method attribute. Alternatively, multiple methods may be defined, depending on which type (entry, query, all) of access is required. The attributes methodentry, methodquery and methodall are used for this. This would be essential in the cases described above, to access the database in the different locations.
In addition, each access method needs to know something about the database. What is needed will be different for each method, although there is, of course, much overlap between them. This information is specified by using the 'key: value' attributes. The required attributes depend on the access method and the query level.
Database key: value attributes and access methods (Section 4.3, “Database Access Methods”) are described below.
Every database you intend to use must be defined in one of the EMBOSS configuration files:
emboss.default
.embossrc
emboss.default is kept in the top-level EMBOSS directory (e.g. /usr/local/emboss/share/EMBOSS/emboss.default) and is used for defining site-wide databases. In contrast, .embossrc lives in your home directory and is used for defining your own databases or, for example, testing database definitions before adding them to the site-wide emboss.default file.
Each database is configured using a database definition. The generalised form is:
    DBNAME DatabaseName [
       key: value
       key: value
       key: value
       key: value
    ]
DBNAME, which is usually shortened to DB, is followed by the database name (DatabaseName) then a set of key: value attributes that specify that database. The key: value attributes are all enclosed by a pair of square brackets.
The key: value pairs are the configuration options and must contain:
A description of the access method (using method:) or one or more of:
methodentry:
methodquery:
methodall:
A description of the original format of the sequences (using format:).
Additional key: value pairs might be required depending on the access methods. Others are optional.
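The structure described above can be made concrete with a short Python sketch. It is not part of EMBOSS and does not reproduce how EMBOSS itself reads its configuration files; the function names parse_db_definitions and missing_required are hypothetical, and the sketch assumes that values containing spaces are enclosed in double quotes and that no value contains a closing square bracket.

```python
import re

# Attribute names taken from the text above; any one of them names an access method.
ACCESS_KEYS = ("method", "methodentry", "methodquery", "methodall")

# key: "quoted value"  or  key: bare_value
ATTRIBUTE = re.compile(r'(\w+)\s*:\s*(?:"([^"]*)"|(\S+))')
# DB name [ ... ]  or  DBNAME name [ ... ]
DEFINITION = re.compile(r"\bDB(?:NAME)?\s+(\S+)\s*\[(.*?)\]", re.S)


def parse_db_definitions(text):
    """Return {database name: {key: value}} for every DB block in text."""
    # Ignore whole-line comments before looking for definitions.
    code = "\n".join(
        line for line in text.splitlines() if not line.lstrip().startswith("#")
    )
    databases = {}
    for name, body in DEFINITION.findall(code):
        attrs = {}
        for match in ATTRIBUTE.finditer(body):
            quoted, bare = match.group(2), match.group(3)
            attrs[match.group(1).lower()] = quoted if quoted is not None else bare
        databases[name] = attrs
    return databases


def missing_required(attrs):
    """List which of the two mandatory items a definition lacks."""
    missing = []
    if not any(key in attrs for key in ACCESS_KEYS):
        missing.append("access method")
    if "format" not in attrs:
        missing.append("format")
    return missing
```

For instance, parse_db_definitions(open(".embossrc").read()) returns one dictionary per database, and missing_required can then be applied to each to spot definitions that lack an access method or a format.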
As an illustration, to set up direct access to the EMBL and SwissProt test databases distributed with EMBOSS, your emboss.default or .embossrc file should look something like this:
    DB embl [
       type: "N"
       method: "direct"
       format: "embl"
       dir: "/home/auser/EMBOSS-6.2.0/test/embl/"
       file: "*.dat"
       comment: "Test EMBL in EMBOSS distribution"
    ]

    DB swissprot [
       type: "P"
       method: "direct"
       format: "swiss"
       dir: "/home/auser/EMBOSS-6.2.0/test/swiss/"
       file: "seq.dat"
       comment: "Test Swissprot in EMBOSS distribution"
    ]
Or, to set up access to the EMBL and SwissProt databases via SRS at the EMBL-EBI, your emboss.default or .embossrc file should look something like this:
    DB swissprot [
       type: "P"
       method: "srswww"
       format: "swiss"
       url: "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
       comment: "Swissprot via EBI SRS"
    ]

    DB embl [
       type: "N"
       method: "srswww"
       format: "embl"
       url: "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
       comment: "EMBL via EBI SRS"
    ]
An emboss.default.template file is provided in the EMBOSS distribution. As its name suggests, it gives examples of some of the possible database definitions supported by EMBOSS (see the next section). An excerpt of the emboss.default.template file is shown below:
    #SET emboss_tempdata path_to_directory_$EMBOSS/test

    # Logfile - set this to a file that any user can append to
    # and EMBOSS applications will automatically write log information

    #SET emboss_logfile /packages/emboss/emboss/log

    # pir (cytochrome C plus first entries in other divisions)
    # ===

    DB tpir [
       type: P
       dir: $emboss_tempdata/pir
       method: gcg
       file: pir*.seq
       format: nbrf
       fields: "des org key"
       comment: "PIR in 4 files in GCG format indexed by dbigcg"
    ]

    # Genbank (Remote access to an MRS server)
    # =======

    DB genbank [
       type: N
       methodentry: mrs3
       format: genbank
       dbalias: "genbank_release"
       url: "http://mrs.cmbi.ru.nl/mrs-3/plain.do"
       comment: "GenBank IDs via MRS"
    ]

    # genbank (the first few entries from several sub-section files)
    # =======

    DB tgenbank [
       type: N
       dir: $emboss_tempdata/genbank
       method: emblcd
       format: genbank
       release: 01
       fields: "sv des org key"
       comment: "GenBank native format indexed by dbiflat"
    ]
To see how databases are set up under EMBOSS, you should look at the configurations for the test databases included in the EMBOSS distribution. The EMBOSS developers use these databases to test database indexing and sequence reading. They also contain the sequences that are used in the usage examples for the applications (see the application documentation online or by running tfm). They include:
test/data (emrod (DNA) and swnew (protein) are in BLAST format)
test/embl (*.dat for EMBL format, .ref and .seq for gcg format)
test/pir (.ref and .seq for nbrf format)
test/swiss (.dat for swissprot format, 1 file)
test/swnew (.dat for swissprot format, 3 files)
test/wormpep (wormpep is in FASTA and BLAST format)
The template file (emboss.default.template) in the EMBOSS distribution (e.g. /usr/local/emboss/share/EMBOSS/emboss.default.template) contains configurations for all the test databases. You can use emboss.default.template as a template for entries in your own emboss.default file. For any database definitions you use, change the definition of emboss_tempdata to point to your test directory and uncomment the line. You'll then be able to use the test databases as "tembl", "tsw" and so on.
One of the first things an EMBOSS application does when it runs is to read in the installed emboss.default (and then the ~/.embossrc file, if it exists). This means that any changes to these definition files take effect as soon as they are made.
For example, change:
    # swissprot (Puffer fish entries)
    # =========

    DB tsw [
       type: P
       dir: $emboss_tempdata/swiss
       method: emblcd
       format: swiss
       release: 36
       fields: "sv des org key"
       comment: "Swissprot native format with EMBL CD-ROM index"
    ]
to
    # swissprot (Puffer fish entries)
    # =========

    DB tsw [
       type: P
       dir: /home/auser/EMBOSS-6.2.0/test/swiss
       method: emblcd
       format: swiss
       release: 36
       fields: "sv des org key"
       comment: "Swissprot native format with EMBL CD-ROM index"
    ]
Alternatively, to get all the test databases supported, rename or copy emboss.default.template to emboss.default and edit the file as follows. This line:
# SET emboss_tempdata path_to_directory_$EMBOSS/test
must be uncommented and its definition changed to the directory where the test databases are installed, such as /usr/local/share/EMBOSS/test. For example:
    SET emboss_tempdata /usr/local/share/EMBOSS/test
    # or
    SET emboss_tempdata /home/auser/workspace/emboss/emboss/test/
    # or something else
The directory where the test databases are installed can be changed with --prefix when you configure EMBOSS.
Having defined your databases (see Section 4.1, “General Database Configuration”), you can run showdb -full and you should see them all appear in the list of databases. If the message Warning: Bad database definition is generated, or if a database doesn't appear, then something is seriously wrong with your definition. Go back to it and check things. Common mistakes include:
Have you left off the terminal square bracket ]?
Did you leave out a colon character : in an attribute?
Have you forgotten to put in the closing quotes around some text?
Is the emboss.default file world-readable?
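The checks in this list are mechanical enough to automate before running showdb. The following Python sketch is one way to do so; it is not an EMBOSS tool, and the names lint_definitions and is_world_readable are hypothetical. It assumes attributes are written as "key: value" with a space after the colon, that multi-word values are enclosed in double quotes, and that whole-line comments start with #.

```python
import os
import re
import stat


def lint_definitions(text):
    """Return a list of likely mistakes in EMBOSS database definitions."""
    # Whole-line comments are ignored so brackets or quotes in them do not count.
    code = "\n".join(
        line for line in text.splitlines() if not line.lstrip().startswith("#")
    )
    problems = []
    if code.count("[") != code.count("]"):
        problems.append("unbalanced square brackets")
    if code.count('"') % 2:
        problems.append("unclosed double quote")
    if problems:
        # The colon check below is unreliable once brackets or quotes are broken.
        return problems
    for name, body in re.findall(r"\bDB(?:NAME)?\s+(\S+)\s*\[([^\]]*)\]", code):
        # Collapse each quoted value to one token, then expect key/value pairs.
        tokens = re.sub(r'"[^"]*"', "QUOTED", body).split()
        for key in tokens[0::2]:
            if not key.endswith(":"):
                problems.append(f"{name}: attribute '{key}' has no colon")
    return problems


def is_world_readable(path):
    """True if the 'other' read permission bit is set on path (POSIX)."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)
```

An empty list from lint_definitions means only that none of these particular mistakes was found; showdb -full remains the authoritative check.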
If showdb displays your database, check that all of your required access methods are listed as OK. If something is not OK then another access method might be required.
Just because showdb finds a database definition does not mean the database is working correctly: showdb does not attempt to extract any entries from your database. Therefore you should try extracting one or more known entries from the database using seqret. If you get errors, you should check that the database is set up correctly and defined correctly. Things to check include:
Are the data files and indexes world-readable?
If using method: emblcd, gcg, blast or emboss, did you index the data files?
If using app:, is the application in your PATH?
If using app:, is the PATH specified correctly?
If using app:, is the application world-executable?
If using url: or srswww, is the server up?
If using url: or srswww, is the server URL correct?
Are file: wildcards specified correctly?
Are directory: paths specified correctly?
Have you put the files there yet?
If using any SRS method, did you use dbalias:?
If using any SRS method, check the dbalias: name in the SRS server.
If accessing by SV (GI), DES, KEY or ORG, did you remember to specify these when you indexed the database?
If accessing by SV (GI), DES, KEY or ORG, did you specify fields:?
Take another look at the format. Is that really fasta, or is it ncbi?
Do you have duplicate entries? The dbi* program indices must have unique entry names.
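The last check, for duplicate entry names, can be done with a short script before indexing. The Python sketch below is an illustration only and is not an EMBOSS application; the function name duplicate_entry_names is hypothetical. It assumes EMBL or SwissProt flat file format, where each entry starts with an ID line whose first field is the entry name (followed by a semicolon in current EMBL files).

```python
from collections import Counter


def duplicate_entry_names(lines):
    """Return the sorted entry names that occur on more than one ID line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # An ID line looks like "ID   X13776; SV 1; ..." or "ID   AMIC_PSEAE   Reviewed; ..."
        if len(fields) >= 2 and fields[0] == "ID":
            counts[fields[1].rstrip(";")] += 1
    return sorted(name for name, seen in counts.items() if seen > 1)
```

Because the function accepts any iterable of lines, an open file can be passed directly, for example duplicate_entry_names(open("seq.dat")). Names it reports would need to be made unique, or the database indexed with the dbx* applications instead.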