4.1. General Database Configuration

4.1.1. Sequence Database Support

EMBOSS supports all the common sequence formats you are likely to come across (see the EMBOSS Users Guide).

A variety of indexing and access methods are supported. For example, EMBL entries can be read from:

A non-indexed EMBL-format flatfile held locally
Original EMBL flatfiles using the CD-ROM, Staden or EMBOSS indexes
Original EMBL flatfiles using local SRS indexes
A file indexed for use with BLAST version 2 indexes
GCG database format
A query to the EMBL-EBI DBFETCH service
A query to the EMBL-EBI web server
A query to the Entrez web server
A query to any MRS web server
A query to any SRS web server (local or remote)
A relational database such as Sybase or Oracle, accessed by calling a local application

Databases can be held locally, and both indexed and non-indexed local files are supported. Two sets of tools for database indexing (Section 4.5, “Database Indexing”) are provided: the dbi* applications are a variation on the emblcd system, whereas the dbx* applications use an updatable tree. Both provide rapid access to single sequences and rapid queries of flat file databases. The dbi* indexing applications assume that each record has one or both of an ID and an accession number and that these are unique across the whole database index, whereas the dbx* applications can handle non-unique (duplicate) IDs and source files larger than 2 GB. Use of the dbx* indexing applications is preferred.

EMBOSS also provides methods for retrieving sequences via the WWW. If sequences on a server are in a format unknown to EMBOSS, it might be possible to request that they be converted to FASTA format before they are served. There are three methods for interaction with a local SRS installation or SRS on a remote public server. SRS queries can be made not only by ID and accession number, but also (depending on the way a database has been indexed) on words in the description line, sequence version (or GI numbers), keywords or organism names.

Specialised access methods are provided for databases served by MRS, NCBI's Entrez and EMBL-EBI's dbfetch servers.

For more general access through web servers, the url access method allows a database to be defined as a URL into which a user-specified ID is inserted.
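As a rough illustration of the idea, the substitution can be pictured as filling a placeholder in a URL template. The sketch below is Python, not EMBOSS code. The %s placeholder convention, the build_entry_url helper and the example.org address are all assumptions made for illustration; check Section 4.3 for the syntax the url method really expects.

```python
from urllib.parse import quote


def build_entry_url(template: str, entry_id: str) -> str:
    """Insert a user-specified ID into a URL template (hypothetical helper)."""
    # Assumption: the template marks the insertion point with a single %s.
    if template.count("%s") != 1:
        raise ValueError("template must contain exactly one %s placeholder")
    # Percent-encode the ID so that unusual characters cannot break the URL.
    return template.replace("%s", quote(entry_id, safe=""))


# Hypothetical server address, used only to show the substitution.
url = build_entry_url("http://example.org/fetch?db=embl&id=%s", "X65923")
print(url)
```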

For other non-flatfile databases or flat file databases in formats not currently supported by EMBOSS, it is possible to configure an external application to retrieve sequences.

4.1.1.1. Query Levels, Access Methods and Attributes

There are three basic levels of query:

entry: A single entry specified by database ID or accession number is retrieved.
query: One or more entries matching a wildcard string in the Uniform Sequence Address (USA, see the EMBOSS Users Guide) are retrieved (this can be slow for some methods).
all: All entries are read sequentially from a database.

One or more query levels may be specified for each database configuration.

There are many methods (Section 4.3, “Database Access Methods”) for accessing databases. The available methods depend on the query level, i.e. whether a single entry, a wildcard-specified set of entries or all of the database entries are to be retrieved. For example, a web server might be suitable for retrieving a single entry or a few entries but probably, quite sensibly, will not allow an entire database to be retrieved over the Internet. In contrast, a flat file database with no index is often (depending upon its size) only useful for reading all the entries sequentially ('all' retrieval level).

A database can be defined with a single retrieval method using the method attribute. Alternatively, multiple methods may be defined, depending on which type (entry, query, all) of access is required. The attributes methodentry, methodquery and methodall are used for this. This would be essential in the cases described above, to access the database in the different locations.
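The relationship between the general method attribute and the per-level attributes can be pictured with a small Python sketch. It is not how EMBOSS itself is implemented: the method_for helper is hypothetical, and the rule that a level-specific attribute takes precedence over method is an assumption made for the illustration.

```python
QUERY_LEVELS = ("entry", "query", "all")


def method_for(definition: dict, level: str):
    """Return the access method configured for one query level, or None."""
    if level not in QUERY_LEVELS:
        raise ValueError(f"unknown query level: {level}")
    # Assumption: methodentry/methodquery/methodall override plain 'method'.
    return definition.get("method" + level, definition.get("method"))


# A database read from a web server for single entries and from local
# flat files when every entry is wanted; no method is set for 'query'.
mixed = {
    "type": "N",
    "format": "embl",
    "methodentry": "srswww",
    "methodall": "direct",
}

for level in QUERY_LEVELS:
    print(level, method_for(mixed, level))
```

In this sketch the query level has no method at all, which corresponds to a database that cannot be searched with wildcards.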

In addition, each access method needs to know something about the database. What is needed will be different for each method, although there is, of course, much overlap between them. This information is specified by using the 'key: value' attributes. The required attributes depend on the access method and the query level.

Database key: value attributes and access methods (Section 4.3, “Database Access Methods”) are described below.

4.1.2. Configuring EMBOSS to work with Databases

Every database you intend to use must be defined in one of the EMBOSS configuration files:

emboss.default

.embossrc

emboss.default is kept in the top-level EMBOSS directory (e.g. /usr/local/emboss/share/EMBOSS/emboss.default) and is used for defining site-wide databases. In contrast, .embossrc lives in your home directory and is used for defining your own databases or, for example, testing database definitions before adding them to the site-wide emboss.default file.

Each database is configured using a database definition. The generalised form is:

DBNAME DatabaseName 
[
    key: value  
    key: value
    key: value
    key: value
]

DBNAME, which is usually shortened to DB, is followed by the database name (DatabaseName) then a set of key: value attributes that specify that database. The key: value attributes are all enclosed by a pair of square brackets.

The key: value pairs are the configuration options and must contain:

A description of the access method (using method:) or one or more of:
methodentry:
methodquery:
methodall:
A description of the original format of the sequences (using format:).

Additional key: value pairs might be required depending on the access methods. Others are optional.
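The two requirements above can be checked mechanically before a definition is handed to EMBOSS. The following Python sketch is one way to do that under simplifying assumptions: parse_definitions and missing_required are hypothetical helpers, the regular expressions cover only the layouts shown in this section, and a ] inside a quoted value is not handled. It does not reproduce the parser EMBOSS uses.

```python
import re

# 'DB name [ ... ]' or 'DBNAME name [ ... ]', possibly spread over lines.
DEF_RE = re.compile(r"^\s*(?:DBNAME|DB)\s+(\w+)\s*\[(.*?)\]", re.S | re.M)
# 'key: value' where the value is either quoted or a single bare word.
ATTR_RE = re.compile(r'(\w+):\s*("[^"]*"|\S+)')


def parse_definitions(text: str) -> dict:
    """Map each database name to a dict of its key: value attributes."""
    # Drop comment lines before looking for definitions.
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith("#")]
    databases = {}
    for name, body in DEF_RE.findall("\n".join(lines)):
        databases[name] = {k: v.strip('"') for k, v in ATTR_RE.findall(body)}
    return databases


def missing_required(attrs: dict) -> list:
    """List the mandatory items (access method, format) that are absent."""
    missing = []
    methods = ("method", "methodentry", "methodquery", "methodall")
    if not any(key in attrs for key in methods):
        missing.append("method")
    if "format" not in attrs:
        missing.append("format")
    return missing


sample = """
# a complete definition and an incomplete one
DB embl
[
type:    "N"
method:  "direct"
format:  "embl"
]
DB broken [ type: P dir: /tmp/swiss ]
"""

for name, attrs in parse_definitions(sample).items():
    print(name, missing_required(attrs))
```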

As an illustration, to set up direct access to the EMBL and SwissProt test databases distributed with EMBOSS, your emboss.default or .embossrc file should look something like this:

DB embl 
[ 
type:    "N" 
method:  "direct"
format:  "embl" 
dir:     "/home/auser/EMBOSS-6.2.0/test/embl/"
file:    "*.dat" 
comment: "Test EMBL in EMBOSS distribution" 
]


DB swissprot
[ 
type:    "P"
method:  "direct"
format:  "swiss" 
dir:     "/home/auser/EMBOSS-6.2.0/test/swiss/"
file:    "seq.dat" 
comment: "Test Swissprot in EMBOSS distribution" 
]

Or, to set up access to the EMBL and SwissProt databases via SRS at the EMBL-EBI, your emboss.default or .embossrc file should look like this:

DB swissprot 
[ 
type:    "P" 
method:  "srswww" 
format:  "swiss"
url:     "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
comment: "Swissprot via EBI SRS" 
]

DB embl 
[ 
type:    "N" 
method:  "srswww" 
format:  "embl"
url:     "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
comment: "EMBL via EBI SRS" 
]

4.1.3. Example Database Definition File (emboss.default.template)

An emboss.default.template file is provided in the EMBOSS distribution. As its name suggests, it gives examples of some of the possible database definitions supported by EMBOSS (see the next section). An excerpt of the emboss.default.template file is shown below:

#SET emboss_tempdata path_to_directory_$EMBOSS/test

# Logfile - set this to a file that any user can append to
# and EMBOSS applications will automatically write log information

#SET emboss_logfile /packages/emboss/emboss/log
 
# pir (cytochrome C plus first entries in other divisions)
# ===

DB tpir [ 
    type: P 
    dir: $emboss_tempdata/pir
    method: gcg
    file: pir*.seq
    format: nbrf
    fields: "des org key"
    comment: "PIR in 4 files in GCG format indexed by dbigcg" 
]

# Genbank (Remote access to an MRS server)
# =======

DB genbank [
    type: N
    methodentry: mrs3
    format: genbank
    dbalias: "genbank_release"
    url: "http://mrs.cmbi.ru.nl/mrs-3/plain.do"
    comment: "GenBank IDs via MRS"
]

# genbank (the first few entries from several sub-section files)
# =======

DB tgenbank [ 
    type: N 
    dir: $emboss_tempdata/genbank
    method: emblcd 
    format: genbank 
    release: 01
    fields: "sv des org key"
    comment: "GenBank native format indexed by dbiflat" 
]

4.1.4. Test Databases

To see how databases are set up under EMBOSS, you should look at the configurations for the test databases included in the EMBOSS distribution. The EMBOSS developers use these databases to test database indexing and sequence reading. They also contain the sequences that are used in the usage examples for the applications (see the application documentation online or by running tfm). They include:

test/data (emrod (DNA) and swnew (protein) are in BLAST format)
test/embl (*.dat for EMBL format, .ref and .seq for gcg format)
test/pir (.ref and .seq for nbrf format)
test/swiss (.dat for swissprot format, 1 file)
test/swnew (.dat for swissprot format, 3 files)
test/wormpep (wormpep is in FASTA and BLAST format)

The template file (emboss.default.template) in the EMBOSS distribution (e.g. /usr/local/emboss/share/EMBOSS/emboss.default.template) contains configurations for all the test databases. You can use emboss.default.template as a template for entries in your own emboss.default file. For any database definitions you use, change the definition of emboss_tempdata to point to your test directory and uncomment the line. You'll then be able to use the test databases as "tembl", "tsw" and so on.

One of the first things an EMBOSS application does when it runs is to read in the installed emboss.default (and then the ~/.embossrc file, if it exists). This means that any changes to these definition files take effect the next time an application is run.

For example, change:

# swissprot (Puffer fish entries)
# =========

DB tsw [ type: P dir: $emboss_tempdata/swiss
   method: emblcd format: swiss release: 36
   fields: "sv des org key"
   comment: "Swissprot native format with EMBL CD-ROM index" ]

to:

# swissprot (Puffer fish entries)
# =========

DB tsw [ type: P dir: /home/auser/EMBOSS-6.2.0/test/swiss
   method: emblcd format: swiss release: 36
   fields: "sv des org key"
   comment: "Swissprot native format with EMBL CD-ROM index" ]

Alternatively, to get all the test databases supported, rename or copy emboss.default.template to emboss.default and edit the file as follows. This line:

# SET emboss_tempdata path_to_directory_$EMBOSS/test

must be uncommented and its value changed to the directory where the test databases are installed (/usr/local/share/EMBOSS/test in the first line of the example below):

SET emboss_tempdata /usr/local/share/EMBOSS/test
# or
SET emboss_tempdata /home/auser/workspace/emboss/emboss/test/
# or something else

Note

The directory where the test databases are installed can be changed with --prefix when you configure EMBOSS.

4.1.5. Testing your Database Definitions

Having defined your databases (see Section 4.1, “General Database Configuration”), run showdb -full; every database should appear in the list. If the message Warning: Bad database definition is generated, or if a database does not appear, then something is wrong with its definition. Go back and check it. Common mistakes include:

Have you left off the terminal square bracket ] ?
Did you leave out a colon character : in an attribute?
Have you forgotten to put in the closing quotes around some text?
Is the emboss.default file world-readable?
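The first three mistakes in this list are purely syntactic, so a short script can look for them before you run showdb. The Python sketch below is a minimal illustration and not part of EMBOSS: lint_definitions is a hypothetical helper, it inspects only the first attribute on each line, and it does not check file permissions.

```python
import re


def lint_definitions(text: str) -> list:
    """Return (line_number, message) pairs for a few easy-to-spot mistakes."""
    problems = []
    inside_definition = False
    opened_at = 0
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.count('"') % 2:
            problems.append((number, "unbalanced double quotes"))
        # Blank out quoted text so its contents are not mistaken for syntax.
        bare = re.sub(r'"[^"]*"?', '""', line)
        if "[" in bare:
            if inside_definition:
                problems.append((opened_at, "definition is never closed with ']'"))
            inside_definition, opened_at = True, number
        # Text between the brackets on this line (the whole line if none).
        attributes = bare.split("[", 1)[-1].split("]", 1)[0].strip()
        if inside_definition and attributes and not re.match(r"\w+:", attributes):
            problems.append((number, "attribute without ':'"))
        if "]" in bare:
            if not inside_definition:
                problems.append((number, "']' without a matching '['"))
            inside_definition = False
    if inside_definition:
        problems.append((opened_at, "definition is never closed with ']'"))
    return problems


# Deliberately broken input: a missing colon, an unclosed quote and a
# definition with no terminal square bracket.
sample = "\n".join([
    'DB embl',
    '[',
    'type:    "N"',
    'method   "direct"',
    'format:  "embl',
    ']',
    'DB tsw [ type: P',
])

for number, message in lint_definitions(sample):
    print(f"line {number}: {message}")
```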

If showdb displays your database, check that all of your required access methods are listed as OK. If something is not OK then another access method might be required.

Just because showdb finds a database definition does not mean the database is working correctly: showdb does not attempt to extract any entries from your database. Therefore you should try extracting one or more known entries from the database using seqret. If you get errors, you should check that the database is set up correctly and defined correctly. Things to check include:

Are the data files and indexes world-readable?
If using method: emblcd, gcg, blast or emboss did you index the data files?
If using app: is the application in your PATH?
If using app: is the PATH specified correctly?
If using app: is the application world-executable?
If using url: or srswww is the server up?
If using url: or srswww is the server URL correct?
Are file: wildcards specified correctly?
Are directory: paths specified correctly?
Have you put the files there yet?
If using any SRS method, did you use dbalias:?
If using any SRS method, check the dbalias: name in the SRS server.
If accessing by SV (GI), DES, KEY or ORG, did you remember to specify these when you indexed the database?
If accessing by SV (GI), DES, KEY or ORG, did you specify fields:?
Take another look at the format. Is that really fasta, or is it ncbi?
Do you have duplicate entries? The dbi* program indices must have unique entry names.
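For the last point, duplicate entry names in an EMBL or SwissProt format flat file can be found by counting the names on the ID lines before indexing. The Python sketch below is a minimal illustration under that assumption: duplicate_ids is a hypothetical helper, the sample records are abbreviated, and other formats would need a different rule for locating the entry name.

```python
from collections import Counter


def duplicate_ids(lines) -> dict:
    """Count entry names on 'ID' lines and return those seen more than once."""
    counts = Counter(
        # The entry name is the second field; newer EMBL releases end it
        # with a semicolon, which is stripped here.
        line.split()[1].rstrip(";")
        for line in lines
        if line.startswith("ID ") and len(line.split()) > 1
    )
    return {name: n for name, n in counts.items() if n > 1}


# Abbreviated records: only the ID line and the terminator are shown.
sample = [
    "ID   X65923; SV 1; linear; mRNA; STD; HUM; 518 BP.",
    "//",
    "ID   X65921; SV 1; linear; genomic DNA; STD; HUM; 2016 BP.",
    "//",
    "ID   X65923; SV 1; linear; mRNA; STD; HUM; 518 BP.",
    "//",
]

print(duplicate_ids(sample))
```

To check a real file, pass an open file handle instead of the sample list, since the function only needs an iterable of lines.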