To gain experience in database indexing under EMBOSS, you can practice with the example databases included in the EMBOSS distribution. These include:
test/data
test/embl
test/pir
test/swiss
test/swnew
test/wormpep
You can reindex these files using the dbx* or the dbi* programs.
The dbx* applications are preferred.
The dbx* programs require two variables to be set in the emboss.default
file and at least one Resource Definition to be present. In contrast the dbi* programs do not require these definitions.
For example:
SET PAGESIZE 2048
SET CACHESIZE 200

RES embl [
   type: Index
   idlen: 15
   acclen: 15
   svlen: 15
   keylen: 25
   deslen: 25
   orglen: 25
]
The dbx* applications buffer disc pages in order to improve performance. The PAGESIZE
should usually be set to the size, in bytes, that your operating system uses to buffer disc pages, though the value is not critical. The CACHESIZE
should be set to the number of such pages that you wish to be cached. The values of 2048 and 200 given above are good general purpose ones. We recommend a CACHESIZE
greater than 100.
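As a rough check on these settings, the Python sketch below reports the block size of the filesystem holding the database and the approximate memory implied by a given PAGESIZE and CACHESIZE. It assumes a POSIX system (os.statvfs is not available on Windows) and treats the cache as simply PAGESIZE multiplied by CACHESIZE, ignoring any overhead. The function names are illustrative and are not part of EMBOSS.

```python
import os


def filesystem_block_size(path="."):
    """Preferred I/O block size, in bytes, of the filesystem holding path."""
    return os.statvfs(path).f_bsize


def cache_bytes(pagesize, cachesize):
    """Approximate memory needed to cache `cachesize` pages of `pagesize` bytes."""
    return pagesize * cachesize


# Report the values for the current directory and the settings shown above.
print("filesystem block size:", filesystem_block_size("."))
print("cache for PAGESIZE 2048, CACHESIZE 200:", cache_bytes(2048, 200), "bytes")
```

With the example values the cache works out at 409600 bytes, so even a much larger CACHESIZE costs little memory.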
You should have at least one Resource RES
Definition in your emboss.default
file, though we recommend having one per database you wish to index. The dbx* programs will ask for the name of a RES
entry when they run. The definitions have a compulsory type: Index
attribute followed by length attributes for each of the fields that can be indexed. These lengths represent the maximum length of the field before potential truncation occurs. Truncation of ID keys is usually to be avoided as it can lead to duplicate IDs being indexed. It is appropriate to set the idlen
, acclen
and svlen
attributes a little larger than the maximum size field you expect in the source file. Values for keylen
, deslen
and orglen
are more a matter of preference.
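One way to choose idlen and acclen is to measure the longest ID and accession number actually present in the source file. The sketch below does this for EMBL-format text, assuming the usual layout in which the entry name is the first token on the ID line and accession numbers on AC lines are separated by semicolons. The function name is hypothetical and the code is not taken from EMBOSS.

```python
def max_field_lengths(lines):
    """Return (longest ID, longest accession number) found in EMBL-format lines."""
    max_id = 0
    max_acc = 0
    for line in lines:
        if line.startswith("ID   "):
            # The entry name is the first token, optionally followed by ';'.
            tokens = line[5:].split()
            if tokens:
                max_id = max(max_id, len(tokens[0].rstrip(";")))
        elif line.startswith("AC   "):
            # Accession numbers are separated by ';'.
            for acc in line[5:].split(";"):
                acc = acc.strip()
                if acc:
                    max_acc = max(max_acc, len(acc))
    return max_id, max_acc


# Small in-line example; for a real file, pass an open file handle instead.
example = [
    "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.\n",
    "AC   X56734; S46826;\n",
]
print(max_field_lengths(example))
```

Setting the length attributes a little above the reported maxima follows the advice given above.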
Flatfile databases are plain text files in a defined format such as those released by EMBL, GenBank etc. The EMBOSS program dbxflat is used to generate EMBOSS indexes that can be used for all types of database access. The dbiflat application can also be used but cannot cope with large source database files (greater than 2 GB) or with duplicate IDs or ACs.
dbxflat (and the EMBOSS access method) requires the databases to be uncompressed. The examples given here will not probe the deeper secrets of dbxflat (for which the reader is referred to the application documentation, or failing that the source code) but will show a typical installation for a common database.
We assume that EMBOSS has been installed and works. This can be tested with the command:
wossname -auto
which should list all the programs available.
In this example you will index and configure the EMBL database for use with EMBOSS. First download and unpack the EMBL database. This will require a considerable amount of disc space. If you do not have sufficient space available then just download a subset of the database. Use cd to move to the directory in which you have unpacked EMBL. Running ls there should show something like this:
% ls
.
rel_est_fun_01_r98.dat
rel_est_fun_02_r98.dat
rel_est_fun_03_r98.dat
.
Output truncated
.
wgs_cabc_pro.dat
wgs_cabd_mam.dat
wgs_cabe_fun.dat
Run dbxflat to create the EMBOSS indices. This assumes you have set up a RES
definition and cache and page sizes as described above.
% dbxflat
Index a flat file database using b+tree indices
Basename for index files: embl
Resource name: embl
      EMBL : EMBL
     SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
        GB : Genbank, DDBJ
    REFSEQ : Refseq
Entry format [SWISS]: EMBL
Wildcard database filename: *.dat
Database directory [.]: .
        id : ID
       acc : Accession number
        sv : Sequence Version and GI
       des : Description
       key : Keywords
       org : Taxonomy
Index fields [id,acc]: id,acc
General log output file [outfile.dbxflat]: embllog.dbxflat
dbxflat should happily chug away for some considerable time (depending on the speed of your machine) and will generate (eventually) the following index files:
% ls
embl.ent
embl.xid
embl.xac
embl.pxid
embl.pxac
embllog.dbxflat
Now create an entry in the EMBOSS configuration files to access the database. It is a good idea to try a new database definition in your local configuration file first. Put the following entry in your .embossrc
:
DB embl [
   type: "Nucleotide"
   method: "emboss"
   format: "embl"
   directory: "$emboss_db_dir/embl"
   filename: "*.dat"
   release: "98.0"
   comment: "EMBL release 98.0"
]
You will have needed to predefine $emboss_db_dir
somewhere in your emboss.default
or .embossrc
using a directive such as:
set emboss_db_dir /path_to_databases
Save .embossrc
and try running showdb
. You should see a line that looks like:
% showdb
.. output deleted
embl   N   OK   OK   OK   EMBL release 98.0
.. output deleted
It can be a good idea to set up subsections of the database so that end-users can search just the regions they wish to search. This section applies to all access methods (Section 4.3, “Database Access Methods”) that use EMBOSS style indexes and to others as well (e.g. EMBLCD).
Files can be included with the declaration:
filename:
or excluded with the declaration:
exclude:
In order to just take the EST files in our EMBL database try the following:
DB emblest [
   type: "Nucleotide"
   method: "emboss"
   format: "embl"
   directory: "$emboss_db_dir/embl"
   filename: "rel_est*.dat"
   release: "98.0"
   comment: "EMBL release 98.0"
]
Files can also be given as a space-separated list enclosed in quotes. For example, to set up a database of all mammalian sequences (except genomes) try the following:
DB emblallmam [
   type: "Nucleotide"
   method: "emboss"
   format: "embl"
   directory: "$emboss_db_dir/embl"
   filename: "rel_std_rod*.dat rel_std_mus*.dat rel_std_hum*.dat rel_std_mam*.dat"
   release: "98.0"
   comment: "EMBL release 98.0"
]
As you can see from these two examples, the filename:
tag takes a space delimited list of filenames enclosed in quotes that can contain normal wildcard (?*
) characters. It can be quite tedious to set up a long list of sequences to search. In many cases you can use the exclude:
tag to make things easier:
DB emblnoest [
   type: "Nucleotide"
   method: "emboss"
   format: "embl"
   directory: "$emboss_db_dir/embl"
   filename: "*.dat"
   exclude: "rel_est*.dat"
   release: "98.0"
   comment: "EMBL release 98.0"
]
This configures the emblnoest database to contain all of EMBL except the ESTs.
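Before committing a filename: and exclude: pair to a configuration file, it can help to preview which files the patterns would select. The Python sketch below approximates this with the standard fnmatch module. It is an illustration only and not how EMBOSS itself matches files: fnmatch also accepts [..] character classes, which EMBOSS may not. The function name and the file name rel_std_hum_01_r98.dat are assumptions made for the example.

```python
from fnmatch import fnmatchcase


def select_files(names, filename, exclude=""):
    """Return names matching any `filename` pattern and no `exclude` pattern.

    Both arguments are space-separated lists of wildcard patterns,
    mirroring the form of the filename: and exclude: tags.
    """
    include_patterns = filename.split()
    exclude_patterns = exclude.split()
    selected = []
    for name in names:
        included = any(fnmatchcase(name, p) for p in include_patterns)
        excluded = any(fnmatchcase(name, p) for p in exclude_patterns)
        if included and not excluded:
            selected.append(name)
    return selected


files = ["rel_est_fun_01_r98.dat", "rel_std_hum_01_r98.dat", "wgs_cabc_pro.dat"]
print(select_files(files, "*.dat", exclude="rel_est*.dat"))
```

Run against a real directory listing (for example from os.listdir), this gives a quick sanity check that a subset definition covers the files intended.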
EMBOSS can access GCG formatted databases, thus avoiding having multiple copies of the same databases in different formats for those who still use GCG alongside the flatfiles. EMBOSS creates b+tree indices for the GCG format databases using the program dbxgcg. This runs in much the same way as dbxflat. You will need the GCG format .seq
and .ref
files in order to create an EMBOSS indexed database.
Move to the GCG database directory containing your data and run dbxgcg:
% dbxgcg
Index a GCG formatted database
Basename for index files: emblgcg
Resource name: embl
      EMBL : EMBL
     SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
   GENBANK : Genbank, DDBJ
       PIR : NBRF
Entry format [SWISS]: embl
Database directory [.]:
Wildcard database filename [*.seq]: *.seq
Wildcard database filename [*.seq]:
        id : ID
       acc : Accession number
        sv : Sequence Version and GI
       des : Description
       key : Keywords
       org : Taxonomy
Index fields [id,acc]:
General log output file [outfile.dbxgcg]: emblgcglog.dbxgcg
When dbxgcg prompts for the entry format:
Entry format [SWISS]:
you should enter the original database format before you ran embltogcg or similar to generate the GCG databases. The program will run for a while and will then generate the EMBOSS index files for the GCG format database.
The following entry should be put in your .embossrc
file:
DB gcgembl [
   type: "Nucleotide"
   method: "embossgcg"
   format: "embl"
   directory: "$emboss_db_dir/embl"
   filename: "*.dat"
   release: "98.0"
   comment: "EMBL release 98.0"
]
showdb
should show your newly configured database.
You can configure subsets of the databases in the same way as for the original format databases, as described above. One difference to dbxflat indexing is that both the .seq
and .header
files are listed in the [database].ent
file. The filename:
and exclude:
directives should therefore be of the form:
exclude: */rel_est*
instead of just:
exclude: */rel_est*.seq
BLAST format databases are generated for efficient homology searching using the BLAST programs. It can be convenient to avoid redundant copies of databases so EMBOSS provides a mechanism for accessing these databases.
BLAST format databases are those generated using the tools distributed with NCBI-BLAST or with WU-BLAST.
For indexing of one BLAST database, move to the directory containing your BLAST format databases and run dbiblast:
% dbiblast
Index a BLAST database
Database name: blastsw
Database directory [.]:
database base filename [blastsw]:
Release number [0.0]:
Index date [00/00/00]:
         N : nucleic
         P : protein
         ? : unknown
Sequence type [unknown]: p
         1 : wublast and setdb/pressdb
         2 : formatdb
         0 : unknown
Blast index version [unknown]: 2
The program will run for a while and will then generate the EMBLCD index files for the BLAST format database.
The following entry (or one like it that is more appropriate to your particular installation) should be put in your .embossrc
file:
DB blastsw [
   type: "Protein"
   method: "blast"
   format: "ncbi"
   directory: "$emboss_db_dir/blastsw"
   filename: "blastsw"
   release: "38.9"
   comment: "BLAST format Swissprot"
]
showdb should show your newly configured database.
Because of the way BLAST works, many sites group their BLAST databases in the same directory. You can index these in situ with dbiblast, but this may require some extra steps if your databases are not of the same type, because generating each subsequent set of index files will overwrite those that already exist. To avoid overwriting index files you can index many databases with one set of index files, or you can use the -indexdir option to place the indexes in a different directory.
There are two requirements for indexing several databases together in one index. The first is that the databases are the same type (protein/nucleic acid) and generated with the same tool (pressdb or formatdb); the second is that all the ID and accession numbers in the combined databases are unique.
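The uniqueness requirement can be checked before indexing if you can list the IDs and accession numbers in each database, for example from the FASTA files the BLAST databases were built from. The sketch below assumes those lists are already in hand and compares them case-sensitively. The function name is hypothetical; the database names are the ones used in the example session in this section.

```python
from collections import defaultdict


def duplicate_ids(ids_by_database):
    """Map each ID that occurs more than once to the databases it was seen in.

    `ids_by_database` maps a database name to an iterable of its IDs.
    An ID repeated within a single database is reported as well.
    """
    locations = defaultdict(list)
    for database, ids in ids_by_database.items():
        for entry_id in ids:
            locations[entry_id].append(database)
    return {entry_id: dbs for entry_id, dbs in locations.items() if len(dbs) > 1}


example = {
    "dbone": ["P00001", "P00002"],
    "dbtwo": ["P00002", "P00003"],
}
print(duplicate_ids(example))
```

An empty result suggests the databases can share one set of index files; any ID reported would need to be resolved, or the databases indexed separately.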
Run dbiblast as before but specify all the databases you wish to be included when prompted for the database filename:
% dbiblast
Index a BLAST database
Database name: alldbs
Database directory [.]:
database base filename [alldbs]: dbone dbtwo dbthree dbfour
Release number [0.0]:
Index date [00/00/00]:
         N : nucleic
         P : protein
         ? : unknown
Sequence type [unknown]: p
         1 : wublast and setdb/pressdb
         2 : formatdb
         0 : unknown
Blast index version [unknown]: 2
These can then be configured by using the filename:
and exclude:
tags as appropriate.
When you have databases of different types, generated with different programs or where the ID/accession numbers are duplicated between databases the preferred strategy is probably to keep the source data for the individual databases in separate directories and index them there.
Alternatively you can place the index files in a separate directory. This requires that you run dbiblast with the -indexdir option and set the indexdirectory: tag in the database configuration to point to the correct directory.
The example below illustrates database configuration using the -indexdir option:
% dbiblast -indexdir /databases/indices/mydb
Index a BLAST database
Database name: mydb
Database directory [.]:
database base filename [mydb]:
Release number [0.0]:
Index date [00/00/00]:
         N : nucleic
         P : protein
         ? : unknown
Sequence type [unknown]: p
         1 : wublast and setdb/pressdb
         2 : formatdb
         0 : unknown
Blast index version [unknown]: 2
The corresponding entry in .embossrc
or emboss.default
would look like:
DB mydb [
   type: "Protein"
   method: "blast"
   format: "ncbi"
   directory: "$emboss_db_dir/blastsw"
   indexdirectory: "/databases/indices/mydb"
   filename: "mydb"
   release: "1.0"
   comment: "My BLAST DB with an index in a different directory"
]
Again, multiple indexes cannot coexist in the same directory so care should be taken when using the -indexdir
option that an existing database index is not overwritten.
The FASTA specification defines a sequence file simply as a header line that begins with > followed by lines that contain the sequence. The header line can take a seemingly endless variety of forms, several of which can be processed by EMBOSS. EMBOSS attempts to determine the accession number and/or ID for each sequence. For indexing purposes there is no semantic difference between an accession number and an ID. In the real world, accession numbers should be immutable, i.e. they do not change with subsequent releases of the database, but IDs may change.
One of the programs that can be used to process FASTA format databases is dbxfasta. It can recognise the following header line formats, specified on the command line:
simple.
>id ...
idacc.
>id accno ...
gcgid.
>db:id ...
gcgidacc.
>db:id acc ...
dbid.
>db id ...
ncbi.
>...[|accno]|id ...
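To make the header layouts above concrete, the following Python sketch extracts an ID and, where present, an accession number from a header line for each of the whitespace-delimited formats. It is an illustration of the layouts as listed, not the parsing code used by dbxfasta. The function name is hypothetical, and the ncbi format is left out because its layout varies too much for a short example.

```python
def parse_header(line, idformat):
    """Return (id, accession) from a FASTA header line; accession may be None."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    tokens = line[1:].split()
    if not tokens:
        raise ValueError("empty header line")
    # The second token, if any, with surrounding parentheses removed.
    second = tokens[1].strip("()") if len(tokens) > 1 else None

    if idformat == "simple":      # >id ...
        return tokens[0], None
    if idformat == "idacc":       # >id accno ...
        return tokens[0], second
    if idformat == "gcgid":       # >db:id ...
        return tokens[0].split(":", 1)[-1], None
    if idformat == "gcgidacc":    # >db:id acc ...
        return tokens[0].split(":", 1)[-1], second
    if idformat == "dbid":        # >db id ...
        if second is None:
            raise ValueError("dbid header needs a database name and an ID")
        return second, None
    raise ValueError("unsupported header format: " + idformat)


print(parse_header(">embl:X56734 S46826 example entry", "gcgidacc"))
```

Trying a few header lines from your own file against the format you intend to give dbxfasta is a cheap way to spot a mismatch before a long indexing run.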
Other header formats will not be recognised by dbxfasta and will cause indexing and/or database lookup to fail. If you have a header format that dbxfasta cannot yet handle you have two options:
(The preferred option) Get a C programmer to modify the source code for dbxfasta and recompile. If you are a community-spirited person you will also contribute these changes to the main EMBOSS source tree. (email emboss-dev@emboss.open-bio.org for more information on contributing changes to the EMBOSS source code and/or read the EMBOSS developers documentation)
(The quick hack) Write a custom script (using e.g. BioPerl http://www.bioperl.org) to access your database and use method: external
to configure it. This is less desirable as you may be limited in the access modes you can use.
To index a FASTA format database, run dbxfasta:
% dbxfasta
Index a fasta file database using b+tree indices
Basename for index files: mydb
Resource name: myresdef
    simple : >ID
     idacc : >ID ACC or >ID (ACC)
     gcgid : >db:ID
  gcgidacc : >db:ID ACC
      dbid : >db ID
      ncbi : | formats
ID line format [idacc]: idacc
Database directory [.]:
Wildcard database filename [*.dat]: mydb.fasta
        id : ID
       acc : Accession number
        sv : Sequence Version and GI
       des : Description
Index fields [id,acc]: id,acc
General log output file [outfile.dbxfasta]: mydb.dbxfasta
dbxfasta will run for a while and will produce the index files. You can use the same -indexdir option as for dbxflat, dbxgcg and dbiblast to place the indexes in a different directory.
Place (e.g.) the following entry in your .embossrc
:
DB mydb [
   type: "Protein"
   method: "emboss"
   format: "fasta"
   directory: "$emboss_db_dir/mydb"
   filename: "mydb.fasta"
   comment: "My database"
]
The format: tag should be dbid, ncbi or fasta (the latter for every header format except dbid or ncbi). The same filename: and exclude: tags can be used as for the other database indexing programs.
Many institutions may have local databases set up in their own Laboratory Information Management System. EMBOSS provides a simple mechanism for interfacing with such systems.
As long as a program is available that can be called noninteractively and returns the specified sequence on standard output, EMBOSS can interface with it. Use method: app together with an app: tag giving the program command. The ID given in the Uniform Sequence Address (USA) will be appended to the command used to run the program. It is often best to specify the methods available using the method subsets methodall:, methodquery: and methodsingle: rather than using the generic method: tag.
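A program suitable for this kind of interface only has to accept the ID as its last argument and write the sequence to standard output. The Python sketch below shows the general shape, using a small in-memory dictionary in place of a real Laboratory Information Management System. The IDs, sequences and function names are all hypothetical, and FASTA is chosen as the output format purely for illustration.

```python
import sys

# Hypothetical stand-in for a lookup in a local LIMS.
SEQUENCES = {
    "seq1": "ACGTACGTAC",
    "seq2": "TTGACCGTAA",
}


def fetch_fasta(entry_id):
    """Return the entry as FASTA text, or None if the ID is unknown."""
    sequence = SEQUENCES.get(entry_id)
    if sequence is None:
        return None
    return ">{}\n{}\n".format(entry_id, sequence)


def main(argv):
    """Write the entry named by the last argument to standard output."""
    if len(argv) < 2:
        sys.stderr.write("usage: fetchseq ID\n")
        return 2
    record = fetch_fasta(argv[-1])
    if record is None:
        sys.stderr.write("unknown ID: {}\n".format(argv[-1]))
        return 1
    sys.stdout.write(record)
    return 0


# Only act as a command when an ID was actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

The script runs without prompting and signals failure through its exit status and standard error, so nothing other than the sequence appears on standard output.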
SRS is a powerful database querying system that can cross reference different databases, launch applications etc. SRS can be run either through a web interface (see the description of the SRSWWW
method above for an example) or via the command line program getz. Indexing and configuring databases for SRS is not described here, just how to connect to preconfigured and indexed SRS databases. If getz is already within the scope of your PATH
environment variable then insert the following (or similar) into your .embossrc
file:
DB emblgetz [
   type: N
   method: srs
   release: "98"
   format: embl
   comment: 'EMBL using getz'
   dbalias: embl
   app: getz
]
This will provide access to the SRS database embl
as emblgetz:acc
. If the SRS database has a different name from the DBNAME
(as is the case here) then the dbalias:
tag should be used to access the correct SRS database.
This configuration can be extremely slow for the all
access mode. It is probably a better idea to set up the database as follows:
DB emblgetz [
   type: "Nucleotide"
   methodquery: "srs"
   release: "63"
   format: "embl"
   comment: "EMBL using getz"
   dbalias: "embl"
   app: "getz"
   methodall: "direct"
   filename: "*.dat"
   directory: "$emboss_db_dir/embl"
]
This will use method: srs
for the query
access mode but will use method: direct
for the all
access mode, thus speeding up reading of the whole database.
The SRSFASTA
access method is identical to the normal SRS
method except that it returns the sequence in FASTA format and so does not need a format:
tag.
You might notice that the index files produced by the dbx* applications can be very large. This is normal and has three causes: a tree structure is used, the tree is not tightly packed, and 64-bit pointers are used throughout. The tree structure allows on-the-fly updating of the index, the loose packing speeds up construction and updating, and the 64-bit pointers allow very large source files to be addressed. Another consideration is that, in some cases, the indexes are trees of trees so that duplicate codes (e.g. keywords) can be indexed.