There are a few things to consider when specifying attributes for a database:
Each database must have attributes that specify what it is and how to access it. This information is given as a set of key: value attribute pairs. These attributes are held in the DB definition structure (see above).
The key: value pairs in a DB structure can be specified either on separate lines or separated by spaces on the same line.
If the value part of the attribute contains spaces then it should be quoted to prevent it being prematurely terminated at the first space. For example, key: "value with many words in".
The minimum set of attribute keys is method: and format: (both are mandatory). It is also typical (but not mandatory) to specify the type: attribute.
Some forms of method: require subsidiary attributes giving further information on how to access the data.
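As a minimal sketch of the two layouts described above, the following shows one hypothetical database written first with one attribute per line and then with the attributes separated by spaces. The database names, directory and comment text are assumptions for illustration only; the attribute keys are those listed in this section.

```
# Hypothetical example: one attribute per line
DB mydb [
  type: "N"
  format: "embl"
  method: "emblcd"
  dir: "/data/mydb"
  comment: "value with many words in"
]

# Hypothetical example: the same attributes separated by spaces
DB mydbinline [ type: "N" format: "embl" method: "emblcd" dir: "/data/mydb" comment: "value with many words in" ]
```

The comment: value is quoted in both forms because it contains spaces. The two definitions are given different names only so that they can sit in the same file.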
The available attributes are described below (Table 4.1, “Attributes used to Specify a Database”).
Key | Value | Description |
---|---|---|
method methodall methodentry methodquery | srs srsfasta srswww url app external direct emblcd emboss entrez gcg embossgcg blast dbfetch mrs | Specifies the method used to access the database. |
format formatentry formatquery formatall | A valid sequence format name (see the EMBOSS Users Guide) | Specifies what sequence format to expect when reading entries from the database. |
type | N or P | Specifies whether the database is nucleic or protein. |
fields | One or more of: sv , des , org , key | Specifies which search fields have been indexed and are available for searching with. |
directory | Any valid directory path | Specifies the directory of files that have been specified with the filename: attribute. It also specifies the default directory of indexes and files produced by the dbi* and dbx* indexing programs (see indexdirectory: ). |
filename | A file name (may be wildcarded) or list of file names | Specifies the sequence file(s) to read in when accessing the database. |
exclude | A file name (may be wildcarded) or list of file names | This is used to exclude a subset of files from consideration. |
indexdirectory | Any valid directory path | Specifies the directory of index files (produced by the dbi* and dbx* programs) if this is different to the directory specified by directory: . |
url | Any valid URL | Specifies the URL to use when getting sequences from remote Web sites. |
httpversion | 1.0 or 1.1 | Specifies the HTTP protocol version to be used. Version 1.0 transmits the results in one block. Version 1.1 chunks data and is preferred for large data transfers. The default is 1.1. |
proxy | host:port | In the access methods srswww and url , you can specify a proxy host and port to use when accessing the URL. If a proxy is globally defined, it can be bypassed for any database by specifying ":" as an empty value. |
app appentry appquery appall | Any script or program name | Specifies the name or command line of an external (i.e. non-EMBOSS) program or script (application) that should be run to extract the sequence from the database. |
dbalias | The true name of a database | Specifies the name of a database at a remote site (e.g. an SRS server) where that name differs from the name given as the DBNAME . This allows the EMBOSS database definition to use another name (e.g. srsembl) or to specify a less obvious name when contacting the server (e.g. emblrelease). |
caseidmatch | Y or N | Flags databases that have case-sensitive identifiers. A boolean set to "Y" to define a database where identifiers can differ only in upper or lower case characters. An example is a sequence database derived from PDB entries where the chain identifiers 'a' and 'A' are not the same. |
hasaccession | Y or N | Flags databases that do not have access by accession number. A boolean set to "N" to define a database with no accession numbers (e.g. PDB used as a source of sequence data). |
comment | Any text | A comment, usually to describe the database. |
release | Any text | This is the release number or date. |
This specifies the method used to access the database.
This field is mandatory: at least one form of the method key must be specified. More than one type of method key can be specified.
If method: is specified, then this is the default method covering all forms of access ('query', 'entry' or 'all'). Specific methods for the 'query', 'entry' or 'all' forms of access (i.e. methodquery:, methodentry: or methodall:) should be specified explicitly if you wish to have several ways of accessing the data e.g.
method: "emblcd" methodall: "direct"
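To put the method variants in context, here is a hedged sketch of a complete definition that uses indexed access by default and direct file reading when all entries are requested. The database name, directory and file pattern are hypothetical; file: is included because the direct method needs to know which sequence files to read.

```
# Hypothetical example: indexed access by default, direct reading for 'all'
DB myembl [
  type: "N"
  format: "embl"
  method: "emblcd"
  methodall: "direct"
  dir: "/data/embl"
  file: "*.dat"
  comment: "EMBL with two access methods"
]
```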
The format: attribute specifies what sequence format to expect when reading entries from the database.
This attribute is mandatory. If you need to specify different formats for any of the different access methods (Section 4.3, “Database Access Methods”), then you may use the variants of format: with the suffix entry, query or all. An example of format: is:
format: ncbi
This specifies whether the database is nucleic or protein.
Although it is not strictly required, it is normal to specify the type of the database as this should be known. If the type is not specified it will be determined by the EMBOSS applications when they read sequences in. (You will not get error messages when you run showdb as this doesn't read in sequences.) The value Nucleotide or N specifies a nucleic database, Protein or P specifies a protein database, e.g.
type: "Nucleotide"
This specifies which search fields have been indexed and are available for searching.
It is assumed that Accession number and ID name are always available when a database is set up. Depending how you set up the database, access by one or more of these fields might be possible:
Field | Description |
---|---|
sv | Sequence Version or GI Number |
des | Description line |
org | Organism's taxonomic classification |
key | Keywords |
The access methods srs, srsfasta and srswww allow access to these search fields. The methods emboss, emblcd and gcg may or may not have some or all of these fields indexed, depending on the parameters given to the programs dbxflat, dbxgcg, dbiflat and dbigcg. The programs dbxfasta, dbiblast and dbifasta only allow you to select any of sv, des and acc (the default). An example specification is:
fields: "sv des org key"
The use of these fields in searches is described elsewhere (see the EMBOSS Users Guide).
FASTA format has only an ID and a parsable description line. If accession numbers are not defined then set hasaccession: "N" to turn off the default attempt to include this field in searches. A common case is the PDB protein structure database when used as a source of sequences, as PDB has no accession number system.
This specifies the directory of files that have been specified with the filename: attribute. It also specifies the directory of indexes and files produced by the dbx* or dbi* programs.
It is only required with the access methods (see Section 4.3, “Database Access Methods”):
emboss |
direct |
gcg |
emblcd |
blast |
It is common to use variables (see the EMBOSS Users Guide) to specify part or all of the path:
directory: $dbdir/genomes
This specifies the sequence file(s) to read in when accessing the database.
It is only required with the access method direct (see Section 4.3, “Database Access Methods”). It may also be used with the access methods:
emboss |
gcg |
emblcd |
blast |
to indicate which files should be included back in after using the exclude: attribute to specify which indexed files should be ignored (see exclude: below). The files may be wildcarded using *. The attribute key filename: is commonly abbreviated to file: e.g.
file: pir*.seq
A list of file names may also be given; each name must be separated with a space or comma.
This is used to exclude a subset of files from consideration.
To exclude certain files, specify exclude: *file*. This is used in conjunction with filename: to specify a subset of files in a directory. The exclude: attribute is checked first, then files are included back with the filename: attribute. The files searched are therefore:
- the files in the directory specified by directory:
- but not the exclude: files (if any)
- but include back the filename: files (if any)
e.g.
exclude: mouse.*
If you have indexed all of the files in the EMBL database, then you can specify subsets using the same set of files and indexes as:
DB embl [
  type: "N"
  format: "embl"
  method: "emblcd"
  dir: "/data/embl"
  comment: "All of EMBL"
]
DB emblminus [
  type: "N"
  format: "embl"
  method: "emblcd"
  dir: "/data/embl"
  exclude: "est*.dat"
  comment: "EMBL without the ESTs"
]
DB emblhumest [
  type: "N"
  format: "embl"
  method: "emblcd"
  dir: "/data/embl"
  exclude: "*.dat"
  filename: "est_hum*.dat"
  comment: "EMBL human ESTs"
]
DB human [
  type: "N"
  format: "embl"
  method: "emblcd"
  dir: "/data/embl"
  exclude: "*.dat"
  filename: "hum*.dat"
  comment: "EMBL human"
]
This specifies the directory of index files (produced by the dbx* or dbi* programs) if this is different to the directory specified by directory:.
For the dbi* applications it is sensible to hold the indexes in a different directory to the one holding the sequence database files when you have many sequence databases in the same directory. This is because the index files for every database all have the same names (acnum.hit, acnum.trg, division.lkp, etc.) and these would be overwritten if you indexed several databases in the same directory. In this case, you should create the indexes in a different directory (often but not necessarily a subdirectory) for each database, so that the index files of one database cannot replace those of another. These index directories can be specified using the attribute indexdirectory:, while the directory containing the sequence data files can still be specified using dir:.
It is only used with the access methods (see Section 4.3, “Database Access Methods”):
emboss |
gcg |
emblcd |
blast |
It is common to use variables to specify part or all of the path. The attribute key indexdirectory: is commonly abbreviated to indexdir: e.g.
indexdir: $dbdir/genomes/embl
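The arrangement described above, with several databases sharing one data directory but each keeping its own index directory, might be sketched as follows. All database names and paths here are hypothetical, and the sketch assumes the indexes were built with the dbi* programs into the directories shown.

```
# Hypothetical example: shared data directory, separate index directories
DB myembl [
  type: "N"
  format: "embl"
  method: "emblcd"
  dir: "/data/seq"
  indexdir: "/data/seq/index/embl"
]
DB myswiss [
  type: "P"
  format: "swiss"
  method: "emblcd"
  dir: "/data/seq"
  indexdir: "/data/seq/index/swiss"
]
```

Because each definition points at its own indexdir:, the identically named index files (acnum.hit, acnum.trg, division.lkp and so on) of the two databases never share a directory.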
This specifies the URL to use when retrieving sequences from remote Web sites.
It is only required with the access methods (see Section 4.3, “Database Access Methods”):
srswww |
url |
The database (or the name specified in a dbalias: attribute) and entry Accession number (or Sequence version, GI number, Description, Organism, or Keyword) can then be appended to create a functional SRS query line. Often it is only necessary to specify the remote wgetz application alone e.g.
url: "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
The URL can also contain one or more instances of the character pair %s. Each of these pairs is replaced by the value of the ID name when this database is accessed. Any HTML formatting will be stripped from the resulting web page e.g.
url: "http://www.ebi.ac.uk/htbin/emblfetch?%s"
# or
url: "http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=s&form=6&dopt=g&html=no&uid=%s"
The URL must begin with http:// and have a lower case host address.
In the access methods srswww, mrs, entrez, dbfetch and url, you can specify a proxy host and port to use when accessing the URL.
For example:
proxy: "proxy.mydomain.com:8888"
If the global variable EMBOSS_PROXY is defined in the emboss.default file (see the EMBOSS Users Guide) then the attribute proxy: ":" will turn off proxy access for this database. This is useful if the database is on an internal server.
In the access methods srswww, mrs, entrez, dbfetch and url, you can specify the HTTP protocol version to use when accessing the URL. The default version 1.1 supports delivery of results in chunks. The older 1.0 protocol can only deliver all results in one block.
For example:
httpversion: "1.0"
If the global variable EMBOSS_HTTPVERSION is defined in the emboss.default file (see the EMBOSS Users Guide) then this will set a global default for all URL-based data access. The default is 1.1.
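The proxy and HTTP version attributes could be combined in URL-based definitions along the following lines. This is only a sketch: the database names, the proxy host and the internal server address are hypothetical, and the first URL is simply the example URL used earlier in this section.

```
# Hypothetical example: remote database reached through a proxy, using HTTP 1.0
DB remoteembl [
  type: "N"
  format: "embl"
  method: "url"
  url: "http://www.ebi.ac.uk/htbin/emblfetch?%s"
  proxy: "proxy.mydomain.com:8888"
  httpversion: "1.0"
]

# Hypothetical example: internal server that bypasses any globally defined proxy
DB localembl [
  type: "N"
  format: "embl"
  method: "url"
  url: "http://intranet.mydomain.com/cgi-bin/fetch?%s"
  proxy: ":"
]
```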
This specifies the command line of an external (third party) application that should be run to extract a sequence from a database.
This application can be in the user's path or have an explicit path provided. The database and entry name will be appended to the application command as dbname:entry. Both ID and Accession number can be used to specify the entry. Alternatively, if the application app: attribute value contains the character pair %s, it is replaced by the value of the ID name or Accession number when this database is accessed.
This attribute is only required with the access method app (see Section 4.3, “Database Access Methods”). If you need to specify different applications for any of the different access methods, then you may use the variants of app: with the suffix entry, query or all, e.g.
app: efetch
# or
app: "getz [embl:%s]"
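For context, a full definition using the app access method might look like the sketch below. The database name and comment are hypothetical, the command line is the one from the example above, and the sketch assumes the external program writes the entry to standard output in the format named by format:.

```
# Hypothetical example: entries fetched by an external program
DB appembl [
  type: "N"
  format: "embl"
  method: "app"
  app: "getz [embl:%s]"
  comment: "EMBL entries retrieved by an external application"
]
```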
This is used to specify the name of a database at a remote site (e.g. an SRS server) where that name differs from the DBNAME.
It is only required with the access methods (see Section 4.3, “Database Access Methods”):
mrs |
mrs3 |
srswww |
srsfasta |
srs |
e.g.
dbalias: emblnew
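As an illustration of dbalias: in a complete definition, the sketch below defines a local name srsembl for a database that the remote server is assumed to know as embl. The local name and the alias value are assumptions for illustration; the URL is the wgetz example given earlier in this section.

```
# Hypothetical example: local name "srsembl" maps to the server's name "embl"
DB srsembl [
  type: "N"
  format: "embl"
  method: "srswww"
  url: "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
  dbalias: "embl"
  fields: "sv des org key"
  comment: "EMBL via a remote SRS server"
]
```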
This is a comment to describe the database.
It is displayed in showdb e.g.
comment: "This is my subset of refseq"
This is the release number or date.
It is displayed in showdb.
Unless you are zealous in updating release: values, this will rapidly become out of sync with the actual data.
The dbx* and dbi* indexing programs ask for the database name, release number and index date. These are stored in the index files. This information is not available to EMBOSS programs and is not reported by showdb. They are part of the index file formats, but EMBOSS does not currently make use of them.
release: "21.0 (Oct 2009)"
This turns off attempts to read data by accession number.
Most sequence databases follow the example set by the major public protein and nucleotide resources by providing unique accession numbers. Where these are not available the accession number search can be disabled by defining:
hasaccession: "N"
This makes identifier tests case-sensitive.
Most sequence databases attach no significance to upper or lower case for identifiers. In a few cases, especially in site-specific local data, there may be a distinction between two otherwise identical names. An early example was a database of sequences derived from PDB where the chain name 'a' or 'A' in the identifier was significant.
caseidmatch: "Y"
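The two flags above are often relevant to the same kind of database. A hedged sketch for a local FASTA file of PDB-derived chain sequences, assumed here to have been indexed with dbifasta, could read as follows; the database name, directory and comment are hypothetical.

```
# Hypothetical example: PDB-derived sequences with no accession numbers
# and case-sensitive chain identifiers
DB mypdbseq [
  type: "P"
  format: "fasta"
  method: "emblcd"
  dir: "/data/pdbseq"
  hasaccession: "N"
  caseidmatch: "Y"
  comment: "Sequences derived from PDB chains"
]
```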