4.4. Miscellaneous Database Integration

EMBOSS can be integrated with several common non-sequence biological databases. These are described in this section.

4.4.1. REBASE

REBASE is the restriction enzyme database maintained by New England Biolabs. It is needed by programs such as remap and restrict. The latest version of REBASE can be obtained by anonymous FTP (ftp.neb.com/pub/rebase/). EMBOSS needs the withrefm and proto files. After creating a REBASE subdirectory in the EMBOSS data directory, extract the data for EMBOSS with the program rebaseextract:

% mkdir /site/prog/emboss/data/REBASE
% rebaseextract
Extract data from REBASE
REBASE database withrefm file: /data/rebase/withrefm.208
REBASE database proto file: /data/rebase/proto.208

REBASE is now installed and ready to use.
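
As a minimal sketch of preparing the answers for the rebaseextract prompts, the shell function below looks in a download directory for the highest-numbered release that has both a withrefm and a proto file. The function name rebase_pair is hypothetical, and the assumption that the two files share a numeric suffix is taken only from the example file names above (withrefm.208, proto.208).

```shell
# Hypothetical helper, not part of EMBOSS.
# Assumption: REBASE files are named withrefm.NNN and proto.NNN with a
# matching numeric release suffix, as in the example session.
rebase_pair() {
    dir=$1
    best=
    for f in "$dir"/withrefm.*; do
        [ -f "$f" ] || continue
        n=${f##*.}                      # text after the last dot
        case $n in
            ''|*[!0-9]*) continue ;;    # skip non-numeric suffixes
        esac
        [ -f "$dir/proto.$n" ] || continue
        if [ -z "$best" ] || [ "$n" -gt "$best" ]; then
            best=$n
        fi
    done
    [ -n "$best" ] || return 1
    printf '%s\n%s\n' "$dir/withrefm.$best" "$dir/proto.$best"
}
```

Calling rebase_pair /data/rebase prints the two pathnames to type at the rebaseextract prompts, or returns a non-zero status if no complete pair is found.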

4.4.2. TRANSFAC

TRANSFAC is the transcription factor binding site database. It is available by anonymous FTP (ftp.ebi.ac.uk/pub/databases/transfac/). Unpacking the distribution yields a file called site.dat, which is the only file EMBOSS needs. Run tfextract to extract the data from TRANSFAC:

% tfextract
Extract data from TRANSFAC
Full pathname of transfac SITE.DAT: /databases/transfac/site.dat

tfscan can now access the TRANSFAC database.

4.4.3. PROSITE

PROSITE is a database of regular expressions that match potentially diagnostic regions for structural/functional classification of proteins. EMBOSS needs this database for the patmatmotifs program. PROSITE can be obtained via anonymous FTP from the EMBL-EBI. Download the prosite.dat and prosite.doc files to the same directory. Then run prosextract, specifying the download directory, to build the EMBOSS PROSITE database:

% prosextract
Builds the PROSITE motif database for patmatmotifs to search
Enter name of prosite directory: /data/prosite

PROSITE is now integrated into your EMBOSS installation.
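
Because prosextract is given a directory rather than individual files, it can help to confirm that both downloads are in place first. The following is a minimal sketch of such a check; the function name check_prosite_dir is hypothetical and not part of EMBOSS.

```shell
# Hypothetical pre-flight check, not part of EMBOSS.
# Confirms that prosite.dat and prosite.doc are both present in the
# directory that will be given to prosextract.
check_prosite_dir() {
    dir=$1
    rc=0
    for f in prosite.dat prosite.doc; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            rc=1
        fi
    done
    return $rc
}
```

For example, check_prosite_dir /data/prosite returns zero only when both files exist, and otherwise names each missing file on standard error.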

4.4.4. PRINTS

PRINTS is a database of diagnostic patterns of blocks of sequence homology in protein families. The PRINTS database can be searched using the EMBOSS program pscan. PRINTS can be obtained via anonymous FTP from the EMBL-EBI. The database is distributed as compressed files, which must be uncompressed with gzip before they are integrated into EMBOSS. PRINTS is integrated with EMBOSS using the program printsextract:

% printsextract
Extract data from PRINTS
Input file: /data/prints/prints28_0.dat

The PRINTS database is now integrated with EMBOSS.
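
The uncompression step mentioned above, which has to happen before printsextract is run, might be sketched as follows. The function name uncompress_prints is hypothetical, and the sketch assumes the downloaded files carry a .gz extension.

```shell
# Hypothetical helper, not part of EMBOSS.
# Assumption: the compressed PRINTS files end in .gz.
# Uncompresses every .gz file in the given directory in place.
uncompress_prints() {
    for gz in "$1"/*.gz; do
        [ -e "$gz" ] || continue    # no matches: the glob stays literal
        gzip -d "$gz"
    done
}
```

After running uncompress_prints /data/prints, the uncompressed .dat file can be given to printsextract as its input file.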

4.4.5. AAINDEX

An amino acid index is a set of 20 numerical values representing one of the physicochemical or biological properties of amino acids. The AAindex1 section of the Amino Acid Index Database is a collection of published indices together with the result of cluster analysis using the correlation coefficient as the distance between two indices. In release 4.0 of the database, this section contains 437 indices.

The EMBOSS programs pepwindow and pepwindowall plot hydrophobicity using the data from an AAINDEX entry. If AAINDEX is installed, these programs can also plot the other amino acid properties.

AAINDEX can be downloaded from http://www.genome.jp/aaindex/ and is integrated with EMBOSS using the program aaindexextract:

% aaindexextract
Extract data from AAINDEX
Full pathname of file aaindex1: /data/aaindex/aaindex1

The AAINDEX database is now integrated with EMBOSS.

4.4.6. CUTG

The CUTG database contains a series of codon usage tables calculated from GenBank. CUTG can be obtained via anonymous FTP from the EMBL-EBI server. CUTG is integrated with EMBOSS using the program cutgextract, which writes files to the CODONS data directory:

% cutgextract
Extract data from CUTG
CUTG directory [.]: /data/cutg/

The CUTG database is now integrated with EMBOSS.

4.4.7. JASPAR

Download and unzip the JASPAR Archive.zip file, then run jaspextract, specifying the FlatFileDir directory:

% jaspextract
Extract data from JASPAR
JASPAR database directory [.]: /data/jaspar/all_data/FlatFileDir

The JASPAR download page is http://jaspar.genereg.net/html/DOWNLOAD/.

4.4.8. Miscellaneous Data Files

Other data files should be kept in the data directory under the main EMBOSS installation.

Personal (user) data files can be kept in:

  • The current working directory

  • A subdirectory .embossdata of the current working directory

  • The user's home directory

  • A subdirectory .embossdata of the user's home directory

EMBOSS will search these locations in this order and will stop as soon as it finds a matching file. If the personal directories do not contain the desired file, EMBOSS will search the system-wide data directory (/share/EMBOSS/data/).

Apparently inexplicable errors when running EMBOSS programs may be caused by the system not using the data files one expects. The search path can be displayed in search order using the command embossdata.
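
To illustrate the search order described above, the following shell function checks the same locations in the same sequence and prints the first match. It is only a sketch of the documented behaviour, not the code EMBOSS itself uses; the function name find_emboss_data is hypothetical, and the system-wide directory is passed as a second argument because its real location depends on the installation.

```shell
# Hypothetical illustration of the documented lookup order, not EMBOSS code.
# $1 = data file name
# $2 = system-wide data directory (assumed default; site-specific)
find_emboss_data() {
    name=$1
    sysdir=${2:-/share/EMBOSS/data}
    for dir in "." "./.embossdata" "$HOME" "$HOME/.embossdata" "$sysdir"; do
        if [ -f "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"   # first match wins
            return 0
        fi
    done
    return 1                             # not found anywhere
}
```

A copy of a file in the current working directory therefore shadows one in the home directory, which in turn shadows the system-wide copy. This is the usual reason a program appears to use unexpected data.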

For more information on EMBOSS data files, see the EMBOSS Users Guide.