EMBOSS can be integrated with several common non-sequence biological databases. These are described in this section.
REBASE is the restriction enzyme database maintained by New England Biolabs. It is needed for programs such as remap and restrict. The latest version of REBASE can be obtained by anonymous FTP (ftp.neb.com/pub/rebase/). EMBOSS needs the withrefm and proto files. The data is extracted for EMBOSS with the program rebaseextract:
% mkdir /site/prog/emboss/data/REBASE
% rebaseextract
Extract data from REBASE
REBASE database withrefm file: /data/rebase/withrefm.208
REBASE database proto file: /data/rebase/proto.208
REBASE is now installed and ready to use.
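Because rebaseextract prompts for two input files, it can help to confirm that both are present before running it. The following is a minimal sketch, assuming both files sit in one directory and share the same release suffix (208 in the example above). The function name check_rebase_files is hypothetical and is not part of EMBOSS.

```shell
# Hypothetical helper (not part of EMBOSS): confirm that both REBASE
# input files for a given release exist and are non-empty.
# Usage: check_rebase_files DIRECTORY RELEASE
check_rebase_files() {
    dir=$1
    release=$2
    status=0
    for name in withrefm proto; do
        file="$dir/$name.$release"
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$file" ]; then
            echo "missing or empty: $file" >&2
            status=1
        fi
    done
    return $status
}
```

With the paths used above, the call would be check_rebase_files /data/rebase 208; a non-zero exit status means at least one file still needs to be downloaded.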
TRANSFAC is the transcription factor binding site database. It is available by anonymous FTP (ftp.ebi.ac.uk/pub/databases/transfac/). Unpacking the distribution reveals a file called site.dat, which is the one EMBOSS needs. Run tfextract to extract the data from TRANSFAC:
% tfextract
Extract data from TRANSFAC
Full pathname of transfac SITE.DAT: /databases/transfac/site.dat
tfscan can now access the TRANSFAC database.
PROSITE is a database of regular expressions that match potentially diagnostic regions for structural/functional classification of proteins. EMBOSS needs this database for the patmatmotifs program. PROSITE can be obtained via anonymous FTP from the EMBL-EBI. Download the prosite.dat and prosite.doc files to the same directory. Then run prosextract to build the EMBOSS PROSITE database, specifying the download directory:
% prosextract
Builds the PROSITE motif database for patmatmotifs to search
Enter name of prosite directory: /data/prosite
PROSITE is now integrated into your EMBOSS installation.
PRINTS is a database of diagnostic patterns of blocks of sequence homology in protein families. The PRINTS database can be searched using the EMBOSS program pscan. PRINTS can be obtained via anonymous FTP from the EMBL-EBI. The database is made available as compressed files which should be uncompressed using gzip before integrating them into EMBOSS. PRINTS is integrated with EMBOSS using the program printsextract:
% printsextract
Extract data from PRINTS
Input file: /data/prints/prints28_0.dat
The PRINTS database is now integrated with EMBOSS.
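The uncompress step that precedes printsextract can be sketched as follows. This is only an illustration: the function name unpack_gz is hypothetical, and it assumes the downloaded file carries a .gz suffix. It uses gzip -dc, which writes the decompressed data to standard output and leaves the downloaded archive in place.

```shell
# Hypothetical helper (not part of EMBOSS): decompress FILE.gz to FILE,
# keeping the downloaded archive, and print the name of the new file.
unpack_gz() {
    src=$1
    case "$src" in
        *.gz) dest=${src%.gz} ;;   # strip the .gz suffix for the output name
        *) echo "not a .gz file: $src" >&2; return 1 ;;
    esac
    if gzip -dc "$src" > "$dest"; then
        echo "$dest"
    else
        # do not leave a partial file behind if decompression fails
        rm -f "$dest"
        return 1
    fi
}
```

The printed file name is the one to give at the printsextract "Input file:" prompt.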
An amino acid index is a set of 20 numerical values representing any of the different physicochemical and biological properties of amino acids. The AAindex1 section of the Amino Acid Index Database is a collection of published indices together with the result of cluster analysis using the correlation coefficient as the distance between two indices. This section currently contains 437 indices in release 4.0 of the database.
The EMBOSS programs pepwindow and pepwindowall plot hydrophobicity using the data from an AAindex entry. If AAindex is installed, these programs can also plot the other amino acid properties.
AAindex can be downloaded (http://www.genome.jp/aaindex/) and is integrated with EMBOSS using the program aaindexextract:
% aaindexextract
Extract data from AAINDEX
Full pathname of file aaindex1: /data/aaindex/aaindex1
The AAINDEX database is now integrated with EMBOSS.
The CUTG database contains a series of codon usage tables calculated from GenBank. CUTG can be obtained via anonymous FTP from the EMBL-EBI server. CUTG is integrated with EMBOSS using the program cutgextract, which writes files to the CODONS data directory:
% cutgextract
Extract data from CUTG
CUTG directory [.]: /data/cutg/
The CUTG database is now integrated with EMBOSS.
Download and unzip the Archive.zip file and then run jaspextract, specifying the FlatFileDir directory:
% jaspextract
Extract data from JASPAR
JASPAR database directory [.]: /data/jaspar/all_data/FlatFileDir
Other data files should be kept in the data directory under the main EMBOSS installation.
Personal (user) data files can be kept in:
The current working directory
A subdirectory .embossdata of the current directory
The user's home directory
A subdirectory .embossdata of the user's home directory
EMBOSS will search these locations in this order and will stop as soon as it finds a matching file. If the personal directories do not contain the desired file, EMBOSS will search the system-wide data directory (/share/EMBOSS/data/).
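The search order described above can be mirrored in a few lines of shell. This is a sketch of the order only, not the lookup EMBOSS itself performs. The function name find_emboss_data is hypothetical, and the system-wide directory is passed as an argument because its location depends on the installation.

```shell
# Hypothetical helper (not part of EMBOSS): print the first location that
# holds the data file named in $1, checking the personal locations in the
# documented order and falling back to the system-wide directory in $2.
find_emboss_data() {
    name=$1
    system_dir=$2
    for dir in "$PWD" "$PWD/.embossdata" "$HOME" "$HOME/.embossdata" "$system_dir"; do
        if [ -f "$dir/$name" ]; then
            echo "$dir/$name"
            return 0   # stop at the first match, as EMBOSS is described to do
        fi
    done
    return 1
}
```

A copy of a data file in the current working directory therefore shadows every other copy, which is a common source of unexpected results.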
Apparently inexplicable errors when running EMBOSS programs may be caused by the system not using the data files one expects. The search path can be displayed in search order using the command embossdata.
For more information on EMBOSS data files, see the EMBOSS Users Guide.