A feature is a region of interest in a molecular sequence. Features include things like restriction enzyme cut sites, protein secondary structure prediction states, exon positions, regions of motif matches etc.
A vast number of programs generate features in one form or another, leading to a huge number of file formats used for features. The output types range from graphical displays of where restriction enzymes cut, to probabilities of the three states of a protein secondary structure prediction along a sequence, to rigidly defined text tables of the start and end positions of predicted exons or motif matches.
To handle the diversity, EMBOSS, where possible, uses the well defined and flexible feature formats that were developed for the major sequence databases:
EMBL, GenBank, DDBJ
SwissProt
PIR, NBRF
GFF: the GFF3 format
GFF2: the older and less strict GFF2 format
The feature formats used by EMBOSS are identical to those used in the sequence database formats of the same name, e.g. EMBL feature format is the same as the corresponding subset of the EMBL sequence database format. This holds true regardless of whether features are written together with their sequence or in a raw feature table (see below).
Support for a set of standard feature formats makes programs interoperable: they can read and write each other's output without the need for file interconversion. As the EMBOSS project matures, the feature formats will become the default way of reporting features. This will also give a consistent look and feel, helping you to compare features in different sequences and from different programs more easily. For descriptions and examples of the supported formats see Section A.2, “Supported Feature Formats”.
The supported feature formats are summarised in the tables below, first for input and then for output. The columns are as follows: Format (format name), Nuc ("Yes" indicates nucleotide sequence data may be represented), Pro ("Yes" indicates protein sequence data may be represented) and Description (short description of the format).
Input Format | Nuc | Pro | Description |
---|---|---|---|
embl | Yes | No | embl/genbank/ddbj format |
gff2 | Yes | Yes | GFF version 1 or 2 |
gff3 | Yes | Yes | GFF version 3 |
pir | No | Yes | PIR format |
swiss | No | Yes | SwissProt format |
Output Format | Nuc | Pro | Description |
---|---|---|---|
dasgff | Yes | Yes | DAS GFF format |
debug | Yes | Yes | Debugging trace of full internal data content |
embl | Yes | No | embl format |
genbank | Yes | No | genbank format |
gff | Yes | Yes | GFF version 3 |
gff2 | Yes | Yes | GFF version 2 |
pir | No | Yes | PIR format |
swiss | No | Yes | SwissProt format |
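As a small illustration of how the Nuc and Pro columns can be used, the sketch below transcribes the output format table immediately above into a Python dictionary and lists the formats usable for a given sequence type. The names `OUTPUT_FEATURE_FORMATS` and `formats_for` are hypothetical and are not part of EMBOSS.

```python
# Output feature formats and the sequence types each can represent,
# transcribed from the output format table above.
OUTPUT_FEATURE_FORMATS = {
    "dasgff":  {"nuc": True,  "pro": True},
    "debug":   {"nuc": True,  "pro": True},
    "embl":    {"nuc": True,  "pro": False},
    "genbank": {"nuc": True,  "pro": False},
    "gff":     {"nuc": True,  "pro": True},
    "gff2":    {"nuc": True,  "pro": True},
    "pir":     {"nuc": False, "pro": True},
    "swiss":   {"nuc": False, "pro": True},
}


def formats_for(sequence_type):
    """Return the sorted output format names usable for 'nuc' or 'pro' data."""
    if sequence_type not in ("nuc", "pro"):
        raise ValueError("sequence_type must be 'nuc' or 'pro'")
    return sorted(
        name
        for name, supports in OUTPUT_FEATURE_FORMATS.items()
        if supports[sequence_type]
    )


print(formats_for("pro"))  # formats that can hold protein features
```

A lookup like this could be used to reject, for example, a request to write protein features in embl format before any output file is opened.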
In EMBOSS, a feature is a region of interest in a nucleic or protein sequence and is described by:
Name describing the feature
Start and end position
The sense (in a nucleic sequence)
The reading frame (in a translated nucleic sequence)
A score
Features may also explicitly or implicitly hold the name of the program or database that they are derived from and various other descriptive data (see the EMBOSS Developers Guide).
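The description above can be sketched as a simple record type. This is only an illustrative Python model, not the internal EMBOSS data structure (see the EMBOSS Developers Guide for that). The class name `Feature`, the field names, the `'+'`/`'-'` encoding of sense and the 1-based inclusive coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    """Illustrative model of one feature; field names are hypothetical."""
    name: str                      # name describing the feature, e.g. "exon"
    start: int                     # start position (assumed 1-based, inclusive)
    end: int                       # end position (assumed inclusive)
    sense: Optional[str] = None    # '+' or '-' (assumed encoding); nucleic only
    frame: Optional[int] = None    # reading frame; translated nucleic only
    score: Optional[float] = None  # score assigned by the program
    source: Optional[str] = None   # program or database the feature came from

    def __post_init__(self):
        # Simplifying assumption for this sketch: start never exceeds end.
        if self.start < 1 or self.end < self.start:
            raise ValueError("expected 1 <= start <= end")

    def length(self):
        """Number of residues covered by the feature."""
        return self.end - self.start + 1


# A feature table is then simply a group (here, a list) of features.
feature_table = [
    Feature("exon", 120, 310, sense="+", source="hypothetical_program"),
    Feature("exon", 450, 502, sense="+", source="hypothetical_program"),
]
print([f.length() for f in feature_table])
```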
A feature table is simply a group of features. Feature tables are stored in one of three ways:
As part of a sequence file
As part of a database entry
As a raw feature table: a file that does not contain the sequence the features refer to.
Most feature table definitions have a controlled vocabulary, i.e. there is a specified list of feature key names that can be used to label features. This means that a software developer cannot add features with new keys to a feature table: anyone editing a feature table must keep to the allowed set of feature keys.
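The effect of a controlled vocabulary can be sketched as a simple membership check. The key set below is a small illustrative subset of keys used in EMBL/GenBank/DDBJ feature tables, not the complete vocabulary of any format, and the names `ALLOWED_KEYS` and `unknown_keys` are hypothetical.

```python
# Illustrative subset of feature keys; a real format defines a much longer list.
ALLOWED_KEYS = {"CDS", "exon", "intron", "mRNA", "misc_feature"}


def unknown_keys(keys, allowed=ALLOWED_KEYS):
    """Return the keys that are not in the controlled vocabulary, in input order."""
    return [key for key in keys if key not in allowed]


edited_table_keys = ["exon", "my_new_key", "CDS"]
rejected = unknown_keys(edited_table_keys)
if rejected:
    print("Keys outside the controlled vocabulary:", rejected)
```

Running a check of this kind after hand-editing a feature table would flag any key that the chosen format does not allow.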
Some applications for handling generic sequence features are summarised below.
Application | Description |
---|---|
coderet | Extract CDS, mRNA and translations from feature tables. |
extractfeat | Extract features from a sequence. |
maskfeat | Mask off features of a sequence. |
showfeat | Show features of a sequence. |
twofeat | Finds neighbouring pairs of features in sequences. |
In addition, the diffseq and seqret applications also handle the feature table of an input sequence.
The Uniform Feature Object or UFO (Section 6.7, “The Uniform Feature Object (UFO)”) is the standard way, used by EMBOSS, of referring to feature input and output files on the command line. A UFO is used to specify a feature file by name and by the format of the features in the file.
Various qualifiers are provided for flexible handling of features on the command line (see Section 6.4, “Datatype-specific Command Line Qualifiers”). These allow you to set such things as file name and format and the region of the sequence containing the features of interest.
For example, the format of input features is specified with -fformat Format, where Format is the name of a supported feature format. Here, embl format is specified:
extractfeat myfile.feat -fformat embl
This could also have been specified in the UFO (Section 6.7, “The Uniform Feature Object (UFO)”) of the input file:
extractfeat embl:myfile.feat
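The two command lines above name the same file and format in different ways. The Python sketch below shows one way the equivalence could be modelled. It assumes the simple `format:filename` shape seen in the example, recognises a prefix only if it is one of the format names listed in the tables earlier in this section, and lets a prefix take precedence over a separately supplied format. The function `split_ufo` is hypothetical; EMBOSS's own UFO handling is described in Section 6.7 and may differ.

```python
# Format names taken from the feature format tables in this section.
SUPPORTED_FORMATS = {
    "dasgff", "debug", "embl", "genbank", "gff", "gff2", "gff3", "pir", "swiss",
}


def split_ufo(ufo, fformat=None):
    """Split a 'format:filename' string into (format, filename).

    If there is no recognised format prefix, fall back to the value that
    would have been given with -fformat (None if it was not given).
    """
    prefix, sep, rest = ufo.partition(":")
    if sep and rest and prefix.lower() in SUPPORTED_FORMATS:
        return prefix.lower(), rest
    return fformat, ufo


# Both ways of specifying the format resolve to the same pair.
print(split_ufo("embl:myfile.feat"))
print(split_ufo("myfile.feat", fformat="embl"))
```

Checking the prefix against the known format names means that a colon elsewhere in a file name is not mistaken for a format separator.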