A very wide variety of data formats is currently in use. These formats often share common records in the data they report, but differences between them complicate comparison. For different programs to interoperate, the output from one application must be reformatted so that it is suitable as input to another. Reformatting requires programming skills that a typical EMBOSS user may not possess; otherwise it must be done manually, which is very time-consuming. EMBOSS aims to mitigate these difficulties by using a standard set of report formats for application output where possible.
Report formats have a consistent look and feel, making it easier to compare the output from different programs. It is often convenient for the same program to produce different report formats for different purposes. Depending on what you want to do with the result, it might be better to have a human-readable report for publication, or a terser (and less readable) report for input into another program for further analysis.
Report formats originated from the need to deal with EMBL, GenBank and PIR feature tables. It was therefore a natural choice to extend these to cope with other output data. All of the standard sequence feature tables (Section A.2, “Supported Feature Formats”) are also report formats. The formats include:
EMBL
GenBank
GFF
PIR
SwissProt
Other formats cater for more than standard sequence information. Each of these is named after the program that first generated that style of output, for example "dbmotif", "diffseq" and so on. The output options are very flexible. For descriptions and examples of the supported formats see Section A.4, “Supported Report Formats”.
The supported report formats are summarised in the table below. The columns are as follows: Output Format (format name), Nuc ("Yes" indicates nucleotide sequence data may be represented), Pro ("Yes" indicates protein sequence data may be represented), Header (whether the standard EMBOSS report header is included), Seq (whether the sequence corresponding to the features is included), Tags (number of specific tag-values reported; a non-zero value suggests a format is not suitable for application output that does not generate these specific tags) and Description (short description of the format).
Output Format | Nuc | Pro | Header | Seq | Tags | Description |
---|---|---|---|---|---|---|
dasgff | Yes | Yes | No | No | 0 | DAS GFF feature format |
dbmotif | Yes | Yes | Yes | Yes | 0 | Motif database hits |
debug | Yes | Yes | Yes | Yes | 0 | Debugging trace of full internal data content |
diffseq | Yes | Yes | Yes | Yes | 7 | Differences between a pair of sequences |
embl | Yes | No | No | No | 0 | EMBL feature format |
excel | Yes | Yes | No | No | 0 | Tab-delimited file for import to Microsoft Excel |
feattable | Yes | Yes | No | No | 0 | EMBL format feature table with internal tags |
genbank | Yes | No | No | No | 0 | Genbank feature format |
gff | Yes | Yes | No | No | 0 | GFF feature format |
listfile | Yes | Yes | Yes | No | 0 | EMBOSS list file of sequence USAs with ranges |
motif | Yes | Yes | Yes | Yes | 0 | Motif report |
nametable | Yes | Yes | Yes | No | 0 | Simple table with sequence name |
pir | No | Yes | No | No | 0 | PIR feature format |
regions | Yes | Yes | Yes | No | 0 | Annotated sequence regions |
seqtable | Yes | Yes | Yes | Yes | 0 | Simple table with sequence on each line |
srs | Yes | Yes | Yes | No | 0 | Simple report format for SRS |
swiss | No | Yes | No | No | 0 | Swissprot feature format |
table | Yes | Yes | Yes | No | 0 | Simple table |
tagseq | Yes | Yes | Yes | No | 0 | Sequence with features marked below |
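The table can also be treated as a small lookup structure when choosing a format programmatically. The following is a minimal sketch in Python, not part of EMBOSS: the values are transcribed from the table above, and the names `ReportFormat`, `FORMATS` and `candidate_formats` are hypothetical.

```python
from typing import NamedTuple

class ReportFormat(NamedTuple):
    name: str
    nuc: bool      # may represent nucleotide data
    pro: bool      # may represent protein data
    header: bool   # includes the standard EMBOSS report header
    seq: bool      # includes the sequence of each feature
    tags: int      # number of specific tag-values reported

# Rows transcribed from the table above: name, Nuc, Pro, Header, Seq, Tags.
_ROWS = """
dasgff Y Y N N 0
dbmotif Y Y Y Y 0
debug Y Y Y Y 0
diffseq Y Y Y Y 7
embl Y N N N 0
excel Y Y N N 0
feattable Y Y N N 0
genbank Y N N N 0
gff Y Y N N 0
listfile Y Y Y N 0
motif Y Y Y Y 0
nametable Y Y Y N 0
pir N Y N N 0
regions Y Y Y N 0
seqtable Y Y Y Y 0
srs Y Y Y N 0
swiss N Y N N 0
table Y Y Y N 0
tagseq Y Y Y N 0
"""

FORMATS = []
for row in _ROWS.split("\n"):
    if row.strip():
        name, nuc, pro, header, seq, tags = row.split()
        FORMATS.append(ReportFormat(name, nuc == "Y", pro == "Y",
                                    header == "Y", seq == "Y", int(tags)))

def candidate_formats(protein, with_sequence=False, generic_only=True):
    """Return names of formats matching the requested properties."""
    names = []
    for fmt in FORMATS:
        if protein and not fmt.pro:
            continue          # format cannot represent protein data
        if not protein and not fmt.nuc:
            continue          # format cannot represent nucleotide data
        if with_sequence and not fmt.seq:
            continue          # caller wants the feature sequence included
        if generic_only and fmt.tags != 0:
            continue          # format expects program-specific tags
        names.append(fmt.name)
    return names

print(candidate_formats(protein=True, with_sequence=True))
```

Filtering on the Tags column reflects the note above that a non-zero value suggests the format is tied to the output of a particular program.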
A report format defines the permitted layout and content of text in a file. This includes required text tokens and formatting conventions. Typically, the reports are used to describe features, usually short sequence motifs, in a sequence. The motif sequences, sequence identifier codes, sequence position numbers and the corresponding features are given in some form.
All of the report formats, excluding the TAB-delimited (excel) format and those that are also standard sequence feature table formats (EMBL, GenBank etc.), have a block of reference information at the start of the report. This gives the program name, the date the report was generated, the output file name, the ID name of the sequence, the region of the sequence in which features are being reported, and some of the parameters and statistics of the report. For example:
    ########################################
    # Program: garnier
    # Rundate: Fri 20 Feb 2009 11:16:24
    # Commandline: garnier
    #    [-sequence] uniprot:Q62671
    #    -rformat seqtable
    # Report_format: seqtable
    # Report_file: ubr5_rat.garnier
    ########################################
    #=======================================
    #
    # Sequence: UBR5_RAT from: 1 to: 920
    # HitCount: 214
    #
    # DCH = 0, DCS = 0
    #
    # Please cite:
    # Garnier, Osguthorpe and Robson (1978) J. Mol. Biol. 120:97-120
    #
    #
    #=======================================
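Because the header consists of comment lines in a "Key: value" style, it is straightforward to read back in a script. The following Python sketch is an illustration only and is not part of EMBOSS: the function name `parse_report_header` is hypothetical, and it assumes each field occupies one line beginning with `#`, as in the first block of the example above.

```python
import re

# First block of the example header, one field per line (assumed layout).
SAMPLE_HEADER = """\
########################################
# Program: garnier
# Rundate: Fri 20 Feb 2009 11:16:24
# Commandline: garnier
#    [-sequence] uniprot:Q62671
#    -rformat seqtable
# Report_format: seqtable
# Report_file: ubr5_rat.garnier
########################################
"""

def parse_report_header(text):
    """Collect 'Key: value' pairs from the leading comment block of a report."""
    fields = {}
    for line in text.splitlines():
        if not line.startswith("#"):
            break  # the first non-comment line ends the header
        # Keys are single words (letters, digits, underscore) followed by a colon.
        match = re.match(r"#[ \t]*(\w+):[ \t]*(.*)", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields

header = parse_report_header(SAMPLE_HEADER)
print(header["Program"], header["Report_format"], header["Report_file"])
```

Continuation lines of the command line (those starting with `[` or `-`) and the rule lines made of `#` characters do not match the pattern and are skipped.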
The report data then follows, for example:
    Start  End  garnier  Sequence
        1   18  H        ARRERMTAREEASLRTLE
       19   20  T        GR
       21   24  H        RRAT
       25   25  C        L
       26   26  E        L
       27   27  C        S
       28   30  H        ARQ
       31   31  C        G
       32   35  E        MMSA
       36   38  T        RGD
       39   48  H        FLNYALSLMR
       49   51  C        SHN
       52   56  T        DEHSD
       57   59  E        VLP
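A terse tabular report such as this is convenient as input to further analysis. As a rough sketch (not an EMBOSS API), the rows could be read into records in Python as follows. The names `Hit` and `parse_seqtable` are hypothetical, and the code assumes the four columns are separated by whitespace, as in the example; only the first few rows are reproduced.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    start: int
    end: int
    state: str      # value of the "garnier" column (H, E, T or C)
    sequence: str   # residues covered by the feature

SAMPLE_ROWS = """\
Start End garnier Sequence
1 18 H ARRERMTAREEASLRTLE
19 20 T GR
21 24 H RRAT
25 25 C L
"""

def parse_seqtable(text):
    """Parse whitespace-separated seqtable rows into Hit records."""
    hits = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, comment lines and the column-title line.
        if len(parts) != 4 or not parts[0].isdigit():
            continue
        start, end, state, sequence = parts
        hits.append(Hit(int(start), int(end), state, sequence))
    return hits

hits = parse_seqtable(SAMPLE_ROWS)
for hit in hits:
    # In this example the reported residues span exactly start..end.
    assert len(hit.sequence) == hit.end - hit.start + 1
print(len(hits), "features parsed")
```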
The data are taken from the report feature table. Different report formats show different tags from the feature table. The tags that might be reported are as follows:
start
Start position of the match in a sequence
end
End position of the match in a sequence
length
Length of the match in a sequence
name
Name of the sequence
sequence
Sequence of the match
strand
DNA strand in which the feature occurs
tagname
Printable name of the feature
tagvalue
Feature in a sequence
type
Type of a feature
Finally, there is a block of information at the end of the report with summary information. The contents vary depending on the application. For example:
    #---------------------------------------
    #
    # Residue totals: H:383 E:154 T:195 C:188
    # percent: H: 42.4 E: 17.0 T: 21.6 C: 20.8
    #
    #---------------------------------------
    #---------------------------------------
    # Total_sequences: 1
    # Total_length: 920
    # Reported_sequences: 1
    # Reported_hitcount: 214
    #---------------------------------------
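The final totals block can be read in the same spirit, for instance to confirm that a downstream script saw as many features as the report claims. The Python sketch below is an illustration under the assumption that each total sits on its own `# Key: integer` line; the function name `parse_report_totals` is hypothetical.

```python
import re

# Final block of the example footer, one total per line (assumed layout).
SAMPLE_FOOTER = """\
#---------------------------------------
# Total_sequences: 1
# Total_length: 920
# Reported_sequences: 1
# Reported_hitcount: 214
#---------------------------------------
"""

def parse_report_totals(text):
    """Return the integer-valued 'Key: value' lines of a report footer."""
    totals = {}
    pattern = re.compile(r"^#[ \t]*(\w+):[ \t]*(\d+)[ \t]*$", re.MULTILINE)
    for match in pattern.finditer(text):
        totals[match.group(1)] = int(match.group(2))
    return totals

totals = parse_report_totals(SAMPLE_FOOTER)
print(totals["Reported_hitcount"], "hits over", totals["Total_length"], "residues")
```

In the example report, `Reported_hitcount` (214) agrees with the `HitCount` value given in the header for the single sequence reported.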
Reports are referred to on the EMBOSS command line exactly like a standard output file, i.e. by naming the report file explicitly.
There is a set of command line qualifiers that are used for report output (see Section 6.4, “Datatype-specific Command Line Qualifiers”). These allow you to set, for example, the name and format of the file and to control what data are written to the report.
For example, to set the output of the garnier program to gff:
garnier -rformat gff
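When a report-producing program is driven from a script, the same qualifier can be passed as part of an argument list. The following Python sketch only assembles and prints such a command; it assumes the `-sequence` and `-outfile` qualifiers of garnier alongside `-rformat`, and the helper names and the output file name `ubr5_rat.gff` are hypothetical. Running the command requires an EMBOSS installation with the relevant database configured, so execution is left to an explicit call.

```python
import shlex
import shutil
import subprocess

def build_report_command(program, sequence, outfile, rformat):
    """Assemble an EMBOSS command line that requests a given report format."""
    return [program, "-sequence", sequence, "-outfile", outfile,
            "-rformat", rformat]

def run_if_available(command):
    """Run the command only if the program is on PATH; return True if run."""
    if shutil.which(command[0]) is None:
        return False  # EMBOSS not installed here, nothing is executed
    subprocess.run(command, check=True)
    return True

command = build_report_command("garnier", "uniprot:Q62671",
                               "ubr5_rat.gff", "gff")
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems if a sequence address or file name contains unusual characters.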
EMBOSS programs that use reports for their output are summarised in the table below.
Application | Description |
---|---|
antigenic | Finds antigenic sites in proteins |
dan | Calculates DNA RNA/DNA melting temperature |
diffseq | Find differences between nearly identical sequences |
digest | Protein proteolytic enzyme or reagent cleavage digest |
dreg | Regular expression search of a nucleotide sequence |
equicktandem | Finds tandem repeats |
etandem | Looks for tandem repeats in a nucleotide sequence |
fuzznuc | Nucleic acid pattern search |
fuzzpro | Protein pattern search |
fuzztran | Protein pattern search after translation |
garnier | Predicts protein secondary structure |
helixturnhelix | Report nucleic acid binding motifs |
marscan | Finds MAR/SAR sites in nucleic sequences |
patmatdb | Search a protein sequence with a motif |
patmatmotifs | Search a PROSITE motif database with a protein sequence |
preg | Regular expression search of a protein sequence |
psiphi | Phi and psi torsion angles from protein coordinates |
recoder | Remove restriction sites but maintain same translation |
restrict | Finds restriction enzyme cleavage sites |
sigcleave | Reports protein signal cleavage sites |
silent | Silent mutation restriction enzyme scan |
sirna | Finds siRNA duplexes in mRNA |
tcode | Fickett TESTCODE statistic to identify protein coding DNA |
tmap | Displays membrane spanning regions |
twofeat | Finds neighbouring pairs of features in sequences |