5.5. Introduction to Report Formats

5.5.1. What is a Report Format?

A wide variety of data formats is in current use. These often report the same kinds of records, but differences in format complicate their comparison. For different programs to interoperate, the output from one application must be reformatted so that it is suitable as input to another. This either requires programming skills that a typical EMBOSS user may not possess, or must be done manually, which is very time-consuming. EMBOSS aims to mitigate these difficulties by using a standard set of report formats for application output where possible.

Report formats have a consistent look and feel, making it easier to compare the output from different programs. It is often convenient for the same program to produce different report formats for different purposes. Depending on what you want to do with the result, it might be better to have a human-readable report for publication purposes, or a terser (and less readable) report for input into another program for further analysis.

5.5.2. Supported Report Formats

Report formats originated from the need to deal with EMBL, GenBank and PIR feature tables. It was therefore a natural choice to extend these to cope with other output data. All of the standard sequence feature tables (Section A.2, “Supported Feature Formats”) are also report formats. These formats include:

  • EMBL

  • GenBank

  • GFF

  • PIR

  • SwissProt

There are other formats that cater for more than standard sequence information. Each of these is named after the program that first generated that style of output, for example "dbmotif", "diffseq" and so on. For descriptions and examples of the supported formats see Section A.4, “Supported Report Formats”.

The supported report formats are summarised in the table below. The columns are as follows: Output Format (format name), Nuc ("Yes" indicates nucleotide sequence data may be represented), Pro ("Yes" indicates protein sequence data may be represented), Header (whether the standard EMBOSS report header is included), Seq (whether the sequence corresponding to the features is included), Tags (number of specific tag-values reported; a non-zero value suggests a format is not suitable for application output that does not generate these specific tags) and Description (short description of the format).

Table 5.6. Report formats
Output Format  Nuc  Pro  Header  Seq  Tags  Description
dasgff         Yes  Yes  No      No   0     DAS GFF feature format
dbmotif        Yes  Yes  Yes     Yes  0     Motif database hits
debug          Yes  Yes  Yes     Yes  0     Debugging trace of full internal data content
diffseq        Yes  Yes  Yes     Yes  7     Differences between a pair of sequences
embl           Yes  No   No      No   0     EMBL feature format
excel          Yes  Yes  No      No   0     Tab-delimited file for import to Microsoft Excel
feattable      Yes  Yes  No      No   0     EMBL format feature table with internal tags
genbank        Yes  No   No      No   0     Genbank feature format
gff            Yes  Yes  No      No   0     GFF feature format
listfile       Yes  Yes  Yes     No   0     EMBOSS list file of sequence USAs with ranges
motif          Yes  Yes  Yes     Yes  0     Motif report
nametable      Yes  Yes  Yes     No   0     Simple table with sequence name
pir            No   Yes  No      No   0     PIR feature format
regions        Yes  Yes  Yes     No   0     Annotated sequence regions
seqtable       Yes  Yes  Yes     Yes  0     Simple table with sequence on each line
srs            Yes  Yes  Yes     No   0     Simple report format for SRS
swiss          No   Yes  No      No   0     Swissprot feature format
table          Yes  Yes  Yes     No   0     Simple table
tagseq         Yes  Yes  Yes     No   0     Sequence with features marked below
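
As an illustration of how the columns of Table 5.6 might guide the choice of a format, the following Python sketch encodes a handful of rows from the table and selects the formats that support a given sequence type and report no program-specific tags. The names ReportFormat, FORMATS and general_purpose_formats are hypothetical and are not part of EMBOSS; only the Yes/No and Tags values are taken from the table.

```python
from typing import NamedTuple


class ReportFormat(NamedTuple):
    """One row of Table 5.6 (hypothetical helper type, not an EMBOSS API)."""
    name: str
    nuc: bool      # nucleotide data may be represented
    pro: bool      # protein data may be represented
    header: bool   # standard EMBOSS report header included
    seq: bool      # feature sequence included
    tags: int      # number of specific tag-values reported


# A few rows copied from Table 5.6.
FORMATS = [
    ReportFormat("diffseq", True, True, True, True, 7),
    ReportFormat("embl", True, False, False, False, 0),
    ReportFormat("excel", True, True, False, False, 0),
    ReportFormat("pir", False, True, False, False, 0),
    ReportFormat("seqtable", True, True, True, True, 0),
    ReportFormat("table", True, True, True, False, 0),
]


def general_purpose_formats(formats, protein, need_seq=False):
    """Names of formats for the sequence type that need no specific tags."""
    chosen = []
    for fmt in formats:
        supports_type = fmt.pro if protein else fmt.nuc
        if not supports_type or fmt.tags != 0:
            continue  # wrong sequence type, or tied to program-specific tags
        if need_seq and not fmt.seq:
            continue  # caller wants the feature sequence in the report
        chosen.append(fmt.name)
    return chosen


print(general_purpose_formats(FORMATS, protein=True, need_seq=True))
# prints ['seqtable']
```

Treating a non-zero Tags value as a reason to exclude a format follows the note in the column description above; a program that does generate those tags could of course still use such a format.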

5.5.3. Inside a Report

A report format defines the permitted layout and content of text in a file. This includes required text tokens and formatting conventions. Typically, the reports are used to describe features, usually short sequence motifs, in a sequence. The motif sequences, sequence identifier codes, sequence position numbers and the corresponding features are given in some form.

All of the report formats, excluding the tab-delimited (excel) format and those that are also standard sequence feature table formats (EMBL, GenBank etc.), have a block of reference information at the start of the report. This gives the program name, the date the report was generated, the output file name, the ID name of the sequence, the region of the sequence in which features are being reported and some of the parameters and statistics of the report. For example:

########################################
# Program: garnier
# Rundate: Fri 20 Feb 2009 11:16:24
# Commandline: garnier
#    [-sequence] uniprot:Q62671
#    -rformat seqtable
# Report_format: seqtable
# Report_file: ubr5_rat.garnier
########################################

#=======================================
#
# Sequence: UBR5_RAT     from: 1   to: 920
# HitCount: 214
#
# DCH = 0, DCS = 0
# 
#  Please cite:
#  Garnier, Osguthorpe and Robson (1978) J. Mol. Biol. 120:97-120
# 
#
#=======================================

The report data then follows, for example:

  Start     End garnier Sequence
      1      18 H       ARRERMTAREEASLRTLE
     19      20 T       GR
     21      24 H       RRAT
     25      25 C       L
     26      26 E       L
     27      27 C       S
     28      30 H       ARQ
     31      31 C       G
     32      35 E       MMSA
     36      38 T       RGD
     39      48 H       FLNYALSLMR
     49      51 C       SHN
     52      56 T       DEHSD
     57      59 E       VLP
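
Because the data rows in this example are plain whitespace-separated columns, they are easy to read back into another program. The following Python sketch assumes that the first non-comment line names the columns, that no field is empty or contains spaces, and that the Start and End columns are present; the function name parse_seqtable_rows is hypothetical and this is not an official EMBOSS parser.

```python
def parse_seqtable_rows(text):
    """Parse whitespace-separated report rows into a list of dicts.

    Assumes the first non-blank, non-comment line holds the column names.
    """
    rows = []
    columns = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comment lines
        fields = line.split()
        if columns is None:
            columns = fields  # e.g. Start, End, garnier, Sequence
            continue
        if len(fields) != len(columns):
            continue  # ignore rows that do not fit the assumed layout
        row = dict(zip(columns, fields))
        row["Start"] = int(row["Start"])
        row["End"] = int(row["End"])
        rows.append(row)
    return rows


SAMPLE_ROWS = """\
  Start     End garnier Sequence
      1      18 H       ARRERMTAREEASLRTLE
     19      20 T       GR
     21      24 H       RRAT
"""

for row in parse_seqtable_rows(SAMPLE_ROWS):
    # Positions are inclusive, so the span should equal the sequence length.
    span = row["End"] - row["Start"] + 1
    print(row["Start"], row["End"], row["garnier"], span == len(row["Sequence"]))
```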

The data are taken from the report feature table. Different report formats show different tags from the feature table. The tags that might be reported are as follows:

start

Start position of the match in a sequence

end

End position of the match in a sequence

length

Length of the match in a sequence

name

Name of the sequence

sequence

Sequence of the match

strand

DNA strand in which the feature occurs

tagname

Printable name of the feature

tagvalue

Feature in a sequence

type

Type of a feature

Finally, a block of summary information appears at the end of the report. Its contents vary depending on the application. For example:

#---------------------------------------
#
#  Residue totals: H:383   E:154   T:195   C:188
#         percent: H: 42.4 E: 17.0 T: 21.6 C: 20.8
#
#---------------------------------------

#---------------------------------------
# Total_sequences: 1
# Total_length: 920
# Reported_sequences: 1
# Reported_hitcount: 214
#---------------------------------------
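
Many of the header and summary lines shown in this section take the form "# Name: value", which makes them straightforward to extract. The sketch below is a minimal Python illustration under that assumption; the function name report_fields is hypothetical, lines that do not fit the pattern (such as the continuation lines of the command line or the "DCH = 0" line) are simply ignored, and a report covering several sequences would overwrite repeated names such as Sequence and HitCount.

```python
import re

# "# Name: value" where Name is letters/underscores directly followed by ":".
FIELD = re.compile(r"^#\s*([A-Za-z_]+):\s*(.*\S)\s*$")


def report_fields(text):
    """Collect '# Name: value' comment lines into a dict (later lines win)."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


SAMPLE = """\
# Program: garnier
# Rundate: Fri 20 Feb 2009 11:16:24
# Commandline: garnier
#    [-sequence] uniprot:Q62671
#    -rformat seqtable
# Report_format: seqtable
# Report_file: ubr5_rat.garnier
# Sequence: UBR5_RAT     from: 1   to: 920
# HitCount: 214
# DCH = 0, DCS = 0
# Total_sequences: 1
# Total_length: 920
# Reported_sequences: 1
# Reported_hitcount: 214
"""

fields = report_fields(SAMPLE)
print(fields["Program"], fields["Report_format"])
# For a single-sequence report the two hit counts should agree.
print(int(fields["HitCount"]) == int(fields["Reported_hitcount"]))
```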

5.5.4. Specifying Reports on the Command line

Reports are referred to on the EMBOSS command line exactly like a standard output file, i.e. by naming the report file explicitly.

There is a set of command line qualifiers that are used for report output (see Section 6.4, “Datatype-specific Command Line Qualifiers”). These allow you to set, for example, the name and format of the file and to control what data are written to the report.

For example, to set the output format of the garnier program to gff:

garnier -rformat gff
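
When an EMBOSS program is driven from a script, the same qualifiers can be assembled programmatically. The Python sketch below only builds and prints the command line; it does not run anything. The helper name report_command and the output file name ubr5_rat.gff are hypothetical, and the use of -sequence and -outfile as the input and report file qualifiers is an assumption based on the garnier example in Section 5.5.3; other programs may name these qualifiers differently.

```python
import shlex


def report_command(program, sequence, outfile, rformat):
    """Build an argument list requesting a specific report format.

    Assumes the program accepts -sequence and -outfile (as garnier does
    in the example report header) in addition to -rformat.
    """
    return [program,
            "-sequence", sequence,
            "-outfile", outfile,
            "-rformat", rformat]


args = report_command("garnier", "uniprot:Q62671", "ubr5_rat.gff", "gff")
print(shlex.join(args))
# prints garnier -sequence uniprot:Q62671 -outfile ubr5_rat.gff -rformat gff
```

On a system with EMBOSS installed, the list could be passed to subprocess.run(args, check=True) to execute the program.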

5.5.5. Applications that use Reports

EMBOSS programs that use reports for their output are summarised in the table below.

Applications that use Reports

Application     Description
antigenic       Finds antigenic sites in proteins
dan             Calculates DNA RNA/DNA melting temperature
diffseq         Find differences between nearly identical sequences
digest          Protein proteolytic enzyme or reagent cleavage digest
dreg            Regular expression search of a nucleotide sequence
equicktandem    Finds tandem repeats
etandem         Looks for tandem repeats in a nucleotide sequence
fuzznuc         Nucleic acid pattern search
fuzzpro         Protein pattern search
fuzztran        Protein pattern search after translation
garnier         Predicts protein secondary structure
helixturnhelix  Report nucleic acid binding motifs
marscan         Finds MAR/SAR sites in nucleic sequences
patmatdb        Search a protein sequence with a motif
patmatmotifs    Search a PROSITE motif database with a protein sequence
preg            Regular expression search of a protein sequence
psiphi          Phi and psi torsion angles from protein coordinates
recoder         Remove restriction sites but maintain same translation
restrict        Finds restriction enzyme cleavage sites
sigcleave       Reports protein signal cleavage sites
silent          Silent mutation restriction enzyme scan
sirna           Finds siRNA duplexes in mRNA
tcode           Fickett TESTCODE statistic to identify protein coding DNA
tmap            Displays membrane spanning regions
twofeat         Finds neighbouring pairs of features in sequences