5.5. Introduction to Report Formats

5.5.1. What is a Report Format?

A very wide variety of data formats are currently used. Often these share common records in the data reported, but format differences complicate their comparison. For different programs to interoperate, the output from one application must be reformatted so that it is suitable as input to another. This either requires programming skills that a typical EMBOSS user may not possess, or must be done manually, which is very time-consuming. EMBOSS aims to mitigate these difficulties by using a standard set of report formats for application output where possible.

Report formats have a consistent look and feel, making it easier to compare the output from different programs. It is often convenient to have the same program produce different report formats for different purposes. Depending on what you want to do with the result, it might be better to have a human-readable report for publication, or a terser (and less readable) report as input to another program for further analysis.

5.5.2. Supported Report Formats

Report formats originate from the need to deal with EMBL, GenBank and PIR feature tables. It was therefore a natural choice to extend these to cope with other output data. All of the standard sequence feature tables (Section A.2, “Supported Feature Formats”) are also report formats. These formats include:

EMBL
GenBank
GFF
PIR
SwissProt

There are other formats that cater for more than standard sequence information. Each of these is named after the program that first generated that style of output, for example "dbmotif", "diffseq" and so on. The output options are very flexible. For descriptions and examples of the supported formats see Section A.4, “Supported Report Formats”.

The supported report formats are summarised in the table below. The columns are as follows: Output Format (format name), Nuc ("Yes" indicates nucleotide sequence data may be represented), Pro ("Yes" indicates protein sequence data may be represented), Header (whether the standard EMBOSS report header is included), Seq (whether the sequence corresponding to the features is included), Tags (number of specific tag-values reported; a non-zero value suggests a format is not suitable for application output that does not generate these specific tags) and Description (short description of the format).

Table 5.6. Report formats
Output Format	Nuc	Pro	Header	Seq	Tags	Description
dasgff	Yes	Yes	No	No	0	DAS GFF feature format
dbmotif	Yes	Yes	Yes	Yes	0	Motif database hits
debug	Yes	Yes	Yes	Yes	0	Debugging trace of full internal data content
diffseq	Yes	Yes	Yes	Yes	7	Differences between a pair of sequences
embl	Yes	No	No	No	0	EMBL feature format
excel	Yes	Yes	No	No	0	Tab-delimited file for import to Microsoft Excel
feattable	Yes	Yes	No	No	0	EMBL format feature table with internal tags
genbank	Yes	No	No	No	0	Genbank feature format
gff	Yes	Yes	No	No	0	GFF feature format
listfile	Yes	Yes	Yes	No	0	EMBOSS list file of sequence USAs with ranges
motif	Yes	Yes	Yes	Yes	0	Motif report
nametable	Yes	Yes	Yes	No	0	Simple table with sequence name
pir	No	Yes	No	No	0	PIR feature format
regions	Yes	Yes	Yes	No	0	Annotated sequence regions
seqtable	Yes	Yes	Yes	Yes	0	Simple table with sequence on each line
srs	Yes	Yes	Yes	No	0	Simple report format for SRS
swiss	No	Yes	No	No	0	Swissprot feature format
table	Yes	Yes	Yes	No	0	Simple table
tagseq	Yes	Yes	Yes	No	0	Sequence with features marked below
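
The columns of the table can be used to shortlist a format for a given task. The following is a minimal sketch in Python, not part of EMBOSS: the names ReportFormat, FORMATS and suitable_formats are hypothetical, and only a subset of the rows above is copied in. It keeps formats that can represent the sequence type, that require no specific tags, and (optionally) that include the sequence.

```python
from typing import NamedTuple


class ReportFormat(NamedTuple):
    nuc: bool      # "Nuc" column
    pro: bool      # "Pro" column
    header: bool   # "Header" column
    seq: bool      # "Seq" column
    tags: int      # "Tags" column


# A subset of rows copied from the table above (hypothetical container).
FORMATS = {
    "dbmotif":  ReportFormat(True,  True,  True,  True,  0),
    "diffseq":  ReportFormat(True,  True,  True,  True,  7),
    "embl":     ReportFormat(True,  False, False, False, 0),
    "excel":    ReportFormat(True,  True,  False, False, 0),
    "gff":      ReportFormat(True,  True,  False, False, 0),
    "pir":      ReportFormat(False, True,  False, False, 0),
    "seqtable": ReportFormat(True,  True,  True,  True,  0),
    "table":    ReportFormat(True,  True,  True,  False, 0),
}


def suitable_formats(nucleotide, need_sequence=False):
    """Return format names usable for the given sequence type."""
    names = []
    for name, fmt in FORMATS.items():
        # The format must be able to represent this sequence type.
        if not (fmt.nuc if nucleotide else fmt.pro):
            continue
        # A non-zero Tags value means program-specific tags are expected.
        if fmt.tags != 0:
            continue
        if need_sequence and not fmt.seq:
            continue
        names.append(name)
    return names


print(suitable_formats(nucleotide=False, need_sequence=True))
```

Excluding every format with a non-zero Tags value is a conservative choice; a program that does generate those tags (diffseq, for instance) can of course use its own format.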

5.5.3. Inside a Report

A report format defines the permitted layout and content of text in a file. This includes required text tokens and formatting conventions. Typically, the reports are used to describe features, usually short sequence motifs, in a sequence. The motif sequences, sequence identifier codes, sequence position numbers and the corresponding features are given in some form.

All of the report formats, excluding the TAB-delimited (excel) format and those that are also standard sequence feature table formats (EMBL, GenBank etc.), have a block of reference information at the start of the report. This gives the program name, the date the report was generated, the output file name, the ID name of the sequence, the region of the sequence in which features are being reported, and some of the parameters and statistics of the report. For example:

########################################
# Program: garnier
# Rundate: Fri 20 Feb 2009 11:16:24
# Commandline: garnier
#    [-sequence] uniprot:Q62671
#    -rformat seqtable
# Report_format: seqtable
# Report_file: ubr5_rat.garnier
########################################

#=======================================
#
# Sequence: UBR5_RAT     from: 1   to: 920
# HitCount: 214
#
# DCH = 0, DCS = 0
# 
#  Please cite:
#  Garnier, Osguthorpe and Robson (1978) J. Mol. Biol. 120:97-120
# 
#
#=======================================
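
Because the header uses "# Key: value" lines, it is straightforward to read back in a script. The following Python sketch is illustrative only; the function name parse_report_header is hypothetical, and it assumes that every field of interest sits on a single line beginning with "# " followed by a one-word key and a colon, as in the example above.

```python
import re


def parse_report_header(text):
    """Collect 'Key: value' pairs from '#'-prefixed header lines."""
    fields = {}
    for line in text.splitlines():
        # Matches lines such as "# Program: garnier". Continuation lines
        # such as "#    -rformat seqtable" have no "Key:" part and are skipped.
        match = re.match(r"^# (\w+): (.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


sample = """\
########################################
# Program: garnier
# Rundate: Fri 20 Feb 2009 11:16:24
# Commandline: garnier
#    [-sequence] uniprot:Q62671
#    -rformat seqtable
# Report_format: seqtable
# Report_file: ubr5_rat.garnier
########################################
"""

header = parse_report_header(sample)
print(header["Program"], header["Report_format"])
```

Note that the continuation lines of the Commandline field are deliberately dropped here, so only the program name is kept for that key.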

The report data then follows, for example:

  Start     End garnier Sequence
      1      18 H       ARRERMTAREEASLRTLE
     19      20 T       GR
     21      24 H       RRAT
     25      25 C       L
     26      26 E       L
     27      27 C       S
     28      30 H       ARQ
     31      31 C       G
     32      35 E       MMSA
     36      38 T       RGD
     39      48 H       FLNYALSLMR
     49      51 C       SHN
     52      56 T       DEHSD
     57      59 E       VLP
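
Rows like these can be read into simple records. The sketch below is a hedged Python illustration rather than an EMBOSS utility: Hit and parse_seqtable_rows are hypothetical names, and it assumes four whitespace-separated columns as in the garnier seqtable output above (other programs may report different columns).

```python
from typing import NamedTuple


class Hit(NamedTuple):
    start: int
    end: int
    state: str      # value of the program-specific column ("garnier" here)
    sequence: str


def parse_seqtable_rows(text):
    """Parse 'Start End <value> Sequence' rows, ignoring everything else."""
    hits = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, '#' comment lines and the column-title line.
        if len(parts) != 4 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        hits.append(Hit(int(parts[0]), int(parts[1]), parts[2], parts[3]))
    return hits


rows = """\
  Start     End garnier Sequence
      1      18 H       ARRERMTAREEASLRTLE
     19      20 T       GR
     21      24 H       RRAT
"""

hits = parse_seqtable_rows(rows)
for hit in hits:
    # Each reported sequence should span end - start + 1 residues.
    assert len(hit.sequence) == hit.end - hit.start + 1
print(len(hits), "rows parsed")
```

The length check inside the loop is a cheap sanity test that the columns were split as expected.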

The data are taken from the report feature table. Different report formats show different tags from the feature table. The tags that might be reported are as follows:

start: Start position of the match in a sequence
end: End position of the match in a sequence
length: Length of the match in a sequence
name: Name of the sequence
sequence: Sequence of the match
strand: DNA strand in which the feature occurs
tagname: Printable name of the feature
tagvalue: Feature in a sequence
type: Type of a feature

Finally, a block of summary information is given at the end of the report. The contents vary depending on the application. For example:

#---------------------------------------
#
#  Residue totals: H:383   E:154   T:195   C:188
#         percent: H: 42.4 E: 17.0 T: 21.6 C: 20.8
#
#---------------------------------------

#---------------------------------------
# Total_sequences: 1
# Total_length: 920
# Reported_sequences: 1
# Reported_hitcount: 214
#---------------------------------------
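
The summary block can be used to cross-check a report after parsing it. The following Python sketch is an illustration under stated assumptions: parse_report_tail is a hypothetical name, the "Residue totals" line is specific to garnier output, and the "Total_" and "Reported_" keys are taken from the example above.

```python
import re


def parse_report_tail(text):
    """Return (totals, residues) dictionaries from the summary blocks."""
    totals, residues = {}, {}
    for line in text.splitlines():
        body = line.lstrip("#").strip()
        if body.startswith("Residue totals:"):
            # garnier-specific line, e.g. "Residue totals: H:383   E:154 ..."
            counts = body.split(":", 1)[1]
            for state, count in re.findall(r"([A-Z]):\s*(\d+)", counts):
                residues[state] = int(count)
        elif body.startswith(("Total_", "Reported_")):
            key, _, value = body.partition(":")
            totals[key] = int(value)
    return totals, residues


tail = """\
#---------------------------------------
#
#  Residue totals: H:383   E:154   T:195   C:188
#         percent: H: 42.4 E: 17.0 T: 21.6 C: 20.8
#
#---------------------------------------

#---------------------------------------
# Total_sequences: 1
# Total_length: 920
# Reported_sequences: 1
# Reported_hitcount: 214
#---------------------------------------
"""

totals, residues = parse_report_tail(tail)
# The per-state residue counts should add up to the total sequence length.
print(sum(residues.values()) == totals["Total_length"])
```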

5.5.4. Specifying Reports on the Command line

Reports are referred to on the EMBOSS command line exactly like a standard output file, i.e. by naming the report file explicitly.

There is a set of command line qualifiers that are used for report output (see Section 6.4, “Datatype-specific Command Line Qualifiers”). These allow you to set, for example, the name and format of the file and to control what data are written to the report.

For example, to set the output of the garnier program to gff:

garnier -rformat gff
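
When EMBOSS is driven from a script, the same qualifiers can be assembled programmatically. The Python sketch below only builds and prints the command; garnier_command is a hypothetical helper, the -sequence qualifier is taken from the report header shown in Section 5.5.3, and -outfile is assumed to be the name of garnier's report output qualifier. The output file name ubr5_rat.gff is made up for illustration.

```python
import shlex


def garnier_command(sequence, report_format, outfile):
    """Build an argument list for garnier with an explicit report format."""
    return [
        "garnier",
        "-sequence", sequence,
        "-rformat", report_format,   # report format, as in the example above
        "-outfile", outfile,         # assumed name of the report qualifier
    ]


cmd = garnier_command("uniprot:Q62671", "gff", "ubr5_rat.gff")
print(shlex.join(cmd))

# To actually run it (requires an EMBOSS installation on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a file name contains spaces.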

5.5.5. Applications that use Reports

EMBOSS programs that use reports for their output are summarised in the table below.

Applications that use Reports

Application	Description
antigenic	Finds antigenic sites in proteins
dan	Calculates DNA RNA/DNA melting temperature
diffseq	Find differences between nearly identical sequences
digest	Protein proteolytic enzyme or reagent cleavage digest
dreg	Regular expression search of a nucleotide sequence
equicktandem	Finds tandem repeats
etandem	Looks for tandem repeats in a nucleotide sequence
fuzznuc	Nucleic acid pattern search
fuzzpro	Protein pattern search
fuzztran	Protein pattern search after translation
garnier	Predicts protein secondary structure
helixturnhelix	Report nucleic acid binding motifs
marscan	Finds MAR/SAR sites in nucleic sequences
patmatdb	Search a protein sequence with a motif
patmatmotifs	Search a PROSITE motif database with a protein sequence
preg	Regular expression search of a protein sequence
psiphi	Phi and psi torsion angles from protein coordinates
recoder	Remove restriction sites but maintain same translation
restrict	Finds restriction enzyme cleavage sites
sigcleave	Reports protein signal cleavage sites
silent	Silent mutation restriction enzyme scan
sirna	Finds siRNA duplexes in mRNA
tcode	Fickett TESTCODE statistic to identify protein coding DNA
tmap	Displays membrane spanning regions
twofeat	Finds neighbouring pairs of features in sequences
