5.1. Introduction

Many EMBOSS applications have functionality in common. For instance, the majority read or write molecular sequences, which may have features associated with them or may be aligned together in a file. Software and database providers have tended to define their own file formats, which has led to the large number of different sequence, feature and alignment formats in common use. This can be confusing, but EMBOSS makes life easier by supporting these formats in a consistent way across all applications in the package. The applications use the same formats and are configured in a similar manner, maintaining a consistent look and feel.

EMBOSS automatically recognises most common formats on input. Most formats can be generated as output too. This makes the transition from, and interoperation with, other sequence analysis packages easy. If, for example, your sequences are in GCG format, you will have no problem reading them in. If, however, your sequence is not in a recognised standard format, you will not be able to analyse your sequence easily with EMBOSS (or anything else); you will first need to convert it to a supported format.

When reading a sequence, EMBOSS deduces the format by trying each supported format in turn until one succeeds. A few formats are not tested automatically by default but are available on request. For efficiency, or in the few cases where automatic detection is not possible, you can specify the format of an input file directly. When writing a sequence, EMBOSS uses FASTA format by default unless you specify a different one. EMBOSS uses the powerful and flexible Uniform Sequence Address (USA) mechanism (Section 6.6, “The Uniform Sequence Address (USA)”) for specifying sequences. The USA allows you to completely specify a data source, one or more sequences and sequence-associated data (including file format) on the command line.
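The "try each format until one succeeds" strategy described above can be illustrated with a minimal Python sketch. It is not EMBOSS code: the two toy parsers, the PARSERS list and the read_sequence function are all hypothetical names made up for illustration, and real format recognition is considerably more involved.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy parsers (not EMBOSS code). Each returns the sequence
# string on success, or None if the text is not in that format.

def parse_fasta(text: str) -> Optional[str]:
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        return None
    # Everything after the header line is sequence data.
    return "".join(line.strip() for line in lines[1:])

def parse_plain(text: str) -> Optional[str]:
    # Plain text: letters only, whitespace ignored.
    seq = "".join(text.split())
    return seq if seq.isalpha() else None

# Order matters: formats are tried in turn until one succeeds.
PARSERS: List[Tuple[str, Callable[[str], Optional[str]]]] = [
    ("fasta", parse_fasta),
    ("plain", parse_plain),
]

def read_sequence(text: str, fmt: Optional[str] = None) -> Tuple[str, str]:
    """Return (format_name, sequence).

    If fmt is given, only that parser is used (the "specify the format
    directly" case); otherwise every parser is tried in order.
    """
    candidates = PARSERS if fmt is None else [p for p in PARSERS if p[0] == fmt]
    for name, parser in candidates:
        seq = parser(text)
        if seq is not None:
            return name, seq
    raise ValueError("sequence is not in a recognised format")
```

Passing fmt skips the trial-and-error loop, which mirrors the efficiency argument made above: when the format is known in advance, only one parser needs to run.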

When reading or writing features associated with a sequence, a standard set of feature formats is used. Features can be written either in a feature table attached to the sequence data in a file in a standard sequence format, or as a raw feature table, i.e. one without the associated sequence. EMBOSS uses the Uniform Feature Object (UFO) mechanism (Section 6.7, “The Uniform Feature Object (UFO)”) for specifying features. Like the USA, the UFO allows you to completely specify a data source and format on the command line.
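To show how a single command-line string can carry both a format and a data source, here is a small Python sketch. It assumes only the simplest form, in which an optional format name is separated from the source by a double colon ("format::source"). The full USA and UFO syntaxes are described in the sections referenced above and cover more cases than this. The function name and the file names in the comments are hypothetical.

```python
from typing import Optional, Tuple

def split_format_prefix(spec: str) -> Tuple[Optional[str], str]:
    """Split 'format::source' into (format, source).

    Assumption: only the simplest 'format::source' form is handled.
    If no '::' is present, the format is None, meaning it is left to
    automatic detection or to the default.
    """
    fmt, sep, rest = spec.partition("::")
    if sep:
        return fmt, rest
    return None, spec

# Hypothetical examples:
#   split_format_prefix("gff::features.gff")  gives ("gff", "features.gff")
#   split_format_prefix("features.gff")       gives (None, "features.gff")
```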

When writing an alignment of two or more sequences, EMBOSS uses a standard set of alignment formats. These include all the formats that are in popular use and some variants on these.

EMBOSS is moving towards using a set of standard formats for all application input and output files, which will improve the convenience and interoperability of the applications. You can specify the report format to use on the command line.