6.11. Handling Alignments

6.11.1. Introduction

EMBOSS supports most of the common sequence alignment formats for input and output (see the EMBOSS Users Guide for a complete list). Alignments to be read or written are defined in the application ACD file, although alignment objects can also be created directly if required.

Most of the alignment formats can include a standard header (given at the start of the alignment file) and in some cases a tail (given at the end) which gives information including the program, date, output filename, ID names of the sequences and some of the parameters and statistics of the alignment. There is also a subheader and subtail used for additional comments and annotation.

The tail section is used by some applications (e.g. merger in EMBOSS) to report special features of the alignment.

All alignments have certain basic properties:

  • Type of sequences (protein or nucleotide)

  • Alignment format

  • Number of sequences in the alignment

  • Minimum permissible number of sequences

  • Maximum permissible number of sequences

  • Output width

In addition there is often associated data including:

  • An integer or floating point matrix

  • Name of matrix

  • Gap insertion penalty

  • Gap extension penalty

For global alignments the full sequence or just the matching regions can be displayed. Optionally the alignment can include the accession number, sequence description and full USA of the aligned sequences.

The options above are usually set in the application ACD file (via attributes of the data definition) or on the command line (via qualifiers that are specific to alignments). For a description of these attributes and qualifiers see Section A.5, “Datatype-specific Attributes”.

Functions are provided to set these directly in case this is required.

Functions for manipulating alignments are organised into four groups:

  • Writing alignments

  • Retrieving elements of an alignment object

  • Setting elements of an alignment object

  • Miscellaneous functions

6.11.2. AJAX Library Files

AJAX library files for handling alignments are listed in the table (Table 6.17, “AJAX Library Files for Handling Alignments”). Library file documentation, including a complete description of datatypes and functions, is available at:

http://emboss.open-bio.org/rel/dev/libs/

Table 6.17. AJAX Library Files for Handling Alignments

Library File Documentation   Description
ajalign                      Functions for handling sequence alignments
ajseq                        General sequence handling

ajalign.h/c

Defines the main alignment object (AjPAlign). It can be used for retrieving an input sequence alignment via ACD file processing. The header file contains most of the functions you will ever need for general handling of sequence alignments. It includes static datatypes and functions for handling alignments at a low level. You are unlikely to need the latter unless you plan to implement code to support new alignment formats. For advice on how to do this ask the EMBOSS developers.

ajseq.h/c

Defines the AjPSeqset object. This is a set of sequences for general use and is used for handling input alignments from ACD files. ajseq.h/c contain extensive functions for handling sequence sets and thereby for general manipulations of sequence alignments.

6.11.3. ACD Datatypes

Alignment input is handled as a special case of general sequence set input. The seqset ACD datatype is used:

seqset

Read multiple sequences as a single set.

The ACD data definition must include the aligned: attribute (see below).

There is a dedicated datatype for output alignments:

align

Alignment output.

6.11.4. ACD Data Definition

A typical ACD definition for an input alignment:

# multiple sequence input read as a single aligned set
seqset: sequence  
[
    parameter: "Y"
    type:      "protein"
    aligned:   "Y"
]

A typical ACD definition for an output alignment:

align: outfile 
[
    parameter: "Y"
    aformat:   "srspair"
    type:      "protein"
    minseqs:   "2"
    maxseqs:   "2"
    aglobal:   "Y"
]

6.11.4.1. Parameter Name

All data definitions for alignment input and output should have standard parameter names. These include:

  • sequence for any aligned input sequences

  • outfile for alignment output

  • Alternatives and variations (e.g. afile, bfile for multiple outputs) are allowed

For more information see Appendix A, ACD Syntax Reference.

6.11.4.2. Common Attributes

Attributes that are typically specified are summarised below. They are datatype-specific (Section A.5, “Datatype-specific Attributes”) unless they are indicated as being global attributes (Section A.4, “Global Attributes”).

parameter: Alignments are typically the primary input or output of an EMBOSS application and, as such, should be defined as parameters by using the global attribute parameter: "Y".

type: Specifies the type of the sequences in the input or output alignment and is used for validation purposes. See:

http://emboss.open-bio.org/rel/dev/libs/ACDSyntaxSequenceTypes

aligned: Any seqset or seqsetall datatype must have the aligned: attribute set to indicate whether the sequences are aligned or not.

aformat: The output format is normally set at the command line but a default may be hard-coded with aformat:. All common alignment formats are supported (see the EMBOSS Users Guide).

minseqs: Specifies the minimum number of expected sequences and is used for validation of output.

maxseqs: Specifies the maximum number of expected sequences and is used for validation of output.

aglobal: A boolean attribute which is set to "Y" if the output can contain more than one alignment from the same input.

6.11.5. AJAX Datatypes

For handling alignments, including input alignments defined in the ACD file, use:

AjPSeqset

Input alignment.

For handling output alignments defined in the ACD file use:

AjPAlign

Output alignment.

6.11.6. ACD File Handling

Datatypes and functions for handling alignments via the ACD file are shown below (Table 6.18, “Datatypes and Functions for Alignment Input and Output”).

Table 6.18. Datatypes and Functions for Alignment Input and Output
                       To read an alignment   To write an alignment
ACD datatype           seqset                 align
AJAX datatype          AjPSeqset              AjPAlign
To retrieve from ACD   ajAcdGetSeqset         ajAcdGetAlign

Your application code will call embInit to process the ACD file and command line (see Section 6.3, “Handling ACD Files”). All values from the ACD file are read into memory and files are opened as necessary. You have a handle on the files and memory through the ajAcdGet* family of functions which return pointers to appropriate objects.

6.11.6.1. Input Alignment Retrieval

To retrieve an input alignment an object pointer is declared and then initialised using ajAcdGetSeqset:

    AjPSeqset seqset=NULL;

    seqset = ajAcdGetSeqset("sequence");

6.11.6.2. Output Alignment Retrieval

To retrieve an output alignment stream an object pointer is declared and initialised using ajAcdGetAlign:

    AjPAlign outfile=NULL;

    outfile = ajAcdGetAlign("outfile");

6.11.6.3. Memory and File Management

It is your responsibility to close any files and free up memory at the end of the program.

6.11.6.3.1. Closing Output Alignment Files

To close an output alignment stream the AJAX function ajAlignClose is used:

    ajAlignClose (outfile);

6.11.6.3.2. Freeing Memory

You must call the default destructor function (see below) on any objects returned by calls to ajAcdGet*.

Additionally you must call ajAlignExit to clean up internal memory allocated for housekeeping of alignment processing:

    ajAlignExit();

6.11.7. Alignment Object Memory Management

6.11.7.1. Default Object Construction

Alignment output objects are typically loaded from ACD file processing (see above). In the unlikely event that you need to create one manually you can use the default alignment object constructor ajAlignNew. All constructors return the address of a new object. In the following code the pointer does not need to be initialised to NULL but it is good practice to do so:

    AjPAlign       align = NULL;

    align = ajAlignNew();
    /* The object is instantiated and ready for use */

6.11.7.2. Default Object Destruction

You must free the memory for an object once you are finished with it. The default destructor function is:

void         ajAlignDel (AjPAlign* pthys); /* Destructor for Alignment objects */

It is used as follows:

    AjPAlign  align=NULL;

    align = ajAcdGetAlign("align");

    /* Do something with alignment */

    ajAlignDel(&align);
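
The destructor takes the address of the pointer rather than the pointer itself, which lets it reset the caller's handle once the memory has been released. The following is a minimal sketch of that pattern in plain C. It is not AJAX code: the Thing type and the thing_new and thing_del functions are hypothetical and exist only for illustration.

```c
#include <stdlib.h>

/* Hypothetical object type, declared in the same style as AjPAlign */
typedef struct ThingS
{
    int width;
} *Thing;

/* Hypothetical constructor: returns the address of a new, zeroed object */
Thing thing_new(void)
{
    Thing t = calloc(1, sizeof(*t));

    return t;
}

/*
** Hypothetical destructor: frees the object and resets the caller's
** pointer to NULL. Safe to call on a NULL handle.
*/
void thing_del(Thing *pthis)
{
    if(!pthis || !*pthis)
        return;

    free(*pthis);
    *pthis = NULL;
}
```

After thing_del(&t) the handle t is NULL, so an accidental second call does nothing rather than freeing the same memory twice.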

6.11.7.3. Alternative Object Construction and Loading

Applications that create alignment outputs usually generate aligned sequences which are then used to populate the alignment object. The following functions are available:

/* Defines a sequence set as an alignment. The sequences are stored internally and may be edited by alignment processing. */
AjBool  ajAlignDefine (AjPAlign pthys, AjPSeqset seqset);    

/* Defines a sequence pair as an alignment. The sequences are stored internally and may be edited by alignment processing. */
AjBool  ajAlignDefineSS (AjPAlign pthys,                                     
                         AjPSeq seqa, AjPSeq seqb); 

/* Defines a pair of char* strings as an alignment (names of sequences are also required) */
AjBool  ajAlignDefineCC (AjPAlign pthys,                        
                         const char* seqa, const char* seqb,                
                         const char* namea,const  char* nameb);

6.11.8. Writing Alignments

There are several AJAX functions for writing out alignment information. Applications will usually create an alignment object through ACD processing, populate it with aligned sequences (see above) and call ajAlignWrite.

/* Writes an alignment file */
void  ajAlignWrite (AjPAlign thys);                  

/* Reset to allow reuse of Alignment objects */
void  ajAlignReset (AjPAlign thys);                  

/* Opens a new align file. Called by ACD processing */
AjBool  ajAlignOpen (AjPAlign thys, const AjPStr name);

/* Writes an alignment header. Called by ajAlignWrite */
void  ajAlignWriteHeader (AjPAlign thys);            

/* Writes an alignment tail. Called by ajAlignWrite */
void  ajAlignWriteTail (AjPAlign thys);              

/* Sets the default format for an alignment to 'gff' if not already defined */
AjBool  ajAlignFormatDefault (AjPStr* pformat);

6.11.9. Retrieving Elements of an Alignment Object

Alignment object elements rarely need to be examined by the programmer. Functions are available to retrieve internal values.

/* Returns the length of an alignment. If the alignment has more than one subalignment, returns the total. */
ajint  ajAlignGetLen (const AjPAlign thys);                          
  
/* Returns the filename */
const char*  ajAlignGetFilename (const AjPAlign thys);       
                
/* Returns the alignment format */
const AjPStr  ajAlignGetFormat (const AjPAlign thys);

6.11.10. Setting Elements of an Alignment Object

Alignment objects have elements that are used to populate the header and tail sections of the output (where the output format can include such extra detail).

Elements that can be set in the alignment header include:

  • Gap penalties

  • Matrix name

  • Alignment score

Functions for this are below. Note the matrix name can be set directly or from a matrix object:

/* Setting elements of the alignment header */
void  ajAlignSetGapI (AjPAlign thys, ajint gappen, ajint extpen);
void  ajAlignSetGapR (AjPAlign thys, float gappen, float extpen);
void  ajAlignSetMatrixName (AjPAlign thys, const AjPStr matrix);
void  ajAlignSetMatrixNameC (AjPAlign thys, const char* matrix);
void  ajAlignSetMatrixInt (AjPAlign thys, AjPMatrix matrix);
void  ajAlignSetMatrixFloat (AjPAlign thys, AjPMatrixf matrix);
void  ajAlignSetScoreI (AjPAlign thys, ajint score);
void  ajAlignSetScoreL (AjPAlign thys, ajlong score);
void  ajAlignSetScoreR (AjPAlign thys, float score);

The standard properties in the alignment subheader are:

  • Length

  • Identity

  • Gaps

  • Similarity

  • Score

The function to set these is:

void  ajAlignSetSubStandard (AjPAlign thys, ajint iali);

Alternatively, you may set these manually using:

void  ajAlignSetStats (AjPAlign thys, ajint iali, ajint len,
                       ajint ident, ajint sim, ajint gaps,
                       const AjPStr score);
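
The len, ident and gaps values are counts over the columns of one alignment. As a rough illustration of what they mean, the sketch below counts them for two aligned strings. It is not the AJAX implementation: PairStats, is_gap and pair_stats are hypothetical names, treating '-' and '.' as gap characters is an assumption, and similarity is left out because it depends on a scoring matrix.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper type: column counts for one pairwise alignment */
typedef struct
{
    int len;    /* number of alignment columns */
    int ident;  /* columns where both residues are identical */
    int gaps;   /* columns where either sequence has a gap */
} PairStats;

/* Assumption: '-' and '.' are treated as gap characters */
static int is_gap(char c)
{
    return c == '-' || c == '.';
}

/*
** Count length, identities and gaps for two aligned strings.
** Returns 1 on success, 0 if the strings differ in length.
*/
int pair_stats(const char *a, const char *b, PairStats *out)
{
    size_t i;

    out->len = out->ident = out->gaps = 0;

    for(i = 0; a[i] != '\0' && b[i] != '\0'; i++)
    {
        out->len++;

        if(is_gap(a[i]) || is_gap(b[i]))
            out->gaps++;
        else if(toupper((unsigned char) a[i]) ==
                toupper((unsigned char) b[i]))
            out->ident++;
    }

    /* Both strings must end at the same column to be a valid alignment */
    return a[i] == '\0' && b[i] == '\0';
}
```

For the aligned pair "AC-GT" and "ACTGA" this gives a length of 5, 3 identities and 1 gap column.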

The header section can include an optional comment. Similarly, the tail section is free text available to report any special notes on the alignment. Comments in the (sub)header and (sub)tail can be set, or appended or prepended to the existing text, using the following functions:

/* Setting comments of the alignment (sub)header and (sub)tail */
void  ajAlignSetHeader (AjPAlign thys, const AjPStr header);
void  ajAlignSetHeaderApp (AjPAlign thys, const AjPStr header);
void  ajAlignSetHeaderC (AjPAlign thys, const char* header);
void  ajAlignSetSubHeader (AjPAlign thys, const AjPStr subheader);
void  ajAlignSetSubHeaderApp (AjPAlign thys, const AjPStr subheader);
void  ajAlignSetSubHeaderC (AjPAlign thys, const char* subheader);
void  ajAlignSetSubHeaderPre (AjPAlign thys, const AjPStr subheader);
void  ajAlignSetSubTail(AjPAlign thys, const AjPStr tail);
void  ajAlignSetSubTailC(AjPAlign thys, const char* tail);
void  ajAlignSetSubTailApp(AjPAlign thys, const AjPStr tail);
void  ajAlignSetTail (AjPAlign thys, const AjPStr tail);
void  ajAlignSetTailApp (AjPAlign thys, const AjPStr tail);
void  ajAlignSetTailC (AjPAlign thys, const char* tail);

The alignment type (protein or nucleic) may be set directly (if it is not set already):

void  ajAlignSetType (AjPAlign thys);

A range (or sub-ranges) of sequences to output can be set using:

/* Setting sequence range to output */
AjBool  ajAlignSetRange (AjPAlign thys,
                         ajint start1, ajint end1,
                         ajint len1, ajint off1,
                         ajint start2, ajint end2,
                         ajint len2, ajint off2);

AjBool  ajAlignSetSubRange (AjPAlign thys,
                            ajint substart1, ajint start1,
                            ajint end1, AjBool rev1, ajint len1,
                            ajint substart2, ajint start2,
                            ajint end2, AjBool rev2, ajint len2);

Finally, there is a function to set the alignment object to use external sequence references, which are references (copied pointers) rather than clones of the actual sequences:

/* Setting properties of alignment object */
void  ajAlignSetExternal (AjPAlign thys, AjBool external);

This is intended for alignments of large sequences where it is undesirable to keep many copies.
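
The trade-off between copied pointers and clones can be sketched in plain C. The example below is not taken from the AJAX source: PairHolder, copy_string, holder_define and holder_clear are hypothetical names used only to show the difference in ownership between the two modes.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an object that holds two sequences */
typedef struct
{
    int external;         /* non-zero: sequences are borrowed, not owned */
    const char *seqs[2];
} PairHolder;

/* Duplicate a string on the heap; returns NULL if allocation fails */
static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);

    if(copy)
        memcpy(copy, s, n);

    return copy;
}

/* Store two sequences either by reference (external) or as clones */
int holder_define(PairHolder *h, int external,
                  const char *a, const char *b)
{
    h->external = external;

    if(external)
    {
        /* Copied pointers: no extra memory, caller keeps ownership */
        h->seqs[0] = a;
        h->seqs[1] = b;
        return 1;
    }

    /* Clones: independent copies owned by the holder */
    h->seqs[0] = copy_string(a);
    h->seqs[1] = copy_string(b);

    if(!h->seqs[0] || !h->seqs[1])
    {
        free((void *) h->seqs[0]);
        free((void *) h->seqs[1]);
        h->seqs[0] = h->seqs[1] = NULL;
        return 0;
    }

    return 1;
}

/* Release only what the holder owns */
void holder_clear(PairHolder *h)
{
    if(!h->external)
    {
        free((void *) h->seqs[0]);
        free((void *) h->seqs[1]);
    }

    h->seqs[0] = h->seqs[1] = NULL;
}
```

In external mode the caller's sequences must outlive the holder and any change to them is visible through it, which is the price paid for avoiding the extra copies.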

6.11.11. Miscellaneous Functions

The ajAlignConsStats function calculates a consensus sequence and statistics (percent identity and similarity and alignment length) for a multiple alignment. ajAlignFindFormat is used in ACD processing to match the specified alignment format to the internal list of known formats.

AjBool  ajAlignConsStats (const AjPSeqset thys, AjPMatrix mymatrix,
                          AjPStr *cons, ajint* retident, 
                          ajint* retsim, ajint* retgap,
                          ajint* retlen);

AjBool  ajAlignFindFormat (const AjPStr format, ajint* iformat);
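
To give a feel for what a column-wise consensus is, the sketch below builds one by simple majority vote. It is deliberately simpler than ajAlignConsStats and does not reproduce it: no scoring matrix is used, the function name majority_consensus is hypothetical, and writing 'x' for columns without a strict majority is an assumption made for this example.

```c
#include <stddef.h>
#include <string.h>

/*
** Hypothetical helper: majority-vote consensus of nseqs aligned strings,
** each of length len. Writes len characters plus a terminating '\0'
** to cons. Columns where no character occurs in more than half of the
** sequences are written as 'x'. Gaps ('-') are counted like any other
** character.
*/
void majority_consensus(const char *const *seqs, size_t nseqs,
                        size_t len, char *cons)
{
    size_t col;
    size_t i;

    for(col = 0; col < len; col++)
    {
        size_t counts[256];
        unsigned char best = 'x';
        size_t bestcount = 0;

        memset(counts, 0, sizeof counts);

        for(i = 0; i < nseqs; i++)
        {
            unsigned char c = (unsigned char) seqs[i][col];

            counts[c]++;

            if(counts[c] > bestcount)
            {
                bestcount = counts[c];
                best = c;
            }
        }

        /* Strict majority required, otherwise mark the column as unknown */
        cons[col] = (2 * bestcount > nseqs) ? (char) best : 'x';
    }

    cons[len] = '\0';
}
```

For the three aligned strings "ACGT-", "ACCTA" and "AGGTC" the result is "ACGTx": the first four columns each have a majority character and the last column has none.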