6.7. Handling Sequences

6.7. Handling Sequences
Prev	Chapter 6. Programming with AJAX	Next

6.7.1. Introduction

There are a wide variety of sequence formats currently in use. They define the permitted layout and content of text in a file, including required text tokens that are used to define the fields in the databank. These fields include:

The sequence identifier name
Accession number
Bibliographic information
Sequence annotation including a description of any features
The sequence itself
Molecular features associated with the sequence, such as protein secondary structure or molecular recognition sites, are kept in a feature table.

The bibliographic information includes data such as:

Literature references
Experimental details
Author contact information
Cross-links to other databases
Descriptions, annotations and comments around the sequence
Much more besides!

An entry in a sequence databank will usually include a code to uniquely identify an entry in the database. The code is often an accession number and/or identifier name (ID). An accession number is a unique alphanumeric identifier guaranteed to stay with that sequence for the life of the database. If two sequences are merged into one then the new sequence will get a new accession number; the accession numbers of the merged sequences will be retained as 'secondary' accession numbers. EMBL, GenBank and SwissProt share an accession numbering scheme, so a given accession number uniquely identifies a sequence within these three databases. ID names, in contrast, are not guaranteed to remain the same between different versions of a database, although in practice they usually do.

FASTA is the most commonly used format, followed by those used in the major sequence databases. EMBOSS recognises nearly all of the known formats for input, and most can be used for output too. Sequences are referred to on the EMBOSS command line by their Uniform Sequence Address or USA (see the EMBOSS Users Guide). This is a standard sequence naming scheme used by all EMBOSS applications. A USA specifies one or more sequences that might be read from or written to a file or a larger databank. Other sequence sources such as applications or a web server can also be specified.

There is also a set of command line qualifiers that are used for sequence input and output. They allow you to set such things as file format, sequence regions, database and entry names etc. Such qualifiers (Section A.5, “Datatype-specific Attributes”) may also be "hard-coded" as attributes in the ACD file.

As you would expect, AJAX includes very well developed functionality for handling sequences. This includes:

Recovering or setting elements in the sequence objects.
Testing sequence properties, for instance whether a sequence is protein or nucleotide.
Calculating sequence properties, for example base composition of a nucleotide sequence or the molecular weight of a protein sequence.
Interconversion of sequences and strings.
Processing sequences, for instance trimming it to defined regions or removing gap characters.
Sequence type validation; whether a sequence is protein or nucleic acid, contains gaps and so on. This is usually handled at the level of the ACD file but can be done directly.
Conversion of sequence to numerical form for convenient processing, for example as an index for a scoring matrix.
Translation of a genetic sequence into a protein.
Interconversion of nucleotide bases and amino acid codes.
Calculating the properties of nucleotide sequences, for example DNA melting and annealing temperatures and energies.

Application input and output sequences are typically defined in the ACD file and retrieved by appropriate calls to ajAcdGet* functions. There are ACD datatypes for input of single sequences, sequential access to multiple sequences or multiple sequences in one or more sets. There are corresponding datatypes for sequence output. Sequence objects can also be created directly if that's required. Sequences are held internally as strings and required memory is dynamically reallocated as needed. This means that there are no arbitrary limits when using a function with a sequence object.

6.7.2. AJAX Library Files

AJAX library files for handling sequences are listed in the table (Table 6.9, “AJAX Library Files for Handling Sequences”). Library file documentation, including a complete description of datatypes and functions, is available at:

http://emboss.open-bio.org/rel/dev/libs/

Table 6.9. AJAX Library Files for Handling Sequences
Library File Documentation	Description
ajseq	General sequence handling
ajseqdata	Sequence datatypes
ajseqread	Sequence reading
ajseqwrite	Sequence writing
ajseqdb	Sequence database access methods
ajseqabi	Sequence parsing from ABI trace file
ajseqtype	Validation and processing for sequences types
ajtranslate	Sequence translation
ajbase	Base conversion functions
ajdan	Nucleic acid functions

ajseq.h/c. Most of the functions you will ever need for general sequence handling. They define the AjPSeqCvt object used for sequence conversion (Section 6.7.14, “Sequence Conversion”.

ajseqdata.h/c. Define the basic sequence objects (AjPSeq, AjPSeqset and AjPSeqall) for general use e.g. retrieving sequences via ACD file processing. They also define a sequence input object (AjPSeqin) and a sequence query object (AjPSeqQuery) used for low level sequence retrieval.

ajseqread.h/c. Functions for handling the sequence query and sequence input objects and a few general sequence handling functions. They also define static data structures and functions for handling sequence input from the supported formats at a low level.

ajseqwrite.h/c. Defines the sequence output object (AjPSeqout) and include functions for writing sequences to file. They define static data structures and functions for handling sequence output, in the supported formats, at a low level.

ajseqdb.h/c. Defines a sequence database reading object (SeqSAccess) used in the definition of AjPSeqin and static data structures and functions for handling all aspects of sequence database access.

ajseqabi.h/c. Provides functions for parsing sequences (and other data) from ABI trace files. See the library documentation () for further information.

ajseqtype.h/c. Functions for handling of sequence types. They also contain static data structures and functions for validation and processing of sequence types at a low level.

ajtranslate.h/c. Defines a sequence translation object (AjPTrn) and include functions for handling sequence translation (Section 6.8, “Handling Sequence Translation”).

ajbase.h/c. Contains functions for character conversion of nucleotide bases and other character conversion functions for use with molecular sequences. For further information see Section 6.7.16, “Handling IUB Base Codes”.

ajdan.h/c. Contains functions for calculating properties of nucleotide sequences, mostly related to DNA melting/annealing. For further information see Section 6.7.17, “DNA Calculations”.

You are unlikely to need the static data structures and functions in the above files unless you plan to implement code to support new sequence formats for EMBOSS. For advice on how to do this ask the EMBOSS developers.

Detailed usage notes for ajseqdata.h, ajseqread.h/c and ajseqwrite.h/c are not given here. For advice on how to use the datatypes and functions ask the EMBOSS developers.

6.7.3. ACD Datatypes

There are four ACD datatypes and corresponding methods for handling sequence input. Often just a single sequence is required. Where multiple sequences are needed they may be read either sequentially, for example from a database, or read and held in memory as a set, for example from a sequence alignment:

sequence: Read a single sequence.
seqall: Read multiple sequences sequentially.
seqset: Read multiple sequences as a single set.
seqsetall: Read multiple sequences as multiple sets.

There are three datatypes for handling sequence output to match their input equivalents. They have identical behaviour :

seqout: Write a single sequence.
seqoutset: Write multiple sequences as a single set.
seqoutall: Write multiple sequences sequentially, one at a time.

6.7.4. ACD Data Definition

Typical ACD definitions for sequence input:

# single input sequence
sequence: sequence  
[
    parameter: "Y"
    type:      "gapany"
]

# multiple sequence input read sequentially 
seqall: sequence  
[
    parameter: "Y"
    type:      "gapany"
]

# multiple sequence input read as a single set
seqset: sequence  
[
    parameter: "Y"
    type:      "gapany"
    aligned:   "N"
]

# multiple sequence input read as multiple sets
seqsetall: sequence  
[
    parameter: "Y"
    type:      "gapany"
    aligned:   "N"
]

Typical ACD definitions for sequence output:

# single sequence
seqout: outseq 
[
    parameter: "Y"
    type:      "gapany"
]

# multiple sequence output (sequential writing)
seqoutall: outseq 
[
    parameter: "Y"
    type:      "gapany"
]

# multiple sequence output (written as a set)
seqoutset: outseq 
[
    parameter: "Y"
    type:      "gapany"
]

6.7.4.1. Parameter Name

All data definitions for sequence input and output should have standard parameter names. These include:

sequence for any sequence inputs
outseq for any outputs
Alternatives and variations (e.g. asequence and bsequence for multiple inputs are allowed.

See Appendix A, ACD Syntax Reference.

6.7.4.2. Common Attributes

Attributes that are typically specified are summarised below. They are datatype-specific (Section A.5, “Datatype-specific Attributes”) unless they are indicated as being global attributes (Section A.4, “Global Attributes”).

parameter: Sequences are typically the primary input or output of an EMBOSS application and, as such, should be defined as parameters by using the global attribute parameter: "Y".

type: Specifies the type of the input or output sequence(s) for validation purposes. See Section A.7, “Sequence Types”.

aligned: Any seqset or seqsetall datatype must have the aligned: attribute set to indicate whether the sequences are aligned or not.

features: Sequence features (Section 6.9, “Handling Features”) will (if available) be read or written if the features: ACD attribute is set for the input or output sequence(s).

osformat: FASTA format is used by default for the output sequence(s). The format is normally set at the command line but a default may be hard-coded with osformat:. All common sequence formats are supported (see the EMBOSS Users Guide).

6.7.5. AJAX Datatypes

There are 9 AJAX datatypes for handling sequence. Four of these are for general use including handling sequences defined in the ACD file:

AjPSeq: A single sequence (for sequence ACD datatype).
AjPSeqall: Sequence input stream for sequential access to multiple sequences (for seqall ACD datatype).
AjPSeqset: Multiple sequences in one or more sets (for seqset and seqsetall ACD datatypes).
AjPSeqout: Sequence output stream (for seqout ACD datatype).

Two are for sequence-related operations:

AjPSeqCvt: Used for sequence conversion into numerical form.
AjPTrn: Holds the Genetic Code specification and information needed to translate a nucleotide sequence and find initiation sites.

The rest are for lower level sequence handling beyond that provided by the static datatypes in the various library files:

AjPSeqin: Holds the sequence USA specification and information needed to read the sequence and possible further sequences.
SeqSAccess: Holds information needed to read a sequence from a database: access methods are defined for each known database type and sequences are read from the database using the defined database access function. It is used in the definition of AjPSeqin.
AjPSeqQuery: Used for interpreting the entry specification part of a USA (see the EMBOSS Users Guide).

You are unlikely to need these specialised objects unless you plan to implement code to support new sequence formats for EMBOSS. For advice on how to do this ask the EMBOSS developers.

6.7.6. ACD File Handling

Datatypes and functions for handling sequences via the ACD file are shown below (Table 6.10, “Datatypes and Functions for Sequence Input and Output”).

Table 6.10. Datatypes and Functions for Sequence Input and Output
Sequence Input
	To read a single sequence	To read a set of sequences (e.g. an alignment)	To read sequences sequentially from a database
ACD datatype	`sequence`	`seqall`	`seqset`
AJAX datatype	`AjPSeq`	`AjPSeqall`	`AjPSeqset`
To retrieve from ACD	`ajAcdGetSeq`	`ajAcdGetSeqall`	`ajAcdGetSeqset`
To get individual sequences	`(ajAcdGetSeq)`	`ajSeqallNext`	`ajSeqsetGetSize, ajSeqsetGetseqSeq`
Sequence Output
	To write a single sequence	To write sequences to a database	To write a set of sequences (e.g. an alignment)
ACD datatype	`seqout`	`seqoutall`	`seqoutset`
AJAX datatype	`AjPSeqout`	`AjPSeqout`	`AjPSeqout`
To open the output stream	`ajAcdGetSeqout`	`ajAcdGetSeqoutall`	`ajAcdGetSeqoutset`
To write sequences to file	`ajSeqoutWriteSeq`	`ajSeqoutWriteSeq`	`ajSeqoutWriteSet`

Your application code will call embInit to process the ACD file and command line (see Section 6.3, “Handling ACD Files”). All values from the ACD file are read into memory and files are opened as necessary. You have a handle on the files and memory through the ajAcdGet* family of functions which return pointers to appropriate objects.

6.7.6.1. Input Sequence Retrieval

To retrieve an input sequence an object pointer is declared and initialised using the appropriate ajAcdGet* function.

6.7.6.1.1. Single input sequence

    AjPSeq seq = NULL;

    seq = ajAcdGetSeq("sequence");

6.7.6.1.2. Multiple sequential input sequences

A seqall object is really a means to an end of returning individual sequences in a loop. ajSeqallNext is used to iterate through the sequences:

    AjPSeqall seqall = NULL;
    AjPSeq    seq    = NULL;

    seqall = ajAcdGetSeqall("sequence");

    while(ajSeqallNext(seqall, &seq))
    {
        /* Do something with 'seq' */
    }

6.7.6.1.3. Multiple input sequences in a set

    AjPSeqset seqset = NULL;

    seqset = ajAcdGetSeqset("sequence");

6.7.6.1.4. Multiple input sequence in multiple sets

ajAcdGetSeqsetall is used which returns a pointer for an array of seqset objects:

    AjPSeqset *seqset = NULL;

    seqset = ajAcdGetSeqsetall("sequence");

Of course, to access an individual set the array notation is used. Code to achieve the same effect as ajAcdGetSeqsetallSingle is as follows:

    AjPSeqset *seqset = NULL;
    AjPSeqset seqsetFirst=NULL;

    seqset = ajAcdGetSeqsetall("sequence");
    seqsetFirst = seqset[0];  /* Picks up the pointer to the first sequence set. */

6.7.6.2. Output Sequence Stream Retrieval

To retrieve an output sequence stream an object pointer is declared and initialised using the appropriate ajAcdGet* function. Note that multiple sequence sets are written with using ajAcdGetSeqoutset

6.7.6.2.1. Single output sequence

    AjPSeqout seqout = NULL;

    seqout = ajAcdGetSeqout("outseq");

6.7.6.2.2. Multiple output sequences

    AjPSeqout seqout = NULL;

    seqout = ajAcdGetSeqoutall("outseq");

6.7.6.2.3. Output sequence set

    AjPSeqout seqout = NULL;

    seqout = ajAcdGetSeqoutset("outseq");

6.7.6.2.4. Multiple output sequence sets

    AjPSeqout seqout = NULL;

    seqout = ajAcdGetSeqoutset("outseq");

6.7.6.3. Alternative ACD Retrieval Functions

In cases where just the first of multiple sets is required ajAcdGetSeqsetallSingle is used to return a single seqset object:

AjPSeqset     ajAcdGetSeqsetallSingle (const char *token);

It is used as follows:

    AjPSeqset seqset = NULL;

    seqset = ajAcdGetSeqsetallSingle("sequence");

The memory for any additional sequence sets is automatically cleaned when embExit is called.

6.7.6.4. Processing Command line Options and ACD Attributes

6.7.6.4.1. Setting the Sequence Range

All sequence input datatypes have various inbuilt command line qualifiers (see the EMBOSS Users Guide) including -sbegin and -send which specify a start and end position for a sequence.

Whenever a sequence is read these values are held in the appropriate sequence object. Regardless of their value you still get the entire sequence loaded into memory. The function ajSeqTrim (or ajSeqsetTrim for a AjPSeqset object) is used to trim the sequence to the region defined by -sbegin and -send:

    AjPSeq seq = NULL;

    seq = ajAcdGetSeq("sequence");

    ajSeqTrim(seq);

6.7.6.4.2. Setting the Output Format

There is a function for setting the output format although this is typically specified by the user and handled during ACD file processing:

AjBool  ajSeqoutSetFormatS (AjPSeqout thys, const AjPStr format);

6.7.6.5. Writing Sequences

There are two AJAX functions for writing out sequence information.

void  ajSeqoutWriteSeq (AjPSeqout outseq, const AjPSeq seq);     
void  ajSeqoutWriteSet (AjPSeqout seqout, const AjPSeqset seq);

Note

An AjPSeq object is used to write sequences from a seqall object.

6.7.6.5.1. Writing a sequence

    AjPSeq seq = NULL;
    AjPSeqout seqout = NULL;

    seq    = ajAcdGetSeq("sequence");
    seqout = ajAcdGetSeqout("outseq");

    ajSeqoutWriteSeq(seqout, seq);

6.7.6.5.2. Writing sequences sequentially

Typically ajSeqallNext is used to iterate through the sequences in loop and pick up a pointer (seqPtr in this case) to the next sequence for writing:

    AjPSeqall seqall = NULL;
    AjPSeqout seqout = NULL;
    AjPSeq    seqPtr = NULL;

    seqall = ajAcdGetSeqall("sequence");
    seqout = ajAcdGetSeqout("outseq");

    while(ajSeqallNext(seqall, &seqPtr))
    {
        ajSeqoutWriteSeq(seqout, seqPtr);
    }

6.7.6.5.3. Writing a sequence set

Here the entire sequence set is written to file:

    AjPSeqset seqset = NULL;
    AjPSeqout seqout = NULL;

    seqset = ajAcdGetSeqset("sequence");
    seqout = ajAcdGetSeqout("outseq");

    ajSeqoutWriteSet(seqout, seqset);

6.7.6.5.4. Writing multiple sequence sets

Multiple sequence sets must be written one at a time. Array notation is used to access individual sets:

    AjPSeqset *seqset = NULL;
    AjPSeqout seqout = NULL;
    ajint     i = 0;

    seqset = ajAcdGetSeqsetall("sequence");
    seqout = ajAcdGetSeqout("outseq");

    for(i=0; seqset[i] != NULL; i++)
        ajSeqoutWriteSet(seqout, seqset[i]);

6.7.6.6. Memory and File Management

It is your responsibility to close any files and free up memory at the end of the program.

6.7.6.6.1. Closing Output Sequence Files

To close an output sequence stream use ajSeqoutClose:

    AjPSeqout seqout = NULL;

    seqout = ajAcdGetSeqout("outseq");

    ...

    ajSeqoutClose(seqout);

6.7.6.6.2. Freeing Memory

You must call the default destructor function (see below) on any sequence objects returned by calls to ajAcdGet*.

6.7.7. Sequence Object Memory Management

6.7.7.1. Default Object Construction

To use a sequence object that is not defined in the ACD file you must first instantiate the appropriate object pointer. Three default construction functions are provided, one for each access method:

AjPSeq      ajSeqNew (void);    /* Sequence        */
AjPSeqset   ajSeqsetNew (void); /* Sequence set    */
AjPSeqall   ajSeqallNew (void); /* Sequence stream */

All constructors return the address of a new object. The pointers do not need to be initialised to NULL but it is good practice to do so:

    AjPSeq       seq = NULL;
    AjPSeqset seqset = NULL;
    AjPSeqall seqall = NULL;

    seq    = ajSeqNew();
    seqset = ajSeqsetNew();
    seqall = ajSeqallNew();

    /* The objects are instantiated and ready for use */

Sequence output objects are typically loaded from ACD file processing (see above). In the unlikely event you do need to create a sequence output object manually you can use the default seqout object constructor:

    AjPSeqout       seqout = NULL;

    seqout = ajSeqoutNew();
    /* The object is instantiated and ready for use */

6.7.7.2. Default Object Destruction

There is a sequence object destruction function for each access method, and one for arrays of sequence sets:

void  ajSeqDel (AjPSeq* Pseq);
void  ajSeqallDel (AjPSeqall *Pseq);
void  ajSeqsetDel (AjPSeqset *Pseq);
void  ajSeqsetDelarray (AjPSeqset **PPseq);

They are used as follows:

    AjPSeq     seq = NULL;
    AjPSeqall  seqall = NULL;
    AjPSeqset  seqset = NULL;
    AjPSeqset *seqsets = NULL;


    seq     = ajAcdGetSeq("asequence");
    seqall  = ajAcdGetSeqall("bsequence");
    seqset  = ajAcdGetSeqset("csequence");
    seqsets = ajAcdGetSeqsetall("dsequence");


    ...

    ajSeqDel(&seq);
    ajSeqallDel(&seqall);
    ajSeqsetDel(&seqset);
    ajSeqsetDelarray(&seqsets);

You must free the memory for an object once you are finished with it. Destructor functions are provided:

void  ajSeqDel (AjPSeq* Pseq);       /* Sequence        */
void  ajSeqallDel (AjPSeqall *Pseq); /* Sequence stream */ 
void  ajSeqsetDel (AjPSeqset *Pseq); /* Sequence set    */

For convenience, there is also a destructor for arrays of seqset objects:

/* Destroy an array of sequence set objects and the array. */
void  ajSeqsetDelarray (AjPSeqset **PPseq);

    AjPSeq       seq = NULL;
    AjPSeqset seqset = NULL;
    AjPSeqall seqall = NULL;

    seq    = ajSeqNew();
    seqset = ajSeqsetNew();
    seqall = ajSeqallNew();

    /* Do something with the instantiated objects */

    ajSeqDel(&seq);
    ajSeqsetDel(&seqset);
    ajSeqallDel(&seqall);

    /* The memory is freed and the pointers reset to NULL, ready for re-use. */

    seq    = ajSeqNew();
    seqset = ajSeqsetNew();
    seqall = ajSeqallNew();

    /* Do something else with the new objects.  The pointer variable is reallocated. */

    ajSeqDel(&seq);
    ajSeqsetDel(&seqset);
    ajSeqallDel(&seqall);

    /* Done with the objects so the memory is freed. */

Similarly, for arrays of sequence sets:

    AjPSeqset *seqset = NULL;
    seqset = ajAcdGetSeqsetall("sequence");

    /* Do something with the sequence set array */

    ajSeqsetDelarray(&seqset);

6.7.7.3. Alternative Object Construction and Loading

When managing memory for the sequence objects, knowledge of a function's behaviour is required. There are three cases to discern:

A function requires a pre-existing object.
A function can use but does not require a pre-existing object, and will allocate one if necessary.
A function always allocates an object, and either returns a pointer to it or allocates an object pointer to it, the address of which is passed as an argument.

In most but not all cases it is obvious from the function name whether a function is a constructor or merely loads a sequence. In future code refactoring the functionality will be more obvious in the function names.

6.7.7.3.1. `Single sequence`

There are a variety of alternative ways to create a sequence object. To create a sequence with a reserved size use:

AjPSeq  ajSeqNewRes (size_t size);

A sequence can be created from a string where the name of the sequence is also provided. Other strings are used in order to create the sequence so three objects need to be freed later:

    AjPSeq seq     = NULL;
    AjPStr str     = NULL;
    AjPStr strname = NULL;

    str     = ajStrNewC("ASEQUENCE");
    strname = ajStrNewC("Sequence Name");

    seq = ajSeqNewNameS(str, strname);

    /* Do something with sequence */

    ajStrDel(&str);
    ajStrDel(&strname);
    ajSeqDel(&seq);

Or from a string where the start and end region (given as offsets) are defined. The string is copied into the sequence object and the sequence is trimmed to the region defined:

    AjPSeq seq = NULL;
    AjPStr str   = NULL;
    ajint  offstart = 1;
    ajint  offend   = 8;

    str = ajStrNewC("ASEQUENCE");

    seq = ajSeqNewRangeS(str, offstart, offend, ajFalse);

    ajStrDel(&str);
    ajSeqDel(&seq);

The ajSeqNewRangeS call above will create the subsequence "SEQUENCE". The last function argument is a boolean which, if set true, means that the specified offset values are reversed and the sequence is set to be reversed when printed. Had you called the function like this:

    seq = ajSeqNewRangeS(str, offstart, offend, ajTrue);

then the start of the sequence is set to offend (8), the end set to offstart (1) and the printed (reverse) sequence would be ECNEUQES:

    seq = ajSeqNewRangeS(str, offstart, offend, 1);

A sequence can also be created and initialised from an existing sequence by calling ajSeqNewSeq, which copies the original sequence, so two objects need destroying later:

    AjPSeq seq     = NULL;
    AjPSeq seqCopy = NULL;

    seq = ajAcdGetSeq("sequence");

    seqCopy = ajSeqNewSeq(seq);

    /* Do something with the sequences */

    ajSeqDel(&seq);
    ajSeqDel(&seqCopy);

An array of sequences can be created from an existing sequence set using ajSeqsetGetSeqarray. The array is created and copies of the sequences are taken. This function will be renamed in the future to reflect the fact that it is a constructor function. To ensure that all memory is freed, ajSeqsetGetSize is called to get the size of the array, and the array itself is freed by calling AJFREE:

    AjPSeqset  seqset = NULL;
    AjPSeq *     seqs = NULL;
    ajint        nseq;  /*Number of sequences in set */
    ajint        x;

    seqset=ajAcdGetSeqset("sequences");

    seqs = ajSeqsetGetSeqarray(seqset);

    /* Do something with the array of sequences */

    ajSeqsetDel(&seqset);
    nseq = ajSeqsetGetSize(seqset);

    for(x=0; x<nseq; x++)
        ajSeqDel(&seqs[x]);

    AJFREE(seqs);

6.7.7.3.2. Sequence set

Three functions are provided for loading a pre-existing AjPSeqset object with sequences.

A sequence set can be built from a pair of sequences (in this case from ACD file processing) by using ajSeqsetFromPair. This copies the sequences:

    AjPSeq       seqa = NULL;
    AjPSeq       seqb = NULL;
    AjPSeqset  seqset = NULL;
    ajint        nset = 0; /* Number of sequences in set */

    seqa = ajAcdGetSeq("asequence");
    seqb = ajAcdGetSeq("bsequence");

    seqset = ajSeqsetNew();
    nset   = ajSeqsetFromPair(seqset, seqa, seqb);

    /* Do something with set */

    ajSeqDel(&seqa);
    ajSeqDel(&seqb);
    ajSeqsetDel(&seqset);

Or from a list of sequences by using ajSeqsetFromList, which in contrast to ajSeqsetFromPair loads the set with the actual sequences (not copies). Therefore you have to be careful not to try to free the same objects twice. ajListFree is called to delete the list but not the nodes (sequences). The latter were deleted by ajSeqsetDel:

    AjPSeq       seqa = NULL;
    AjPSeq       seqb = NULL;
    AjPList      list = NULL;
    AjPSeqset  seqset = NULL;
    ajint        nset = 0; /* Number of sequences in set */

    seq1 = ajAcdGetSeq("asequence");
    seq2 = ajAcdGetSeq("bsequence");

    /* Build list of sequences */
    list = ajListNew();
    ajListPush(list, (void *)seq1);
    ajListPush(list, (void *)seq2);

    seqset = ajSeqsetNew();
    nset   =  ajSeqsetFromList(seqset, list);

    /* Do something with set */

    ajSeqsetDel(&seqset);
    ajListFree(&list);

You can append to a sequence set using ajSeqsetApp, which copies the sequences:

    AjPSeq       seqa = NULL;
    AjPSeq       seqb = NULL;
    AjPSeqset  seqset = NULL;
    ajint        nset = 0; /* Number of sequences in set */

    seq1 = ajAcdGetSeq("asequence");
    seq2 = ajAcdGetSeq("bsequence");

    seqset = ajSeqsetNew();
    ajSeqsetApp(seqset, seqa);
    ajSeqsetApp(seqset, seqb);


    /* Do something with set */

    ajSeqDel(&seqa);
    ajSeqDel(&seqb);
    ajSeqsetDel(&seqset);

6.7.7.3.3. Sequence output

An output sequence object can be created from an open file. In this case an output file isn't created during ACD file processing. The file is appropriated by the function so only one destructor needs to be called:

    AjPFile     file = NULL;
    AjPSeqout seqout = NULL;

    /* Open a file called "InputFileName" */
    file =  ajFileNewInNameC ("InputFileName");

    /* Create an output sequence object that uses this file */
    seqout = ajSeqoutNewFile(file);

    /* Do something with the output sequence object */

    ajSeqoutDel(&seqout);

An existing output sequence object can be loaded (and a file opened for writing) by using ajSeqoutOpenFilename which, given a valid filename, returns a boolean which is ajTrue if the file was opened ok:

    AjPSeqout seqout = NULL;
    AjPStr name = NULL;

    /* Create an empty seqout object */
    seqout = ajSeqoutNew();

    ajStrAssignC(&name, "InputFileName");

    if(!ajSeqoutOpenFilename(seqout, name))
        /* Error handling */;

    /* Do something with the output sequence object */

    ajSeqoutDel(&seqout);

6.7.7.3.4. Loading Sequence Objects Using a USA

Functions are provided to load the sequence input objects with one or more sequences, or initialise the output sequence object, given a USA (see the EMBOSS Users Guide) of an input or output sequence(s) or sequence stream.

6.7.7.3.4.1. `Single sequence`

A sequence can be loaded into a sequence object using ajSeqGetFromUsa. The second argument is true if the sequence is protein, and the function returns ajTrue if the sequence was loaded successfully:

    AjPSeq seq = NULL;
    AjPStr usa = NULL;

    usa = ajStrNewC("asis::ALVLLYNITEF");   /* "asis::ALVLLYNITEF" is a valid USA */
    seq = ajSeqNew();  /* Create sequence object */

    if(!ajSeqGetFromUsa(usa, 1, &seq))
        /* Do something to handle the failure */;
    else
        /* Sequence was loaded successfully .. do something with it */;

    ajSeqDel(&seq);

6.7.7.3.4.2. `Sequence set`

A sequence set can be created using ajSeqsetGetFromUsa in the same way as ajSeqGetFromUsa:

    AjPSeq seqset = NULL;
    AjPStr usa = NULL;
    seqset = ajSeqsetNew();  /* Create sequence object */

    usa = ajStrNewC("swiss:hba_*");   /* ""swiss:hba_*"" is a valid USA */

    if(!ajSeqsetGetFromUsa(usa, 1, &seqset))
        /* Do something to handle the failure */;
    else
        /* Sequence set was loaded successfully .. do something with it */;

    ajSeqsetDel(&seqset);

6.7.7.3.4.3. `Sequence stream`

A sequence input stream is created from a USA using ajSeqallFile, which in contrast to the previous two functions returns a pointer to the new object. This function will be renamed in the future to reflect the fact that it is a constructor function:

    AjPSeq seqall = NULL;
    AjPStr usa = NULL;

    usa = ajStrNewC("swiss:hba_*");   /* ""swiss:hba_*"" is a valid USA */

    if(!(seqall=ajSeqallFile(usa)))
        /* Do something to handle the failure */;
    else
        /* Sequence input stream was opened successfully .. do something with it */;

    ajSeqallDel(&seqall);

6.7.7.3.4.4. Sequence output

ajSeqoutClearUsa creates (if necessary) or resets a sequence output object using a new USA:

    AjPSeqout seqout = NULL;
    AjPStr       usa = NULL;

    usa = ajStrNewC("asis::ALVLLYNITEF");   /* "asis::ALVLLYNITEF" is a valid USA */

    seqout = ajSeqoutNew();
    ajSeqoutClearUsa (&seqout, usa);

    ajSeqoutDel(&seqout);

The code would work equally well if the call to the constructor function was omitted.

6.7.8. Getting and Setting Elements

AJAX functions are provided to recover or to set elements (data fields) in the basic sequence objects.

6.7.8.1. Setting Single Values

Functions to assign a value to one element of an existing sequence object have the prefix ajSeqAssign.

In the following example, ajSeqAssignDescC and ajSeqAssignNameS are used to assign a name and description to a region (user-defined by -sbegin and -send values) of the input sequence:

    AjPSeq seq    = NULL;
    AjPSeq seqsub = NULL;

    seq    = ajAcdGetSeq("sequence");
    seqsub = ajSeqNewSeq(seq);

    ajSeqTrim(seqsub);
    ajSeqAssignNameC(seqsub, "Subsequence");
    ajSeqAssignDescC(seqsub, "Subsequence defined by sbegin and send values.");
    ajSeqAssignAccC(seqsub, "YZ456789");

There are assignment functions for other sequence object elements: ajSeqAssignEntryC (entry name), ajSeqAssignFileC (file name), ajSeqAssignFullC (full sequece name), ajSeqAssignGiC (NCBI GI number), ajSeqAssignSeqC (sequence as a string), ajSeqAssignSvC (EMBL sequence version), ajSeqAssignUfoC (uniform feature object string for feature input), ajSeqAssignUsaC (sequence input uniform sequence address, e.g. filename:entryname) and versions using string parameters with the suffix S instead of C.

6.7.8.2. Other Assignment Functions

The ajSeqSet* family are functions are used to set other sequence properties (not single elements). For example ajSeqSetRange sets the range of a sequence object, in this case to the first 10 residues:

    AjPSeq  seq=NULL;
    ajint   start=0;
    ajint   end=9;

    seq    = ajSeqNew();

    ...

    ajSeqSetRange(seq, start, end);

There are set functions for other sequence object elements: ajSeqSetOffsets (sets the offset from the start, and the full length to define the offset at the end), ajSeqSetRangeRev (sets the range as above and sets the sequence to be reversed), ajSeqSetunique (makes a sequence unique so that the sequence characters can be modified without side effects)

"Set" functions are available for the other datatypes too:

    AjPSeqset  seqset=NULL;
    AjPSeqset  seqall=NULL;
    ajint   start = 0;
    ajint   end = 9;

    seqset    = ajSeqNew();
    seqall    = ajSeqNew();

    ...

    ajSeqsetSetRange(seq, start, end);
    ajSeqallSetRange(seq, start, end);

These are the only useful functions at the sequence stream and sequence set level. Other properties are set for each individual sequence in the stream ort set using the sequence "Set" or "Assign" functions.

6.7.8.3. Getting Single Values

The ajSeqGet* family of functions are used to return the contents of a sequence object. The functions return pointers to the object's actual contents so the caller must take care not to unintentionally change the returned value in any way. Here, two flavours of function are used. ajSeqGetDescC returns the sequence description whereas ajSeqGetNameS returns the sequence name. The suffix "S" is used when an EMBOSS string is returned whereas "C" indicates a C-type (char *) string is returned:

    AjPSeq seq     = NULL;
    AjPStr str     = NULL;
    AjPStr strcopy = NULL;
    char * strc    = NULL;

    seq = ajAcdGetSeq("sequence");

    strc = ajSeqGetDescC(seq);
    str  = ajSeqGetNameS(seq);

    /* Do something (but don't change) the returned strings */

    ajStrAssignS(&strcopy, str);

    /* Copy of string is now safe to change ... do something with it */

    ajSeqDel(&seq);
    ajStrDel(&str);

All the common sequence attributes can be retrieved like this, for example the accession and GI numbers using:

const  AjPStr ajSeqGetGiS(const AjPSeq seq)
const AjPStr  ajSeqGetAccS (const AjPSeq seq);

The sequence itself is retrieved using one of:

const char*  ajSeqGetSeqC (const AjPSeq seq);
const AjPStr  ajSeqGetSeqS (const AjPSeq seq);

The function ajSeqGetOffset will return the -sbegin value the user specified, i.e. the offset to the start of the sequence. ajSeqGetOffend will return the number of positions to be removed from the end as defined by the sequence length and the -send value:

    AjPSeq seq = NULL;
    ajint offset;
    ajint offend;

    seq = ajAcdGetSeq("sequence");

    offset = ajSeqGetOffset(seq);
    offend = ajSeqGetOffend(seq);

    /* Do something here using offset and offend, perhaps code to retain the trimmed regions. */

    ajSeqTrim(seq);

"Get" functions are available for the other datatypes. For instance to get the number of sequences and range of a sequence set, or the filename and USA of a sequence input stream:

    AjPSeq    seqset = NULL;
    AjPSeqall seqall = NULL;
    ajint     start;
    ajint     end;
    ajint     nseq;
    AjPStr    name = NULL;
    AjPStr    usa = NULL;

    seqset = ajAcdGetSeqset("sequence");
    seqall = ajAcdGetSeqall("sequences");

    nseq = ajSeqsetGetSize(seqset);
    ajSeqsetGetRange(seqset, &start, &end);

    name = ajSeqallGetFilename(seqall);
    usa  = ajSeqallGetUsa(seqall);

There are "Get" functions that return a property of a sequence in a set. These have the prefix ajSeqsetGetseq. The number of the required sequence is specified. For instance three functions are provided to return a given sequence in a set, either as a sequence object pointer, an EMBOSS string or as a C-type (char *) string. Again, care should be taken not to unwittingly change the object data in any way:

    AjPSeq    seqset = NULL;

    AjPSeq  seqPtr
    AjPStr  seqS;
    char *  seqC;

    seqset = ajAcdGetSeqset("sequence");

    /* First sequence in call cases */
    seqPtr = ajSeqsetGetseqSeq (seqset, 1);    
    seqS   = ajSeqsetGetseqSeqS (seqset, 1);
    seqC   = ajSeqsetGetseqSeqC (seqset, 1);

Functions to return a property of a sequence in a stream have the prefix ajSeqallGetseq and return properties of the current sequence in the stream. For instance, to get the name and length of the current sequence:

    AjPSeq    seqall = NULL;
    AjPStr    name=NULL;
    ajint     len;

    seqall = ajAcdGetSeqall("sequences");

    name = ajSeqallGetseqName (seqall);
    len  = ajSeqallGetseqLen(seqall)

6.7.9. Testing Sequence Properties

Sequence properties can be tested, e.g. for whether a sequence is protein or nucleotide. Such functions return AjBool:

    AjPSeq       seq = NULL;
    AjPSeqset seqset = NULL;

    seq    = ajAcdGetSeq("sequence");
    seqset = ajAcdGetSeq("sequences");

    if(ajSeqIsProt(seq))
        /* Protein sequence */;
    else if(ajSeqIsNuc(seq))
        /* Nucleotide sequence */;
    else
        /* Error handling */;

    if(ajSeqsetIsNuc(seqset)   
    {  
        if(ajSeqsetIsRna (seqset))
            /* Do something. */;
    }
    else if(ajSeqsetIsNuc(seqset)   
    {
        /* Do something else */
    }

Such tests are seldom required however as the sequence type can be enforced at the level of the ACD file.

6.7.10. Calculating Sequence Properties

Various properties can be calculated for a sequence. For example, ajSeqCalcCount writes an array with the number of A, C, G and T nucleotides in a nucleotide sequence, whereas ajSeqCalcMolwt calculates the molecular weight of a protein sequence:

    AjPSeq       seq = NULL;
    ajint nbases[5];
    float molwt;

    seq    = ajAcdGetSeq("sequence");

    if(ajSeqIsNuc(seq))
        ajSeqCalcCount(seq, *nbases);
    else if(ajSeqIsProt(seq))
        molwt = ajSeqCalcMolwt(const AjPSeq seq);

Note

Note that various calculated attributes of sequences (Section A.6, “Calculated Attributes”), such as length and type, are available at the level of the ACD file.

6.7.11. Sequence String Functions

There are two functions to copy a sequence into a string, using as a target either an EMBOSS string or a C-type (char *) string:

char*  ajSeqGetSeqCopyC (const AjPSeq seq);
AjPStr  ajSeqGetSeqCopyS (const AjPSeq seq);

There are a few functions for processing sequences that are held as strings using an AjPStr object. These have the prefix ajSeqstr. Here ajSeqGetSeqCopyS is first used to copy the sequence into a string. The molecular weight of a protein sequence is calculated, or the reverse and complement of a nucleotide sequence:

    AjPSeq seq = NULL;
    AjPStr str = NULL;
    float  molwt;

    seq  = ajAcdGetSeq("sequence");

    str  = ajSeqGetSeqCopyS(seq); 

    if(ajSeqIsNuc(seq))
        ajSeqstrReverse(&str);
    else if(ajSeqIsProt(seq))
        molwt = ajSeqCalcMolwt(str);

6.7.12. Processing Sequences

Various functions for processing a sequence are provided. You've already come across ajSeqTrim and ajSeqsetTrim for trimming sequences to their start and end regions but other operations are possible. For example, gap characters can be processed using the following functions:

void  ajSeqGapFill (AjPSeq seq, ajuint len);
void  ajSeqGapStandard (AjPSeq seq, char gapchar);
void  ajSeqGapLen (AjPSeq thys, char gapc, char padc, ajint ilen);

ajSeqGapFill appends gap characters to a sequence up to a specified maximum sequence length. ajSeqGapStandard makes all the gaps in a string use the standard gap character provided. Any dash, tilde, full stop, question mark or blank space characters (- ~ . ? ) in the string are converted. ajSeqGapLen sets any non-sequence characters to valid gap characters, and pads with extra gaps if necessary to a specified length.

To count the gaps in a sequence (counting any possible gap character) call ajSeqCountGaps:

ajuint  ajSeqCountGaps (const AjPSeq seq);

Functions are available to complement, or reverse and complement a nucleotide sequence, if it hasn't already been done from ACD or command-line processing:

void  ajSeqComplement (AjPSeq seq);
void  ajSeqReverseDo (AjPSeq seq)

There are a few functions to convert a sequence, or all sequences in a set, to lower or upper case:

void  ajSeqsetFmtLower (AjPSeqset seq);
void  ajSeqsetFmtUpper (AjPSeqset seq);
void  ajSeqFmtLower (AjPSeq thys);
void  ajSeqFmtUpper (AjPSeq thys);

ajSeqsetFill can be used to fill a sequence set by putting gaps at the ends of any short sequences:

ajint ajSeqsetFill(AjPSeqset seq)

6.7.13. Sequence Type Validation

In most cases the validation of sequence type (see Section A.7, “Sequence Types”), i.e. whether a sequence is for a protein or nucleic acid, contains gaps and so on, is handled at the level of the ACD file. This means that the application always receives (via a call to ajAcdGet*) a sequence of the exact type specified in the ACD file.

Several functions are provided in AJAX for handling sequence type directly. The type of a sequence can be set automatically, i.e. determined from the sequence itself, by calling ajSeqType. Otherwise it can be set directly for a sequence, or a sequence set, to one of nucleotide or protein using the following functions:

void  ajSeqType (AjPSeq thys);        
void  ajSeqSetNuc (AjPSeq thys);      
void  ajSeqSetProt (AjPSeq thys);     
void  ajSeqsetSetNuc (AjPSeqset thys);   
void  ajSeqsetSetProt (AjPSeqset thys);

6.7.14. Sequence Conversion

It is sometimes necessary to convert a sequence into numeric form for convenient processing. The AJAX datatype for this is:

AjPSeqCvt: Used for sequence conversion into numeric form.

The common use-case is where a sequence character conversion table is generated for a scoring matrix, such as an amino acid substitution scoring matrix (see Section 6.10, “Handling Comparison Matrices”).

6.7.15. Handling Sequence Translation

Translation of a genetic sequence into a protein is a common task and is covered elsewhere (Section 6.8, “Handling Sequence Translation”).

6.7.16. Handling IUB Base Codes

EMBOSS supports the official IUB base codes as shown below and also X to indicate an unknown base.

6.7.16.1. Unambiguous codes

Unambiguous base codes are:

A (Adenine)
C (Cytosine)
G (Guanine)
T (Thymine) or U (Uracil)

6.7.16.2. Ambiguity codes

Either of two bases possible:

M (A or C)
R (A or G)
W (A or T/U)
S (C or G)
Y (C or T/U)
K (G or T/U)

Any of three bases possible:

B (C G or T/U)
D (A G or T/U)
H (A C or T/U)
V (A C or G)

Any of the four bases possible:

N (A C G or T/U)

An unknown (or non-standard) base:

X (not A C G or T/U)

6.7.16.3. Processing an IUB Base Code

The standard IUB base codes have 4 possibilities which can be represented in binary as 4 bits for rapid comparison including testing of ambiguity codes. Functions are provided for the character and binary representations of base codes.

The ambiguity codes and their numeric representations are read from local data file Ebases.iub

/* Sets up binary OR representation of all IUB bases in a table aj_base_table where A=1, C=2, G=4 and T=8. Also sets up a match probability array aj_base_prob holding the probability of one IUB base matching any other. Uses the Ebases.iub file.  Is initialised if necessary from other AJAX functions. */

/* Complements a nucleotide base. */
char   ajBaseAlphacharComp(char base);     

/* Returns an element of the base match probability array */
float  ajBaseAlphaCompare(ajint base1, ajint base2);

To clean up internal memory used for housekeeping of sequence base processing call ajBaseExit:

void  ajBaseExit (void);

This is called automatically by embExit.

6.7.16.4. Conversion of Base and Amino Acid Codes

Functions for converting an integer representation of a base code into something else:

/* Convert to string of matching base code(s). */
const AjPStr  ajBaseGetCodes (ajint ibase);    

/*  Convert to base code: 0 to 'A', 1 to 'B' ... 25 to 'Z'.  Returns an integer. */
ajint  ajBasecodeFromInt (ajint n);

To convert base code into something else:

/*    Convert to integer: 'A' to 0, 'B' to 1, ... 'Z' to 25.  Returns an integer. */
ajint  ajBasecodeToInt (ajint c);         

/* Convert to binary OR representation of an IUB base. */   
ajint  ajBaseAlphaToBin (ajint c);         

/* Convert to binary OR representation of an IUB base. */   
char  ajBaseAlphacharToBin (char c);

ajBaseAlphaToBin and ajBaseAlphacharToBin return a binary OR representation of an IUB base where A=1, C=2, G=4 and T=8. These functions use an internal base table.

To convert a binary OR representation of an IUB base into something else:

/* Converts a binary OR representation of an IUB base to an ambiguous DNA base code. */
char  ajBaseBinToAlpha (ajint c);

ajBaseBinToAlpha converts a binary OR representation of an IUB base where A=1, C=2, G=4 and T=8 into an ambiguous DNA base code (uses T rather than U). It uses the internal base table.

To interconvert single and three letter amino acid codes:

/* Convert single letter to three letter amino acid code. */
AjBool  ajResidueToTriplet (char c, AjPStr *Paa3);          

/* Convert three letter to single letter amino acid code. */
AjBool  ajResidueFromTriplet (char *aa4, char *Pc);

6.7.17. DNA Calculations

Various functions for calculating the properties of nucleotide sequences are provided in ajdan.h/c. These functions use a DNA melting object (AjPMelt) but this is transparent to the application programmer, you will never need to instantiate or use the object directly:

AjPMelt: Used for DNA melting.

6.7.17.1. General Functions for DNA Melting

The AjPMeltInit and AjPMeltExit functions are used internally to initialise and clean up. There is no need to call them directly. The AjPMeltGC function is a utility to calculate G+C content.

/* Initialises melt entropies, enthalpies and energies. Different data files are read for DNA or RNA heteroduplex. Also sets optional flag for array saving of the above. */
void  ajMeltInit (AjBool isdna, ajint savesize);

/* Cleans up DNA melting processing internal memory */
void  ajMeltExit (void);

/* Calculates GC fraction of a sequence allowing for ambiguity */
float  ajMeltGC (const AjPStr strand, ajint len);

6.7.17.2. Calculating DNA Melting and Annealing Temperatures and Energies

AjPMeltEnergy calculates melt energy and enthalpy/entropy for a sequence string. AjPMeltEnergy2 is an alternative interface that initialises a set of 'save' arrays to make later calls more efficient by inspecting the sequence only once. AjPMeltTemp calculates a melting temperature for an oligonucleotide, with a AjPMeltTempSave version that uses 'savce' arrays for efficiency. AjPMeltTempProd calculated the melting temperature of an oligonucleotide product under a given salt concentration. AjPMeltTemp caclulates the annealing temperature of a product and primer, given their respective melting temperatures.

float  ajMeltEnergy(const AjPStr strand, ajint len,
                    ajint shift, AjBool isDNA,
                    AjBool maySave, float *enthalpy, float *entropy);

float  ajMeltEnergy2(const char *strand, ajint pos, ajint len,           
                     AjBool isDNA,
                     float *enthalpy, float *entropy,
                     float **saveentr, float **saveenth, float **saveener);

float  ajMeltTemp (const AjPStr strand, ajint len, ajint shift, float saltconc,           
                   float DNAconc, AjBool isDNA);

float  ajMeltTempSave(const char *strand, ajint pos,
                      ajint len, float saltconc,            
                      float DNAconc, AjBool isDNA,
                      float **saveentr, float **saveenth, float **saveener);

float  ajMeltTempProd (float gc, float saltconc, ajint len);

float  ajAnneal (float tmprimer, float tmproduct);

Prev	Up	Next
6.6. Handling Sequence Patterns	Home	6.8. Handling Sequence Translation