There are a wide variety of sequence formats currently in use. They define the permitted layout and content of text in a file, including required text tokens that are used to define the fields in the databank. These fields include:
The sequence identifier name
Accession number
Bibliographic information
Sequence annotation including a description of any features
The sequence itself
Molecular features associated with the sequence, such as protein secondary structure or molecular recognition sites, are kept in a feature table.
The bibliographic information includes data such as:
Literature references
Experimental details
Author contact information
Cross-links to other databases
Descriptions, annotations and comments around the sequence
Much more besides!
An entry in a sequence databank will usually include a code to uniquely identify an entry in the database. The code is often an accession number and/or identifier name (ID). An accession number is a unique alphanumeric identifier guaranteed to stay with that sequence for the life of the database. If two sequences are merged into one then the new sequence will get a new accession number; the accession numbers of the merged sequences will be retained as 'secondary' accession numbers. EMBL, GenBank and SwissProt share an accession numbering scheme, so a given accession number uniquely identifies a sequence within these three databases. ID names, in contrast, are not guaranteed to remain the same between different versions of a database, although in practice they usually do.
FASTA is the most commonly used format, followed by those used in the major sequence databases. EMBOSS recognises nearly all of the known formats for input, and most can be used for output too. Sequences are referred to on the EMBOSS command line by their Uniform Sequence Address or USA (see the EMBOSS Users Guide). This is a standard sequence naming scheme used by all EMBOSS applications. A USA specifies one or more sequences that might be read from or written to a file or a larger databank. Other sequence sources such as applications or a web server can also be specified.
There is also a set of command line qualifiers that are used for sequence input and output. They allow you to set such things as file format, sequence regions, database and entry names etc. Such qualifiers (Section A.5, “Datatype-specific Attributes”) may also be "hard-coded" as attributes in the ACD file.
As you would expect, AJAX includes very well developed functionality for handling sequences. This includes:
Recovering or setting elements in the sequence objects.
Testing sequence properties, for instance whether a sequence is protein or nucleotide.
Calculating sequence properties, for example base composition of a nucleotide sequence or the molecular weight of a protein sequence.
Interconversion of sequences and strings.
Processing sequences, for instance trimming it to defined regions or removing gap characters.
Sequence type validation; whether a sequence is protein or nucleic acid, contains gaps and so on. This is usually handled at the level of the ACD file but can be done directly.
Conversion of sequence to numerical form for convenient processing, for example as an index for a scoring matrix.
Translation of a genetic sequence into a protein.
Interconversion of nucleotide bases and amino acid codes.
Calculating the properties of nucleotide sequences, for example DNA melting and annealing temperatures and energies.
Application input and output sequences are typically defined in the ACD file and retrieved by appropriate calls to ajAcdGet*
functions. There are ACD datatypes for input of single sequences, sequential access to multiple sequences or multiple sequences in one or more sets. There are corresponding datatypes for sequence output. Sequence objects can also be created directly if that's required. Sequences are held internally as strings and required memory is dynamically reallocated as needed. This means that there are no arbitrary limits when using a function with a sequence object.
AJAX library files for handling sequences are listed in the table (Table 6.9, “AJAX Library Files for Handling Sequences”). Library file documentation, including a complete description of datatypes and functions, is available at:
http://emboss.open-bio.org/rel/dev/libs/ |
Library File Documentation | Description |
---|---|
ajseq | General sequence handling |
ajseqdata | Sequence datatypes |
ajseqread | Sequence reading |
ajseqwrite | Sequence writing |
ajseqdb | Sequence database access methods |
ajseqabi | Sequence parsing from ABI trace file |
ajseqtype | Validation and processing for sequences types |
ajtranslate | Sequence translation |
ajbase | Base conversion functions |
ajdan | Nucleic acid functions |
ajseq.h/c
. Most of the functions you will ever need for general sequence handling. They define the AjPSeqCvt
object used for sequence conversion (Section 6.7.14, “Sequence Conversion”.
ajseqdata.h/c
. Define the basic sequence objects (AjPSeq
, AjPSeqset
and AjPSeqall
) for general use e.g. retrieving sequences via ACD file processing. They also define a sequence input object (AjPSeqin
) and a sequence query object (AjPSeqQuery
) used for low level sequence retrieval.
ajseqread.h/c
. Functions for handling the sequence query and sequence input objects and a few general sequence handling functions. They also define static data structures and functions for handling sequence input from the supported formats at a low level.
ajseqwrite.h/c
. Defines the sequence output object (AjPSeqout
) and include functions for writing sequences to file. They define static data structures and functions for handling sequence output, in the supported formats, at a low level.
ajseqdb.h/c
. Defines a sequence database reading object (SeqSAccess
) used in the definition of AjPSeqin
and static data structures and functions for handling all aspects of sequence database access.
ajseqabi.h/c
. Provides functions for parsing sequences (and other data) from ABI trace files. See the library documentation () for further information.
ajseqtype.h/c
. Functions for handling of sequence types. They also contain static data structures and functions for validation and processing of sequence types at a low level.
ajtranslate.h/c
. Defines a sequence translation object (AjPTrn
) and include functions for handling sequence translation (Section 6.8, “Handling Sequence Translation”).
ajbase.h/c
. Contains functions for character conversion of nucleotide bases and other character conversion functions for use with molecular sequences. For further information see Section 6.7.16, “Handling IUB Base Codes”.
ajdan.h/c
. Contains functions for calculating properties of nucleotide sequences, mostly related to DNA melting/annealing. For further information see Section 6.7.17, “DNA Calculations”.
You are unlikely to need the static data structures and functions in the above files unless you plan to implement code to support new sequence formats for EMBOSS. For advice on how to do this ask the EMBOSS developers.
Detailed usage notes for ajseqdata.h
, ajseqread.h/c
and ajseqwrite.h/c
are not given here. For advice on how to use the datatypes and functions ask the EMBOSS developers.
There are four ACD datatypes and corresponding methods for handling sequence input. Often just a single sequence is required. Where multiple sequences are needed they may be read either sequentially, for example from a database, or read and held in memory as a set, for example from a sequence alignment:
There are three datatypes for handling sequence output to match their input equivalents. They have identical behaviour :
Typical ACD definitions for sequence input:
# single input sequence sequence: sequence [ parameter: "Y" type: "gapany" ] # multiple sequence input read sequentially seqall: sequence [ parameter: "Y" type: "gapany" ] # multiple sequence input read as a single set seqset: sequence [ parameter: "Y" type: "gapany" aligned: "N" ] # multiple sequence input read as multiple sets seqsetall: sequence [ parameter: "Y" type: "gapany" aligned: "N" ]
Typical ACD definitions for sequence output:
# single sequence seqout: outseq [ parameter: "Y" type: "gapany" ] # multiple sequence output (sequential writing) seqoutall: outseq [ parameter: "Y" type: "gapany" ] # multiple sequence output (written as a set) seqoutset: outseq [ parameter: "Y" type: "gapany" ]
All data definitions for sequence input and output should have standard parameter names. These include:
sequence
for any sequence inputs
outseq
for any outputs
Alternatives and variations (e.g. asequence
and bsequence
for multiple inputs are allowed.
Attributes that are typically specified are summarised below. They are datatype-specific (Section A.5, “Datatype-specific Attributes”) unless they are indicated as being global attributes (Section A.4, “Global Attributes”).
parameter:
Sequences are typically the primary input or output of an EMBOSS application and, as such, should be defined as parameters by using the global attribute parameter: "Y"
.
type:
Specifies the type of the input or output sequence(s) for validation purposes. See Section A.7, “Sequence Types”.
aligned:
Any seqset
or seqsetall
datatype must have the aligned:
attribute set to indicate whether the sequences are aligned or not.
features:
Sequence features (Section 6.9, “Handling Features”) will (if available) be read or written if the features:
ACD attribute is set for the input or output sequence(s).
osformat:
FASTA format is used by default for the output sequence(s). The format is normally set at the command line but a default may be hard-coded with osformat:
. All common sequence formats are supported (see the EMBOSS Users Guide).
There are 9 AJAX datatypes for handling sequence. Four of these are for general use including handling sequences defined in the ACD file:
AjPSeq
A single sequence (for sequence
ACD datatype).
AjPSeqall
Sequence input stream for sequential access to multiple sequences (for seqall
ACD datatype).
AjPSeqset
Multiple sequences in one or more sets (for seqset
and seqsetall
ACD datatypes).
AjPSeqout
Sequence output stream (for seqout
ACD datatype).
Two are for sequence-related operations:
The rest are for lower level sequence handling beyond that provided by the static datatypes in the various library files:
AjPSeqin
Holds the sequence USA specification and information needed to read the sequence and possible further sequences.
SeqSAccess
Holds information needed to read a sequence from a database: access methods are defined for each known database type and sequences are read from the database using the defined database access function. It is used in the definition of AjPSeqin
.
AjPSeqQuery
Used for interpreting the entry specification part of a USA (see the EMBOSS Users Guide).
You are unlikely to need these specialised objects unless you plan to implement code to support new sequence formats for EMBOSS. For advice on how to do this ask the EMBOSS developers.
Datatypes and functions for handling sequences via the ACD file are shown below (Table 6.10, “Datatypes and Functions for Sequence Input and Output”).
To read a single sequence | To read a set of sequences (e.g. an alignment) | To read sequences sequentially from a database | |
Sequence Input | |||
---|---|---|---|
ACD datatype | sequence | seqall | seqset |
AJAX datatype | AjPSeq | AjPSeqall | AjPSeqset |
To retrieve from ACD | ajAcdGetSeq | ajAcdGetSeqall | ajAcdGetSeqset |
To get individual sequences | (ajAcdGetSeq) | ajSeqallNext | ajSeqsetGetSize, ajSeqsetGetseqSeq |
Sequence Output | |||
To write a single sequence | To write sequences to a database | To write a set of sequences (e.g. an alignment) | |
ACD datatype | seqout | seqoutall | seqoutset |
AJAX datatype | AjPSeqout | AjPSeqout | AjPSeqout |
To open the output stream | ajAcdGetSeqout | ajAcdGetSeqoutall | ajAcdGetSeqoutset |
To write sequences to file | ajSeqoutWriteSeq | ajSeqoutWriteSeq | ajSeqoutWriteSet |
Your application code will call embInit
to process the ACD file and command line (see Section 6.3, “Handling ACD Files”). All values from the ACD file are read into memory and files are opened as necessary. You have a handle on the files and memory through the ajAcdGet*
family of functions which return pointers to appropriate objects.
To retrieve an input sequence an object pointer is declared and initialised using the appropriate ajAcdGet*
function.
A seqall
object is really a means to an end of returning individual sequences in a loop. ajSeqallNext
is used to iterate through the sequences:
AjPSeqall seqall = NULL; AjPSeq seq = NULL; seqall = ajAcdGetSeqall("sequence"); while(ajSeqallNext(seqall, &seq)) { /* Do something with 'seq' */ }
AjPSeqset seqset = NULL; seqset = ajAcdGetSeqset("sequence");
ajAcdGetSeqsetall
is used which returns a pointer for an array of seqset
objects:
AjPSeqset *seqset = NULL; seqset = ajAcdGetSeqsetall("sequence");
Of course, to access an individual set the array notation is used. Code to achieve the same effect as ajAcdGetSeqsetallSingle
is as follows:
AjPSeqset *seqset = NULL; AjPSeqset seqsetFirst=NULL; seqset = ajAcdGetSeqsetall("sequence"); seqsetFirst = seqset[0]; /* Picks up the pointer to the first sequence set. */
To retrieve an output sequence stream an object pointer is declared and initialised using the appropriate ajAcdGet*
function. Note that multiple sequence sets are written with using ajAcdGetSeqoutset
In cases where just the first of multiple sets is required ajAcdGetSeqsetallSingle
is used to return a single seqset
object:
AjPSeqset ajAcdGetSeqsetallSingle (const char *token);
It is used as follows:
AjPSeqset seqset = NULL; seqset = ajAcdGetSeqsetallSingle("sequence");
The memory for any additional sequence sets is automatically cleaned when embExit
is called.
All sequence input datatypes have various inbuilt command line qualifiers (see the EMBOSS Users Guide) including -sbegin
and -send
which specify a start and end position for a sequence.
Whenever a sequence is read these values are held in the appropriate sequence object. Regardless of their value you still get the entire sequence loaded into memory. The function ajSeqTrim
(or ajSeqsetTrim
for a AjPSeqset
object) is used to trim the sequence to the region defined by -sbegin
and -send
:
AjPSeq seq = NULL; seq = ajAcdGetSeq("sequence"); ajSeqTrim(seq);
There are two AJAX functions for writing out sequence information.
void ajSeqoutWriteSeq (AjPSeqout outseq, const AjPSeq seq); void ajSeqoutWriteSet (AjPSeqout seqout, const AjPSeqset seq);
An AjPSeq
object is used to write sequences from a seqall
object.
AjPSeq seq = NULL; AjPSeqout seqout = NULL; seq = ajAcdGetSeq("sequence"); seqout = ajAcdGetSeqout("outseq"); ajSeqoutWriteSeq(seqout, seq);
Typically ajSeqallNext
is used to iterate through the sequences in loop and pick up a pointer (seqPtr
in this case) to the next sequence for writing:
AjPSeqall seqall = NULL; AjPSeqout seqout = NULL; AjPSeq seqPtr = NULL; seqall = ajAcdGetSeqall("sequence"); seqout = ajAcdGetSeqout("outseq"); while(ajSeqallNext(seqall, &seqPtr)) { ajSeqoutWriteSeq(seqout, seqPtr); }
Here the entire sequence set is written to file:
AjPSeqset seqset = NULL; AjPSeqout seqout = NULL; seqset = ajAcdGetSeqset("sequence"); seqout = ajAcdGetSeqout("outseq"); ajSeqoutWriteSet(seqout, seqset);
Multiple sequence sets must be written one at a time. Array notation is used to access individual sets:
AjPSeqset *seqset = NULL; AjPSeqout seqout = NULL; ajint i = 0; seqset = ajAcdGetSeqsetall("sequence"); seqout = ajAcdGetSeqout("outseq"); for(i=0; seqset[i] != NULL; i++) ajSeqoutWriteSet(seqout, seqset[i]);
It is your responsibility to close any files and free up memory at the end of the program.
To close an output sequence stream use ajSeqoutClose
:
AjPSeqout seqout = NULL; seqout = ajAcdGetSeqout("outseq"); ... ajSeqoutClose(seqout);
To use a sequence object that is not defined in the ACD file you must first instantiate the appropriate object pointer. Three default construction functions are provided, one for each access method:
AjPSeq ajSeqNew (void); /* Sequence */ AjPSeqset ajSeqsetNew (void); /* Sequence set */ AjPSeqall ajSeqallNew (void); /* Sequence stream */
All constructors return the address of a new object. The pointers do not need to be initialised to NULL
but it is good practice to do so:
AjPSeq seq = NULL; AjPSeqset seqset = NULL; AjPSeqall seqall = NULL; seq = ajSeqNew(); seqset = ajSeqsetNew(); seqall = ajSeqallNew(); /* The objects are instantiated and ready for use */
Sequence output objects are typically loaded from ACD file processing (see above). In the unlikely event you do need to create a sequence output object manually you can use the default seqout
object constructor:
AjPSeqout seqout = NULL; seqout = ajSeqoutNew(); /* The object is instantiated and ready for use */
There is a sequence object destruction function for each access method, and one for arrays of sequence sets:
void ajSeqDel (AjPSeq* Pseq); void ajSeqallDel (AjPSeqall *Pseq); void ajSeqsetDel (AjPSeqset *Pseq); void ajSeqsetDelarray (AjPSeqset **PPseq);
They are used as follows:
AjPSeq seq = NULL; AjPSeqall seqall = NULL; AjPSeqset seqset = NULL; AjPSeqset *seqsets = NULL; seq = ajAcdGetSeq("asequence"); seqall = ajAcdGetSeqall("bsequence"); seqset = ajAcdGetSeqset("csequence"); seqsets = ajAcdGetSeqsetall("dsequence"); ... ajSeqDel(&seq); ajSeqallDel(&seqall); ajSeqsetDel(&seqset); ajSeqsetDelarray(&seqsets);
You must free the memory for an object once you are finished with it. Destructor functions are provided:
void ajSeqDel (AjPSeq* Pseq); /* Sequence */ void ajSeqallDel (AjPSeqall *Pseq); /* Sequence stream */ void ajSeqsetDel (AjPSeqset *Pseq); /* Sequence set */
For convenience, there is also a destructor for arrays of seqset
objects:
/* Destroy an array of sequence set objects and the array. */ void ajSeqsetDelarray (AjPSeqset **PPseq);
AjPSeq seq = NULL; AjPSeqset seqset = NULL; AjPSeqall seqall = NULL; seq = ajSeqNew(); seqset = ajSeqsetNew(); seqall = ajSeqallNew(); /* Do something with the instantiated objects */ ajSeqDel(&seq); ajSeqsetDel(&seqset); ajSeqallDel(&seqall); /* The memory is freed and the pointers reset to NULL, ready for re-use. */ seq = ajSeqNew(); seqset = ajSeqsetNew(); seqall = ajSeqallNew(); /* Do something else with the new objects. The pointer variable is reallocated. */ ajSeqDel(&seq); ajSeqsetDel(&seqset); ajSeqallDel(&seqall); /* Done with the objects so the memory is freed. */
Similarly, for arrays of sequence sets:
AjPSeqset *seqset = NULL; seqset = ajAcdGetSeqsetall("sequence"); /* Do something with the sequence set array */ ajSeqsetDelarray(&seqset);
When managing memory for the sequence objects, knowledge of a function's behaviour is required. There are three cases to discern:
A function requires a pre-existing object.
A function can use but does not require a pre-existing object, and will allocate one if necessary.
A function always allocates an object, and either returns a pointer to it or allocates an object pointer to it, the address of which is passed as an argument.
In most but not all cases it is obvious from the function name whether a function is a constructor or merely loads a sequence. In future code refactoring the functionality will be more obvious in the function names.
There are a variety of alternative ways to create a sequence object. To create a sequence with a reserved size use:
AjPSeq ajSeqNewRes (size_t size);
A sequence can be created from a string where the name of the sequence is also provided. Other strings are used in order to create the sequence so three objects need to be freed later:
AjPSeq seq = NULL; AjPStr str = NULL; AjPStr strname = NULL; str = ajStrNewC("ASEQUENCE"); strname = ajStrNewC("Sequence Name"); seq = ajSeqNewNameS(str, strname); /* Do something with sequence */ ajStrDel(&str); ajStrDel(&strname); ajSeqDel(&seq);
Or from a string where the start and end region (given as offsets) are defined. The string is copied into the sequence object and the sequence is trimmed to the region defined:
AjPSeq seq = NULL; AjPStr str = NULL; ajint offstart = 1; ajint offend = 8; str = ajStrNewC("ASEQUENCE"); seq = ajSeqNewRangeS(str, offstart, offend, ajFalse); ajStrDel(&str); ajSeqDel(&seq);
The ajSeqNewRangeS
call above will create the subsequence "SEQUENCE
". The last function argument is a boolean which, if set true, means that the specified offset values are reversed and the sequence is set to be reversed when printed. Had you called the function like this:
seq = ajSeqNewRangeS(str, offstart, offend, ajTrue);
then the start of the sequence is set to offend
(8), the end set to offstart
(1) and the printed (reverse) sequence would be ECNEUQES
:
seq = ajSeqNewRangeS(str, offstart, offend, 1);
A sequence can also be created and initialised from an existing sequence by calling ajSeqNewSeq
, which copies the original sequence, so two objects need destroying later:
AjPSeq seq = NULL; AjPSeq seqCopy = NULL; seq = ajAcdGetSeq("sequence"); seqCopy = ajSeqNewSeq(seq); /* Do something with the sequences */ ajSeqDel(&seq); ajSeqDel(&seqCopy);
An array of sequences can be created from an existing sequence set using ajSeqsetGetSeqarray
. The array is created and copies of the sequences are taken. This function will be renamed in the future to reflect the fact that it is a constructor function. To ensure that all memory is freed, ajSeqsetGetSize
is called to get the size of the array, and the array itself is freed by calling AJFREE
:
AjPSeqset seqset = NULL; AjPSeq * seqs = NULL; ajint nseq; /*Number of sequences in set */ ajint x; seqset=ajAcdGetSeqset("sequences"); seqs = ajSeqsetGetSeqarray(seqset); /* Do something with the array of sequences */ ajSeqsetDel(&seqset); nseq = ajSeqsetGetSize(seqset); for(x=0; x<nseq; x++) ajSeqDel(&seqs[x]); AJFREE(seqs);
Three functions are provided for loading a pre-existing AjPSeqset
object with sequences.
A sequence set can be built from a pair of sequences (in this case from ACD file processing) by using ajSeqsetFromPair
. This copies the sequences:
AjPSeq seqa = NULL; AjPSeq seqb = NULL; AjPSeqset seqset = NULL; ajint nset = 0; /* Number of sequences in set */ seqa = ajAcdGetSeq("asequence"); seqb = ajAcdGetSeq("bsequence"); seqset = ajSeqsetNew(); nset = ajSeqsetFromPair(seqset, seqa, seqb); /* Do something with set */ ajSeqDel(&seqa); ajSeqDel(&seqb); ajSeqsetDel(&seqset);
Or from a list of sequences by using ajSeqsetFromList
, which in contrast to ajSeqsetFromPair
loads the set with the actual sequences (not copies). Therefore you have to be careful not to try to free the same objects twice. ajListFree
is called to delete the list but not the nodes (sequences). The latter were deleted by ajSeqsetDel
:
AjPSeq seqa = NULL; AjPSeq seqb = NULL; AjPList list = NULL; AjPSeqset seqset = NULL; ajint nset = 0; /* Number of sequences in set */ seq1 = ajAcdGetSeq("asequence"); seq2 = ajAcdGetSeq("bsequence"); /* Build list of sequences */ list = ajListNew(); ajListPush(list, (void *)seq1); ajListPush(list, (void *)seq2); seqset = ajSeqsetNew(); nset = ajSeqsetFromList(seqset, list); /* Do something with set */ ajSeqsetDel(&seqset); ajListFree(&list);
You can append to a sequence set using ajSeqsetApp
, which copies the sequences:
AjPSeq seqa = NULL; AjPSeq seqb = NULL; AjPSeqset seqset = NULL; ajint nset = 0; /* Number of sequences in set */ seq1 = ajAcdGetSeq("asequence"); seq2 = ajAcdGetSeq("bsequence"); seqset = ajSeqsetNew(); ajSeqsetApp(seqset, seqa); ajSeqsetApp(seqset, seqb); /* Do something with set */ ajSeqDel(&seqa); ajSeqDel(&seqb); ajSeqsetDel(&seqset);
An output sequence object can be created from an open file. In this case an output file isn't created during ACD file processing. The file is appropriated by the function so only one destructor needs to be called:
AjPFile file = NULL; AjPSeqout seqout = NULL; /* Open a file called "InputFileName" */ file = ajFileNewInNameC ("InputFileName"); /* Create an output sequence object that uses this file */ seqout = ajSeqoutNewFile(file); /* Do something with the output sequence object */ ajSeqoutDel(&seqout);
An existing output sequence object can be loaded (and a file opened for writing) by using ajSeqoutOpenFilename
which, given a valid filename, returns a boolean which is ajTrue
if the file was opened ok:
AjPSeqout seqout = NULL; AjPStr name = NULL; /* Create an empty seqout object */ seqout = ajSeqoutNew(); ajStrAssignC(&name, "InputFileName"); if(!ajSeqoutOpenFilename(seqout, name)) /* Error handling */; /* Do something with the output sequence object */ ajSeqoutDel(&seqout);
Functions are provided to load the sequence input objects with one or more sequences, or initialise the output sequence object, given a USA (see the EMBOSS Users Guide) of an input or output sequence(s) or sequence stream.
A sequence can be loaded into a sequence object using ajSeqGetFromUsa
. The second argument is true if the sequence is protein, and the function returns ajTrue
if the sequence was loaded successfully:
AjPSeq seq = NULL; AjPStr usa = NULL; usa = ajStrNewC("asis::ALVLLYNITEF"); /* "asis::ALVLLYNITEF" is a valid USA */ seq = ajSeqNew(); /* Create sequence object */ if(!ajSeqGetFromUsa(usa, 1, &seq)) /* Do something to handle the failure */; else /* Sequence was loaded successfully .. do something with it */; ajSeqDel(&seq);
A sequence set can be created using ajSeqsetGetFromUsa
in the same way as ajSeqGetFromUsa
:
AjPSeq seqset = NULL; AjPStr usa = NULL; seqset = ajSeqsetNew(); /* Create sequence object */ usa = ajStrNewC("swiss:hba_*"); /* ""swiss:hba_*"" is a valid USA */ if(!ajSeqsetGetFromUsa(usa, 1, &seqset)) /* Do something to handle the failure */; else /* Sequence set was loaded successfully .. do something with it */; ajSeqsetDel(&seqset);
A sequence input stream is created from a USA using ajSeqallFile
, which in contrast to the previous two functions returns a pointer to the new object. This function will be renamed in the future to reflect the fact that it is a constructor function:
AjPSeq seqall = NULL; AjPStr usa = NULL; usa = ajStrNewC("swiss:hba_*"); /* ""swiss:hba_*"" is a valid USA */ if(!(seqall=ajSeqallFile(usa))) /* Do something to handle the failure */; else /* Sequence input stream was opened successfully .. do something with it */; ajSeqallDel(&seqall);
ajSeqoutClearUsa
creates (if necessary) or resets a sequence output object using a new USA:
AjPSeqout seqout = NULL; AjPStr usa = NULL; usa = ajStrNewC("asis::ALVLLYNITEF"); /* "asis::ALVLLYNITEF" is a valid USA */ seqout = ajSeqoutNew(); ajSeqoutClearUsa (&seqout, usa); ajSeqoutDel(&seqout);
The code would work equally well if the call to the constructor function was omitted.
AJAX functions are provided to recover or to set elements (data fields) in the basic sequence objects.
Functions to assign a value to one element of an existing sequence object have the prefix ajSeqAssign
.
In the following example, ajSeqAssignDescC
and ajSeqAssignNameS
are used to assign a name and description to a region (user-defined by -sbegin
and -send
values) of the input sequence:
AjPSeq seq = NULL; AjPSeq seqsub = NULL; seq = ajAcdGetSeq("sequence"); seqsub = ajSeqNewSeq(seq); ajSeqTrim(seqsub); ajSeqAssignNameC(seqsub, "Subsequence"); ajSeqAssignDescC(seqsub, "Subsequence defined by sbegin and send values."); ajSeqAssignAccC(seqsub, "YZ456789");
There are assignment functions for other sequence object elements: ajSeqAssignEntryC
(entry name), ajSeqAssignFileC
(file name), ajSeqAssignFullC
(full sequece name), ajSeqAssignGiC
(NCBI GI number), ajSeqAssignSeqC
(sequence as a string), ajSeqAssignSvC
(EMBL sequence version), ajSeqAssignUfoC
(uniform feature object string for feature input), ajSeqAssignUsaC
(sequence input uniform sequence address, e.g.
) and versions using string parameters with the suffix filename
:entryname
S
instead of C
.
The ajSeqSet*
family are functions are used to set other sequence properties (not single elements). For example ajSeqSetRange
sets the range of a sequence object, in this case to the first 10 residues:
AjPSeq seq=NULL; ajint start=0; ajint end=9; seq = ajSeqNew(); ... ajSeqSetRange(seq, start, end);
There are set functions for other sequence object elements: ajSeqSetOffsets
(sets the offset from the start, and the full length to define the offset at the end), ajSeqSetRangeRev
(sets the range as above and sets the sequence to be reversed), ajSeqSetunique
(makes a sequence unique so that the sequence characters can be modified without side effects)
"Set" functions are available for the other datatypes too:
AjPSeqset seqset=NULL; AjPSeqset seqall=NULL; ajint start = 0; ajint end = 9; seqset = ajSeqNew(); seqall = ajSeqNew(); ... ajSeqsetSetRange(seq, start, end); ajSeqallSetRange(seq, start, end);
These are the only useful functions at the sequence stream and sequence set level. Other properties are set for each individual sequence in the stream ort set using the sequence "Set" or "Assign" functions.
The ajSeqGet*
family of functions are used to return the contents of a sequence object. The functions return pointers to the object's actual contents so the caller must take care not to unintentionally change the returned value in any way. Here, two flavours of function are used. ajSeqGetDescC
returns the sequence description whereas ajSeqGetNameS
returns the sequence name. The suffix "S
" is used when an EMBOSS string is returned whereas "C
" indicates a C-type (char *
) string is returned:
AjPSeq seq = NULL; AjPStr str = NULL; AjPStr strcopy = NULL; char * strc = NULL; seq = ajAcdGetSeq("sequence"); strc = ajSeqGetDescC(seq); str = ajSeqGetNameS(seq); /* Do something (but don't change) the returned strings */ ajStrAssignS(&strcopy, str); /* Copy of string is now safe to change ... do something with it */ ajSeqDel(&seq); ajStrDel(&str);
All the common sequence attributes can be retrieved like this, for example the accession and GI numbers using:
const AjPStr ajSeqGetGiS(const AjPSeq seq) const AjPStr ajSeqGetAccS (const AjPSeq seq);
The sequence itself is retrieved using one of:
const char* ajSeqGetSeqC (const AjPSeq seq); const AjPStr ajSeqGetSeqS (const AjPSeq seq);
The function ajSeqGetOffset
will return the -sbegin
value the user specified, i.e. the offset to the start of the sequence. ajSeqGetOffend
will return the number of positions to be removed from the end as defined by the sequence length and the -send
value:
AjPSeq seq = NULL; ajint offset; ajint offend; seq = ajAcdGetSeq("sequence"); offset = ajSeqGetOffset(seq); offend = ajSeqGetOffend(seq); /* Do something here using offset and offend, perhaps code to retain the trimmed regions. */ ajSeqTrim(seq);
"Get" functions are available for the other datatypes. For instance to get the number of sequences and range of a sequence set, or the filename and USA of a sequence input stream:
AjPSeq seqset = NULL; AjPSeqall seqall = NULL; ajint start; ajint end; ajint nseq; AjPStr name = NULL; AjPStr usa = NULL; seqset = ajAcdGetSeqset("sequence"); seqall = ajAcdGetSeqall("sequences"); nseq = ajSeqsetGetSize(seqset); ajSeqsetGetRange(seqset, &start, &end); name = ajSeqallGetFilename(seqall); usa = ajSeqallGetUsa(seqall);
There are "Get" functions that return a property of a sequence in a set. These have the prefix ajSeqsetGetseq
. The number of the required sequence is specified. For instance three functions are provided to return a given sequence in a set, either as a sequence object pointer, an EMBOSS string or as a C-type (char *
) string. Again, care should be taken not to unwittingly change the object data in any way:
AjPSeq seqset = NULL; AjPSeq seqPtr AjPStr seqS; char * seqC; seqset = ajAcdGetSeqset("sequence"); /* First sequence in call cases */ seqPtr = ajSeqsetGetseqSeq (seqset, 1); seqS = ajSeqsetGetseqSeqS (seqset, 1); seqC = ajSeqsetGetseqSeqC (seqset, 1);
Functions to return a property of a sequence in a stream have the prefix ajSeqallGetseq
and return properties of the current sequence in the stream. For instance, to get the name and length of the current sequence:
AjPSeq seqall = NULL; AjPStr name=NULL; ajint len; seqall = ajAcdGetSeqall("sequences"); name = ajSeqallGetseqName (seqall); len = ajSeqallGetseqLen(seqall)
Sequence properties can be tested, e.g. for whether a sequence is protein or nucleotide. Such functions return AjBool
:
AjPSeq seq = NULL; AjPSeqset seqset = NULL; seq = ajAcdGetSeq("sequence"); seqset = ajAcdGetSeq("sequences"); if(ajSeqIsProt(seq)) /* Protein sequence */; else if(ajSeqIsNuc(seq)) /* Nucleotide sequence */; else /* Error handling */; if(ajSeqsetIsNuc(seqset) { if(ajSeqsetIsRna (seqset)) /* Do something. */; } else if(ajSeqsetIsNuc(seqset) { /* Do something else */ }
Such tests are seldom required however as the sequence type can be enforced at the level of the ACD file.
Various properties can be calculated for a sequence. For example, ajSeqCalcCount
writes an array with the number of A
, C
, G
and T
nucleotides in a nucleotide sequence, whereas ajSeqCalcMolwt
calculates the molecular weight of a protein sequence:
AjPSeq seq = NULL; ajint nbases[5]; float molwt; seq = ajAcdGetSeq("sequence"); if(ajSeqIsNuc(seq)) ajSeqCalcCount(seq, *nbases); else if(ajSeqIsProt(seq)) molwt = ajSeqCalcMolwt(const AjPSeq seq);
Note that various calculated attributes of sequences (Section A.6, “Calculated Attributes”), such as length and type, are available at the level of the ACD file.
There are two functions to copy a sequence into a string, using as a target either an EMBOSS string or a C-type (char *
) string:
char* ajSeqGetSeqCopyC (const AjPSeq seq); AjPStr ajSeqGetSeqCopyS (const AjPSeq seq);
There are a few functions for processing sequences that are held as strings using an AjPStr
object. These have the prefix ajSeqstr
. Here ajSeqGetSeqCopyS
is first used to copy the sequence into a string. The molecular weight of a protein sequence is calculated, or the reverse and complement of a nucleotide sequence:
AjPSeq seq = NULL; AjPStr str = NULL; float molwt; seq = ajAcdGetSeq("sequence"); str = ajSeqGetSeqCopyS(seq); if(ajSeqIsNuc(seq)) ajSeqstrReverse(&str); else if(ajSeqIsProt(seq)) molwt = ajSeqCalcMolwt(str);
Various functions for processing a sequence are provided. You've already come across ajSeqTrim
and ajSeqsetTrim
for trimming sequences to their start and end regions but other operations are possible. For example, gap characters can be processed using the following functions:
void ajSeqGapFill (AjPSeq seq, ajuint len); void ajSeqGapStandard (AjPSeq seq, char gapchar); void ajSeqGapLen (AjPSeq thys, char gapc, char padc, ajint ilen);
ajSeqGapFill
appends gap characters to a sequence up to a specified maximum sequence length. ajSeqGapStandard
makes all the gaps in a string use the standard gap character provided. Any dash, tilde, full stop, question mark or blank space characters (- ~ . ?
) in the string are converted. ajSeqGapLen
sets any non-sequence characters to valid gap characters, and pads with extra gaps if necessary to a specified length.
To count the gaps in a sequence (counting any possible gap character) call ajSeqCountGaps
:
ajuint ajSeqCountGaps (const AjPSeq seq);
Functions are available to complement, or reverse and complement a nucleotide sequence, if it hasn't already been done from ACD or command-line processing:
void ajSeqComplement (AjPSeq seq); void ajSeqReverseDo (AjPSeq seq)
There are a few functions to convert a sequence, or all sequences in a set, to lower or upper case:
void ajSeqsetFmtLower (AjPSeqset seq); void ajSeqsetFmtUpper (AjPSeqset seq); void ajSeqFmtLower (AjPSeq thys); void ajSeqFmtUpper (AjPSeq thys);
ajSeqsetFill
can be used to fill a sequence set by putting gaps at the ends of any short sequences:
ajint ajSeqsetFill(AjPSeqset seq)
In most cases the validation of sequence type (see Section A.7, “Sequence Types”), i.e. whether a sequence is for a protein or nucleic acid, contains gaps and so on, is handled at the level of the ACD file. This means that the application always receives (via a call to ajAcdGet*
) a sequence of the exact type specified in the ACD file.
Several functions are provided in AJAX for handling sequence type directly. The type of a sequence can be set automatically, i.e. determined from the sequence itself, by calling ajSeqType
. Otherwise it can be set directly for a sequence, or a sequence set, to one of nucleotide
or protein
using the following functions:
void ajSeqType (AjPSeq thys); void ajSeqSetNuc (AjPSeq thys); void ajSeqSetProt (AjPSeq thys); void ajSeqsetSetNuc (AjPSeqset thys); void ajSeqsetSetProt (AjPSeqset thys);
It is sometimes necessary to convert a sequence into numeric form for convenient processing. The AJAX datatype for this is:
AjPSeqCvt
Used for sequence conversion into numeric form.
The common use-case is where a sequence character conversion table is generated for a scoring matrix, such as an amino acid substitution scoring matrix (see Section 6.10, “Handling Comparison Matrices”).
Translation of a genetic sequence into a protein is a common task and is covered elsewhere (Section 6.8, “Handling Sequence Translation”).
EMBOSS supports the official IUB base codes as shown below and also X
to indicate an unknown base.
Unambiguous base codes are:
A
(Adenine)
C
(Cytosine)
G
(Guanine)
T
(Thymine) or U
(Uracil)
Either of two bases possible:
M
(A
or C
)
R
(A
or G
)
W
(A
or T
/U
)
S
(C
or G
)
Y
(C
or T
/U
)
K
(G
or T
/U
)
Any of three bases possible:
B
(C
G
or T
/U
)
D
(A
G
or T
/U
)
H
(A
C
or T
/U
)
V
(A
C
or G
)
Any of the four bases possible:
N
(A
C
G
or T
/U
)
An unknown (or non-standard) base:
X
(not A
C
G
or T
/U
)
The standard IUB base codes have 4 possibilities which can be represented in binary as 4 bits for rapid comparison including testing of ambiguity codes. Functions are provided for the character and binary representations of base codes.
The ambiguity codes and their numeric representations are read from local data file Ebases.iub
/* Sets up binary OR representation of all IUB bases in a table aj_base_table where A=1, C=2, G=4 and T=8. Also sets up a match probability array aj_base_prob holding the probability of one IUB base matching any other. Uses the Ebases.iub file. Is initialised if necessary from other AJAX functions. */ /* Complements a nucleotide base. */ char ajBaseAlphacharComp(char base); /* Returns an element of the base match probability array */ float ajBaseAlphaCompare(ajint base1, ajint base2);
To clean up internal memory used for housekeeping of sequence base processing call ajBaseExit
:
void ajBaseExit (void);
This is called automatically by embExit
.
Functions for converting an integer representation of a base code into something else:
/* Convert to string of matching base code(s). */ const AjPStr ajBaseGetCodes (ajint ibase); /* Convert to base code: 0 to 'A', 1 to 'B' ... 25 to 'Z'. Returns an integer. */ ajint ajBasecodeFromInt (ajint n);
To convert base code into something else:
/* Convert to integer: 'A' to 0, 'B' to 1, ... 'Z' to 25. Returns an integer. */ ajint ajBasecodeToInt (ajint c); /* Convert to binary OR representation of an IUB base. */ ajint ajBaseAlphaToBin (ajint c); /* Convert to binary OR representation of an IUB base. */ char ajBaseAlphacharToBin (char c);
ajBaseAlphaToBin
and ajBaseAlphacharToBin
return a binary OR representation of an IUB base where A=1, C=2, G=4 and T=8. These functions use an internal base table.
To convert a binary OR representation of an IUB base into something else:
/* Converts a binary OR representation of an IUB base to an ambiguous DNA base code. */ char ajBaseBinToAlpha (ajint c);
ajBaseBinToAlpha
converts a binary OR representation of an IUB base where A=1, C=2, G=4 and T=8 into an ambiguous DNA base code (uses T rather than U). It uses the internal base table.
To interconvert single and three letter amino acid codes:
/* Convert single letter to three letter amino acid code. */ AjBool ajResidueToTriplet (char c, AjPStr *Paa3); /* Convert three letter to single letter amino acid code. */ AjBool ajResidueFromTriplet (char *aa4, char *Pc);
Various functions for calculating the properties of nucleotide sequences are provided in ajdan.h/c
. These functions use a DNA melting object (AjPMelt
) but this is transparent to the application programmer, you will never need to instantiate or use the object directly:
AjPMelt
Used for DNA melting.
The AjPMeltInit
and AjPMeltExit
functions are used internally to initialise and clean up. There is no need to call them directly. The AjPMeltGC
function is a utility to calculate G+C content.
/* Initialises melt entropies, enthalpies and energies. Different data files are read for DNA or RNA heteroduplex. Also sets optional flag for array saving of the above. */ void ajMeltInit (AjBool isdna, ajint savesize); /* Cleans up DNA melting processing internal memory */ void ajMeltExit (void); /* Calculates GC fraction of a sequence allowing for ambiguity */ float ajMeltGC (const AjPStr strand, ajint len);
AjPMeltEnergy
calculates melt energy and enthalpy/entropy for a sequence string. AjPMeltEnergy2
is an alternative interface that initialises a set of 'save' arrays to make later calls more efficient by inspecting the sequence only once. AjPMeltTemp
calculates a melting temperature for an oligonucleotide, with a AjPMeltTempSave
version that uses 'savce' arrays for efficiency. AjPMeltTempProd
calculated the melting temperature of an oligonucleotide product under a given salt concentration. AjPMeltTemp
caclulates the annealing temperature of a product and primer, given their respective melting temperatures.
float ajMeltEnergy(const AjPStr strand, ajint len, ajint shift, AjBool isDNA, AjBool maySave, float *enthalpy, float *entropy); float ajMeltEnergy2(const char *strand, ajint pos, ajint len, AjBool isDNA, float *enthalpy, float *entropy, float **saveentr, float **saveenth, float **saveener); float ajMeltTemp (const AjPStr strand, ajint len, ajint shift, float saltconc, float DNAconc, AjBool isDNA); float ajMeltTempSave(const char *strand, ajint pos, ajint len, float saltconc, float DNAconc, AjBool isDNA, float **saveentr, float **saveenth, float **saveener); float ajMeltTempProd (float gc, float saltconc, ajint len); float ajAnneal (float tmprimer, float tmproduct);