Matrices are commonly used in molecular sequence analysis to compare sequence characters at the same position in two or more aligned sequences.
Matrix objects are created by reading a matrix local data file, either through ACD (where the user has a choice of files), or by directly reading a named file where the filename is fixed.
EMBOSS includes sets of comparison matrix files in the data directory which can be used as examples when creating new files.
Matrix objects come in two very similar forms, using integers (AjPMatrix) for speed, and floating point numbers (AjPMatrixf) for flexibility. Both types include a set of column labels (usually sequence characters) and a set of row labels, which usually match the column labels (asymmetric matrices are used in some applications). There is also a matrix size value for the rows and for the columns. The numbers in the data file become a two-dimensional table of comparison values.
For sequence alignments, functions are provided that use a matrix object to align two sequences.
Applications that analyse sequence alignments, for example prettyplot, can directly use the conversion table and character codes in a matrix to look up comparison scores using an AjPSeqCvt object.
AJAX library files for handling matrices are listed in the table (Table 6.15, “AJAX Library Files for Handling Matrices”). Library file documentation, including a complete description of datatypes and functions, is available at:
http://emboss.open-bio.org/rel/dev/libs/
Library File Documentation | Description |
---|---|
ajmatrices | Comparison matrix handling functions |
ajmatrices.h/c. Defines the AjPMatrix and AjPMatrixf objects and functions for handling comparison matrices.
There are two datatypes for handling comparison matrix input:
matrix
Input integer comparison matrix.
matrixf
Input floating point comparison matrix.
There are two datatypes for handling comparison matrix output:
outmatrix
Output integer comparison matrix.
outmatrixf
Output floating point comparison matrix.
Typical ACD definitions for comparison matrix input:

    # Integer matrix (input)
    matrix: matrix [
      information: "Matrix file"
      protein: "$(acdprotein)"
    ]

    # Floating point matrix (input)
    matrixf: matrixf [
      information: "Matrix file"
      protein: "$(acdprotein)"
    ]

Typical ACD definitions for comparison matrix output:

    # Integer matrix (output)
    outmatrix: outmatrix [
      information: "Matrix file"
      protein: "$(acdprotein)"
    ]

    # Floating point matrix (output)
    outmatrixf: outmatrixf [
      information: "Matrix file"
      protein: "$(acdprotein)"
    ]

All data definitions for comparison matrix input and output should have a standard parameter name, which is matrix. For further information see Appendix A, ACD Syntax Reference.
Attributes that are typically specified are summarised below. They are datatype-specific (Section A.5, “Datatype-specific Attributes”) unless they are indicated as being global attributes (Section A.4, “Global Attributes”).
information:
A global attribute. It specifies the user prompt and is used in the application documentation.
protein:
A boolean attribute which if set specifies that the matrix is for proteins. If not set the matrix is presumed to be for nucleic acids.
For handling comparison matrices, including input matrices defined in the ACD file, use:
AjPMatrix
Integer comparison matrix (for matrix ACD datatype).
AjPMatrixf
Floating point comparison matrix (for matrixf ACD datatype).
For handling comparison matrix output use:
AjPOutfile
General output file (for outmatrix and outmatrixf ACD datatypes).
It is sometimes necessary to convert a sequence into numerical form for convenient processing. The AJAX datatype for this is:
AjPSeqCvt
Used for sequence conversion into numerical form.
Datatypes and functions for handling comparison matrices via the ACD file are shown below (Table 6.16, “Datatypes and Functions for Comparison Matrix Input and Output”).
ACD datatype | AJAX datatype | To retrieve from ACD |
---|---|---|
Comparison Matrix Input | | |
matrix | AjPMatrix | ajAcdGetMatrix |
matrixf | AjPMatrixf | ajAcdGetMatrixf |
Comparison Matrix Output | | |
outmatrix | AjPOutfile | ajAcdGetOutmatrix |
outmatrixf | AjPOutfile | ajAcdGetOutmatrixf |
Your application code will call embInit to process the ACD file and command line (see Section 6.3, “Handling ACD Files”). All values from the ACD file are read into memory and files are opened as necessary. You have a handle on the files and memory through the ajAcdGet* family of functions, which return pointers to appropriate objects.
To retrieve a comparison matrix, an object pointer is declared and then initialised using the appropriate ajAcdGet* function.
To retrieve an output comparison matrix, an object pointer is declared and initialised using the appropriate ajAcdGet* function.

    AjPOutfile outmatrix = NULL;

    outmatrix = ajAcdGetOutmatrix("outmatrix");

Currently there are no functions for this.
It is your responsibility to close any files and free up memory at the end of the program.
You must close the output file for any outmatrix or outmatrixf definitions in the ACD file by calling ajOutfileClose with the address of the output file object:

    AjPOutfile outmatrix  = NULL;
    AjPOutfile outmatrixf = NULL;

    outmatrix  = ajAcdGetOutmatrix("outmatrix");
    outmatrixf = ajAcdGetOutmatrixf("outmatrixf");

    /* Do something with matrices */

    ajOutfileClose(&outmatrix);
    ajOutfileClose(&outmatrixf);

Matrix objects are usually created through an ACD definition, reading a named EMBOSS local data file. To create a matrix object directly (perhaps where there is no choice of matrix filename) the object must be constructed from a given local data file by calling:

    AjBool ajMatrixNewFile  (AjPMatrix* pthis, const AjPStr filename);
    AjBool ajMatrixfNewFile (AjPMatrixf* pthis, const AjPStr filename);

The functions take the name of the data file to open. The file must be found in the EMBOSS data path, including the current directory and the installed data files.
All constructors return the address of a new object. The pointers do not need to be initialised to NULL but it is good practice to do so:

    AjPMatrix  intmatrix   = NULL;
    AjPMatrixf floatmatrix = NULL;
    AjPStr     filename    = NULL;

    filename    = ajStrNewC("EBLOSUM62");
    intmatrix   = ajMatrixNewFile(filename);
    floatmatrix = ajMatrixfNewFile(filename);

You must close any output files and free the memory for your objects once you are finished with them.
To close an output file (AjPOutfile) call ajOutfileClose, or call ajFileClose for general file objects (AjPFile):

    void ajFileClose    (AjPFile* Pfile);
    void ajOutfileClose (AjPOutfile* Pfile);

The objects are freed by calling the destructor functions:

    void ajMatrixDel  (AjPMatrix *thys);
    void ajMatrixfDel (AjPMatrixf *thys);

They are used as follows:

    AjPMatrix  matrix  = NULL;
    AjPMatrixf matrixf = NULL;

    matrix  = ajAcdGetMatrix("matrix");
    matrixf = ajAcdGetMatrixf("matrixf");

    /* Do something with matrices */

    ajMatrixDel(&matrix);
    ajMatrixfDel(&matrixf);

Internally, the matrix object constructor functions are:

    AjPMatrix  ajMatrixNew  (const AjPPStr codes, ajint n, const AjPStr filename);
    AjPMatrixf ajMatrixfNew (const AjPPStr codes, ajint n, const AjPStr filename);

These will create a new matrix with values initialised to zero. The functions take the matrix name (filename), a string (codes) containing characters for the matrix labels, and an integer (n) that is the number of labels. If the matrix is a residue substitution matrix then the string should contain defined sequence characters.
The matrices that are created by ajMatrixNew and ajMatrixfNew are square, having the same number of rows and columns. To create a matrix with an unequal number of rows and columns call:

    AjPMatrix  ajMatrixNewAsym  (const AjPPStr codes, ajint n,
                                 const AjPPStr rcodes, ajint rn,
                                 const AjPStr filename);
    AjPMatrixf ajMatrixfNewAsym (const AjPPStr codes, ajint n,
                                 const AjPPStr rcodes, ajint rn,
                                 const AjPStr filename);

These will create a new matrix with values initialised to zero. The functions take the matrix name (filename), two strings (codes and rcodes) containing characters for the matrix column and row labels respectively, and two integers (n and rn) that are the number of column and row labels.
EMBOSS requires all matrix objects to be loaded from data files. No functions are provided to add or change the matrix object values. The following functions read from an EMBOSS data file. For more information on EMBOSS data files, see the EMBOSS Users Guide.
A matrix can be constructed from a given local data file by calling:

    AjBool ajMatrixNewFile  (AjPMatrix* pthis, const AjPStr filename);
    AjBool ajMatrixfNewFile (AjPMatrixf* pthis, const AjPStr filename);

The functions take the name of the data file to open.
Most elements of a matrix object can be retrieved by calling one of the ajMatrixGet* or ajMatrixfGet* functions:
To return the comparison matrix as an array of integer or floating point arrays call:

    AjIntArray*   ajMatrixGetMatrix  (const AjPMatrix thys);
    AjFloatArray* ajMatrixfGetMatrix (const AjPMatrixf thys);

Sequence characters are indexed in this array using the internal sequence conversion table in the matrix. AjIntArray and AjFloatArray are defined in ajdefine.h as arrays of C-type integer and floating point numbers:

    typedef float* AjFloatArray;
    typedef int*   AjIntArray;

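
As a rough illustration of the array-of-arrays layout described above, the following standalone C sketch allocates a zero-filled table in which each row is a plain C array of integers. It does not use the AJAX library: `IntArray`, `score_table_new` and `score_table_free` are hypothetical names invented for this example, and the real matrix objects manage their own memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the typedef quoted above: one row is a plain C array of int. */
typedef int* IntArray;

/* Allocate a rows x cols table with every score set to zero.
 * Returns NULL if any allocation fails. (Hypothetical helper.) */
IntArray* score_table_new(size_t rows, size_t cols)
{
    IntArray* table;
    size_t i;

    assert(rows > 0 && cols > 0);

    table = calloc(rows, sizeof *table);
    if (table == NULL)
        return NULL;

    for (i = 0; i < rows; i++) {
        table[i] = calloc(cols, sizeof **table);
        if (table[i] == NULL) {
            /* Undo the rows allocated so far before giving up. */
            while (i > 0)
                free(table[--i]);
            free(table);
            return NULL;
        }
    }
    return table;
}

/* Free a table made by score_table_new and reset the caller's pointer. */
void score_table_free(IntArray** ptable, size_t rows)
{
    size_t i;

    if (ptable == NULL || *ptable == NULL)
        return;
    for (i = 0; i < rows; i++)
        free((*ptable)[i]);
    free(*ptable);
    *ptable = NULL;
}
```

With a table `t` returned by `score_table_new(2, 3)`, the expression `t[1][2]` addresses row 1, column 2. In AJAX code the two indices would come from the matrix's conversion table rather than being hard-coded.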
To return the label (sequence character or the column name for an asymmetric matrix) for a matrix row or column in position i call:

    AjPStr ajMatrixGetLabelNum  (const AjPMatrix thys, ajint i);
    AjPStr ajMatrixfGetLabelNum (const AjPMatrixf thys, ajint i);

To return the sequence character conversion table for a matrix call:

    AjPSeqCvt ajMatrixGetCvt  (const AjPMatrix thys);
    AjPSeqCvt ajMatrixfGetCvt (const AjPMatrixf thys);

This table converts any character defined in the matrix to a positive integer, and any other character is converted to zero.
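
The behaviour of such a conversion table can be sketched in standalone C as a 256-entry lookup array. This is only an illustration of the idea, not the AJAX implementation: `CVT_SIZE`, `cvt_build` and `cvt_code` are hypothetical names, and mapping both letter cases to the same code is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define CVT_SIZE 256   /* one slot per possible unsigned char value */

/* Fill table so that the character at offset i of codes maps to i + 1
 * and every other character maps to 0. Both letter cases are mapped
 * (an assumption of this sketch). */
void cvt_build(int table[CVT_SIZE], const char* codes)
{
    size_t i;

    assert(table != NULL && codes != NULL);

    memset(table, 0, CVT_SIZE * sizeof table[0]);
    for (i = 0; codes[i] != '\0'; i++) {
        unsigned char ch = (unsigned char) codes[i];
        table[toupper(ch)] = (int) i + 1;
        table[tolower(ch)] = (int) i + 1;
    }
}

/* Look up the numeric code for one character; 0 means "not defined". */
int cvt_code(const int table[CVT_SIZE], char ch)
{
    return table[(unsigned char) ch];
}
```

For the codes string "ACGT", this sketch gives 'A' the code 1 and 't' the code 4, while 'N', which is not in the string, gets 0.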
To return the character codes for each offset in the matrix call:

    AjPStr ajMatrixGetCodes  (const AjPMatrix thys);
    AjPStr ajMatrixfGetCodes (const AjPMatrixf thys);

To return the name of a matrix object (which typically is the filename from which it was read), call:

    const AjPStr ajMatrixGetName  (const AjPMatrix thys);
    const AjPStr ajMatrixfGetName (const AjPMatrixf thys);

To return the comparison matrix size (or the number of columns for an asymmetric matrix) call:

    ajuint ajMatrixGetSize  (const AjPMatrix thys);
    ajuint ajMatrixfGetSize (const AjPMatrixf thys);

For an asymmetric matrix the number of rows can be returned by calling:

    ajuint ajMatrixGetRows  (const AjPMatrix thys);
    ajuint ajMatrixfGetRows (const AjPMatrixf thys);

To convert a sequence to index numbers using the matrix's internal conversion table call:

    AjBool ajMatrixSeqIndex  (const AjPMatrix thys, const AjPSeq seq,
                              AjPStr* numseq);
    AjBool ajMatrixfSeqIndex (const AjPMatrixf thys, const AjPSeq seq,
                              AjPStr* numseq);

Sequence characters not defined in the matrix are converted to zero.
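
A minimal standalone sketch of this kind of conversion, assuming the matrix labels are single characters held in one string. It is not the AJAX implementation: `seq_to_index` is a hypothetical helper, it writes plain `int` values rather than filling an `AjPStr`, and it is case-sensitive for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write one index number per character of seq into out, which must have
 * room for at least strlen(seq) entries. A character found at offset i
 * of codes becomes i + 1; a character not in codes becomes 0.
 * Returns the number of entries written. (Hypothetical helper.) */
size_t seq_to_index(const char* codes, const char* seq, int* out)
{
    size_t i;

    assert(codes != NULL && seq != NULL && out != NULL);

    for (i = 0; seq[i] != '\0'; i++) {
        const char* hit = strchr(codes, seq[i]);
        out[i] = (hit != NULL) ? (int) (hit - codes) + 1 : 0;
    }
    return i;
}
```

For the codes string "ACGT", the sequence "GATNA" is converted to 3, 1, 4, 0, 1: the undefined character 'N' becomes zero.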
These functions handle sequence conversion objects. The basic constructor ajSeqcvtNewStr uses an array of strings as the column labels. For sequence comparison matrices these strings will be one character each. The other constructors renumber the codes for the specific expectations of some older legacy code and are not recommended for general use. The most useful functions are those which return the numeric code for a base or residue, and are frequently used to look up a sequence character in a conversion table.

    /* constructors with base codes as a string */
    AjPSeqCvt ajSeqcvtNewStr     (const AjPPStr bases, ajint n);
    AjPSeqCvt ajSeqcvtNewC       (const char* bases);
    AjPSeqCvt ajSeqcvtNewNumberC (const char* bases);
    AjPSeqCvt ajSeqcvtNewEndC    (const char* bases);

    /* asymmetric conversion table constructor */
    AjPSeqCvt ajSeqcvtNewStrAsym (const AjPPStr bases, ajint n,
                                  const AjPPStr rbases, ajint rn);

    /* destructor */
    void ajSeqcvtDel (AjPSeqCvt* thys);

    /* return conversion table length */
    ajuint ajSeqcvtGetLen (const AjPSeqCvt thys);

    /* return numeric code for a residue code (matrix column label) */
    ajint ajSeqcvtGetCodeK (const AjPSeqCvt thys, char ch);
    ajint ajSeqcvtGetCodeS (const AjPSeqCvt thys, const AjPStr ch);

    /* return numeric code for a column or row in an asymmetric matrix */
    ajint ajSeqcvtGetCodeAsymS    (const AjPSeqCvt cvt, const AjPStr str);
    ajint ajSeqcvtGetCodeAsymrowS (const AjPSeqCvt cvt, const AjPStr str);