The EMBOSS application matcher generates pairwise local alignments given either DNA or protein sequences. It has different options depending on the type of input. When wrapping such programs, for instance for incorporation into a graphical user interface, it is often convenient to split them into separate nucleotide and protein applications with the same application code underneath. SRS is a good example of a system which wraps EMBOSS applications in this way.
To deepen your knowledge, use matcher as a template to produce new, separate applications for protein (matcherpro) and nucleotide (matchernuc) sequences.
You need to change how the matcher functionality is presented to the user, therefore you'll need two new ACD files and a copy of the matcher source code for each. Both new applications will effectively use the same source code with only trivial differences.
The steps to create these applications are as follows:
Create the application source code (matcherpro.c
and matchernuc.c
files) in:
.../embassy/myemboss/src |
Copy the source code for matcher from .../emboss/matcher.c
.
Create the application ACD files (matcherpro.acd
and matchernuc.acd
) in:
.../embassy/myemboss/emboss_acd |
Use .../emboss/matcher.acd as a template.
Add the applications to the myemboss
package by updating the two Makefile.am
files:
.../embassy/myemboss/src/Makefile.am |
.../embassy/myemboss/emboss_acd/Makefile.am |
Compile and test the applications.
The original matcher.acd
is shown below (help:
attributes are not shown):
application: matcher [ documentation: "Waterman-Eggert local alignment of two sequences" groups: "Alignment:Local" ] section: input [ information: "Input section" type: "page" ] sequence: asequence [ parameter: "Y" type: "any" ] sequence: bsequence [ parameter: "Y" type: "@($(acdprotein) ? stopprotein : nucleotide)" ] matrix: datafile [ additional: "Y" information: "Matrix file" protein: "$(acdprotein)" ] endsection: input section: additional [ information: "Additional section" type: "page" ] integer: alternatives [ additional: "Y" information: "Number of alternative matches" default: "1" minimum: "1" ] integer: gapopen [ additional: "Y" information: "Gap penalty" default: "@($(acdprotein)? 14 : 16)" minimum: "0" valid: "Positive integer" expected: "14 for protein, 16 for nucleic" ] integer: gapextend [ additional: "Y" information: "Gap length penalty" default: "@($(acdprotein)? 4 : 4)" minimum: "0" valid: "Positive integer" expected: "4 for any sequence" ] endsection: additional section: output [ information: "Output section" type: "page" ] align: outfile [ parameter: "Y" aformat: "markx0" minseqs: "2" maxseqs: "2" ] endsection: output
The ACD file introduces several new concepts:
The groups:
attribute in the application definition assigns the application to a group (see Section 4.2, “Application Definition”).
The sequence
ACD datatype is used to define two input sequences; asequence
and bsequence
.
The sequence type of asequence
is set by the type:
attribute, in this case to "any
", i.e. any type of sequence is acceptable.
For bsequence
, the sequence type is calculated from the ACD variable acdprotein
; if acdprotein
is true
then type:
is set to stopprotein
, otherwise it's set to nucleotide
.
acdprotein
is an "automatic ACD variable" with a boolean
type whose value is set automatically when the first sequence is read in. So, if the first sequence is a protein, then acdprotein
will be true
. Automatic ACD variables are described in detail elsewhere (Section 4.4, “Operations”).
The matrix
datatype is used to define a substitution matrix (called matrixfile
). EMBOSS will search for this data file in the EMBOSS data directory (see the EMBOSS Users Guide).
information:
is used to set a user-prompt for some of the data items. It is not needed for the sequence inputs (EMBOSS will automatically ghenerate a suitable prompt) but can be given for the other types used (see Section 4.3, “Data Definition”).
Qualifiers and parameters in the ACD file are organised into sections (input
, additional
and output
). These help to tidy the ACD file and are exploited by user interfaces (see Section 4.3, “Data Definition”).
Options in the additional
section are defined to be 'additional qualifiers' by the attribute additional: "Y"
. Values for additional qualifiers are not prompted for (the default value will be used instead) unless -options
is given on the command line, which will turn prompting on for these qualifiers (see Section 4.1, “Introduction to ACD File Development” for more information.)
Two gap penalties (gapopen
and gapextend
) are defined as integer
ACD types. The minimum:
, valid:
and expected
attributes are used to set minimum and expected values and a corresponding message to the user.
There is a single output, a sequence alignment (outfile
) which is defined by the type align
. The format (markx0
) and minimum and maximum number of sequences (2 in both cases, i.e. a pairwise alignment) are set using the attributes aformat:
, minseqs:
and maxseqs
respectively.
The changes necessary for matcherpro.acd
are:
The application name should be changed to matcherpro
.
The documentation:
attribute should state that the application works on protein sequences only.
The type:
attribute of the first input sequence should be changed from any
to protein
.
The type
of the second input sequence should be stopprotein
.
The residue substitution matrix should be of type protein. Currently this is given as protein: "$(acdprotein)"
which means that the protein:
attribute will be set to true
if the first sequence is a protein. $(acdprotein)
should be replaced with y
.
All other occurrences of lines containing acdprotein
should be replaced with the protein term; the first value in such lines containing ACD @
function calls.
The parts of matcherpro.acd
which have been modified and differ from matcher.acd
are shown below:
application: matcherpro [ documentation: "Waterman-Eggert local alignment of two sequences" groups: "Alignment:Local" ] ... lines omitted sequence: asequence [ parameter: "Y" type: "protein" ] sequence: bsequence [ parameter: "Y" type: "stopprotein" ] ... lines omitted matrix: datafile [ additional: "Y" information: "Matrix file" protein: "Y" ] ... lines omitted integer: gapopen [ additional: "Y" information: "Gap penalty" default: "14" minimum: "0" valid: "Positive integer" expected: "14" ] integer: gapextend [ additional: "Y" information: "Gap length penalty" default: "4" minimum: "0" valid: "Positive integer" expected: "4 for any sequence" ] ... lines omitted
The main()
function for matcher is shown below. The application includes several functions and macros that are not shown:
#include "emboss.h" /* @prog matcher ************************************************************** ** ** Finds the best local alignments between two sequences ** ******************************************************************************/ int main(int argc, char **argv) { AjPStr aa0str = 0; AjPStr aa1str = 0; const char *s1; const char *s2; ajint gdelval; ajint ggapval; ajuint i; ajint K; AjPAlign align = NULL; embInit("matcher", argc, argv); seq = ajAcdGetSeq("asequence"); ajSeqTrim(seq); seq2 = ajAcdGetSeq("bsequence"); ajSeqTrim(seq2); matrix = ajAcdGetMatrix("datafile"); K = ajAcdGetInt("alternatives"); gdelval = ajAcdGetInt("gapopen"); ggapval = ajAcdGetInt("gapextend"); align = ajAcdGetAlign("outfile"); /* create sequence indices. i.e. A->0, B->1 ... Z->25 etc. This is done so that ajBasecodeToInt has only to be done once for each residue in the sequence */ ajSeqFmtUpper(seq); ajSeqFmtUpper(seq2); s1 = ajStrGetPtr(ajSeqGetSeqS(seq)); s2 = ajStrGetPtr(ajSeqGetSeqS(seq2)); sub = ajMatrixGetMatrix(matrix); cvt = ajMatrixGetCvt(matrix); aa0str = ajStrNewRes(2+ajSeqGetLen(seq)); /* length + blank + trailing null */ aa1str = ajStrNewRes(2+ajSeqGetLen(seq2)); ajStrAppendK(&aa0str,' '); ajStrAppendK(&aa1str,' '); for(i=0;i<ajSeqGetLen(seq);i++) ajStrAppendK(&aa0str,(char)ajSeqcvtGetCodeK(cvt, *s1++)); for(i=0;i<ajSeqGetLen(seq2);i++) ajStrAppendK(&aa1str,ajSeqcvtGetCodeK(cvt, *s2++)); matcher_Sim(align, ajStrGetPtr(aa0str),ajStrGetPtr(aa1str), seq,seq2, K,(gdelval-ggapval),ggapval, ajSeqGetOffset(seq), ajSeqGetOffset(seq2), 2); ajStrDel(&aa0str); ajStrDel(&aa1str); ajSeqDel(&seq); ajSeqDel(&seq2); embExit(); return 0; }
matcher.c
introduces many new concepts:
Two AJAX strings (aa0str
and aa1str
) are defined using AjPStr
.
An alignment object (align
) is created using AjPAlign
.
The values from the ACD file are picked up by calls beginning with ajAcdGet
. The functions are named to make the types returned obvious, ajAcdGetSeq
returns sequences, ajAcdGetMatrix
returns a matrix and so on.
The next block of code creates the sequences indices. You needn't concern yourself with the details. Functions for this include ajMatrixGetMatrix
, ajMatrixGetCvt
and ajSeqcvtGetCodeK
(see the library documentation for an explanation). The function matcher_Sim
(not shown) is also called and is a function local to matcher.
Strings and sequences that were allocated are freed by calling ajStrDel
and ajSeqDel
.
The application then exits cleanly by calling embExit
.
One small edit is needed in the src/matcherpro.c
file:
Change the embInit
call to embInitP
and add "myemboss"
as the last argument so EMBOSS knows you are in the myemboss
package
Change the program name in the embInitP
call to "matcherpro"
so that it uses the new ACD file.
Change the program name in the main
function comment call to "matcherpro"
to match the program name.
The parts of matcherpro.c
which have been modified and differ from matcher.c
are shown below:
/* @prog matcher **************************************************************; embInitP("matcherpro", argc, argv, "myemboss");
To compile and install the application from the myemboss
directory, type:
make |
Or, if you are using myemboss as part of a fully installed EMBOSS system then type:
make install |
matcherpro has just been added to your system, therefore you may need to make it visible in your path, for example, by running the UNIX command rehash
, if you use a csh
shell.
To check all is well, you can run:
acdvalid matcherpro |
and check that no errors or warnings are reported. As matcher.acd
has already been validated it should be fine.
You should now have a new application matcherpro which should produce the same results as matcher with protein input, but will refuse to run with DNA sequences. This is shown below:
%
matcherpro
Finds the best local alignments between two protein sequences Input protein sequence: embl:L07770 Error: Sequence is not a protein Error: Unable to read sequence 'embl:L07770' Input protein sequence: ...
Here it is being used correctly:
%
matcherpro
Finds the best local alignments between two protein sequences Input protein sequence: swissprot:UBR5_RAT Second protein sequence: ...
From the above it should be obvious what's required to implement matchernuc.acd
and matchernuc.c
, so generating two distinct applications, one for nucleotide and one for protein sequences.
If you are familiar with EMBOSS, you might notice that the output format of matcherpro is not the same as water, another popular alignment program. This could be confusing to users, so as a final modification you can make matcherpro consistent with water. To do this, set the aformat:
attribute of the outfile
data definition, which currently has the value "markx0"
to the value ("srspair"
) defined in the water.acd
file.
A similar effect could be achieved by running matcherpro with -aformat srspair
on the command line. If for some reason the old format was required, matcherpro could be run with -aformat markx0
on the command line.