The basic function of hmmbuild is to read a multiple sequence alignment file, build a new profile HMM and save the HMM to file. It is called as follows:
ehmmbuild [ |
By default the model is configured to find one or more non-overlapping alignments with the complete model: multiple global alignments with respect to the model and local with respect to the sequence. Various other alignment strategies can be set by using the appropriate option.
One limitation is that the user must provide the full filename of an alignment for the alignfile
option and not an indirect reference to a set of sequences, so a USA (see the EMBOSS Users Guide) is not acceptable. This is because hmmbuild (which ehmmbuild wraps) requires an alignment and does not support USAs.
Differences between the wrapper and the original software are as follows:
Note that the alignfile
parameter (used for input) and the hmmfile
parameter (used for output) are specified in the reverse order in the original HMMER. This is because input parameters are always specified before output parameters in EMBOSS.
Several of the original HMMER options are not supported. -help
is used instead of the -h
HMMER option, and the three HMMER options for setting the alignment strategy are replaced with the -strategy
qualifier which is a list in ACD.
An option is not needed to specify whether or not to append to the output file as this can be set in the ACD file by using the append:
attribute. Arguably this should be a supported option.
The -F
option, to force an existing HMM file to be overwritten, is always set because an application should always generate output if possible and EMBOSS trusts its users to specify files correctly.
The -amino
and -nucleic
options are not needed because they are set in the ACD file by using the type:
attribute.
-informat
is not supported because it would confuse the user, who should not have to specify the format of input files.
The 6 options for specifying the sequence weighting algorithm are all handled by a single -weighting
qualifier defined as a list in the ACD file. This is purely for convenience.
Finally, -verbosity
is used instead of -verbose
.
The start of the ACD file is shown below. Text for the help:
attribute is not shown but is given in the ACD files.
You can see that the alignfile
option is handled by a seqset
sequence input type. As mentioned before, this has to be an alignment file and not a USA referring indirectly to a set of sequences. This limitation could be overcome by first translating the USA into a local file, but this wasn't done for this version of the wrapper:
application: ehmmbuild [ # EMBOSS wrapper to hmmbuild from Sean Eddy's HMMER package # v.2.3.2 documentation: "Build a profile HMM from an alignment." groups: "HMM" gui: "yes" batch: "yes" cpu: "medium" embassy: "hmmernew" ] section: input [ information: "Input section" type: "page" ] seqset: alignfile [ # User must provide the full filename of an alignment, not an # indirect reference to a set of sequences, e.g. a USA is NOT # acceptable. parameter: "Y" type: "gapstopprotein" aligned: "Y" ] ...
There are then three infile
ACD definitions to handle various HMMER input files. All of these are advanced ACD options not normally set by the user. Note that a NULL
default value is set for them which requires the nullok:
attribute to be specified and set to True
.
infile: prior [ # Advanced input file information: "Dirichlet prior file." knowntype: "dirichlet prior" default: "" nullok: "Y" ] infile: null [ # Advanced input file information: "NULL model file" knowntype: "hmmer null model" default: "" nullok: "Y" ] infile: pam [ # Advanced input file information: "PAM file" knowntype: "hmmer matrix file" default: "" nullok: "Y" ] float: pamwgt [ default: "20.0" information: "Weighting for PAM." ] endsection: input
The required section is shown below. As you can see it is quite sparse. It contains a string to specify the name of the HMM and a list which is used to set the alignment strategy. The list replaces 3 individual HMMER options. There is also a default setting so this list has 4 entries in total:
section: required [ information: "Required section" type: "page" ] string: n [ standard: "Y" default: "" information: "Name for this HMM." word: "Y" knowntype: "name" ] list: strategy [ standard: "Y" default: "D" minimum: "1" maximum: "1" values: "D:global-multidomain,F:local-multidomain,G:global-singledomain,S:local-singledomain" delimiter: "," codedelimiter: ":" header: "Alignment preference" information: "Select preference" button: "Y" ] endsection: required
The bulk of the HMMER options are defined as "expert" options in the original HMMER documentation and so are given in the advanced section of the ACD file. These options are not normally set by the user and a default value, taken from the HMMER documentation, is given:
section: advanced [ information: "Advanced section" type: "page" ] integer: pbswitch [ default: "1000" information: "Threshold to switch to position-based weights." ] float: archpri [ default: "0.85" information: "Architecture prior" ] boolean: binary [ default: "N" information: "Write HMM as binary." ] boolean: fast [ default: "N" information: "Work in fast mode" ] float: gapmax [ default: "0.5" information: "Fast mode control" ] boolean: hand [ default: "N" information: "Specify model by hand." ] float: sidlevel [ default: "0.62" information: "Cutoff ID threhold" ]
The sequence weighting algorithm is also specified as an advanced ACD qualifier. This one list replaces the 6 command line options given in the original HMMER:
boolean: noeff [ default: "N" information: "Turn off the effective sequence number calculation." ] float: swentry [ default: "0.5" information: "Probability control for local entries" ] float: swexit [ default: "0.5" information: "Probability control for exits" ] boolean: verbosity [ default: "N" information: "Verbosity." ] list: weighting [ default: "G" minimum: "1" maximum: "1" values: "B:Blosum, G:Gerstein/Sonnhammer/Chothia, K:Krogh/Mitchison, W:Henikoff, V:Sibbald/Argos Voronoi, N:None" delimiter: "," codedelimiter: ":" header: "Weighting method" information: "Select weighting" button: "Y" ] endsection: advanced
The output section is shown here. This contains the new parameter defined for the HMM output file, which was written directly to stdout, and two other output files used by HMMER.
section: output: [ information: "Output section" type: "page" ] outfile: hmmfile [ parameter: "Y" knowntype: "hmm file" append: "Y" ] outfile: o [ nullok: "Yes" nulldefault: "Yes" information: "Resave starting alignment." knowntype: "selex file" ] outfile: cfile [ nullok: "Yes" nulldefault: "Yes" information: "Emission and transition count file" knowntype: "hmmer count file" ] endsection: output
The start of the file of C source code is shown below. This just shows the standard documentation that should be given for any EMBOSS application. There is also a line (#include emboss.h
) to import the AJAX and NUCLEUS library interfaces:
/* @source ehmmbuild application ** ** EMBOSS wrapper to hmmbuild from Sean Eddy's HMMER package v.2.3.2 ** Build a profle HMM from an alignment. ** ** @author Copyright (C) Jon Ison ** @@ ** ** This program is free software; you can redistribute it and/or ** modify it under the terms of the GNU General Public License ** as published by the Free Software Foundation; either version 2 ** of the License, or (at your option) any later version. ** ** This program is distributed in the hope that it will be useful, ** but WITHOUT ANY WARRANTY; without even the implied warranty of ** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ** GNU General Public License for more details. ** ** You should have received a copy of the GNU General Public License ** along with this program; if not, write to the Free Software ** Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. ******************************************************************************/ #include "emboss.h"
This shows the main()
function and the variable declarations. All the variables for handling the ACD data items have the same name as the corresponding qualifier. This style is not enforced but it is recommended because it makes reading the source code much easier.
Housekeeping variables are given sensible names. All variables are initialised to NULL
or zero. It is good practice to do that, in fact dangerous not to, because some parts of the EMBOSS libraries assume that non-NULL
pointers have had memory assigned to them; if there is a junk value assigned to them at run time then you may be heading for a segmentation fault.
/* @prog ehmmbuild *********************************************************** ** ** EMBOSS wrapper to hmmbuild from Sean Eddy's HMMER package v.2.3.2 ** Build a profle HMM from an alignment. ** ******************************************************************************/ int main(int argc, char **argv) { /* ACD data item variables */ AjPSeqset alignfile = NULL; AjPFile prior = NULL; AjPFile null = NULL; AjPFile pam = NULL; float pamwgt = 0.0; AjPStr n = NULL; AjPStr *strategy = NULL; ajint pbswitch = 0; float archpri = 0.0; AjBool binary = ajFalse; AjBool fast = ajFalse; float gapmax = 0.0; AjBool hand = ajFalse; float idlevel = 0.0; AjBool noeff = ajFalse; float swentry = 0.0; float swexit = 0.0; AjBool verbosity = ajFalse; AjPStr *weighting = NULL; AjPFile hmmfile = NULL; AjPFile o = NULL; AjPFile cfile = NULL; /* Housekeeping variables */ AjPStr cmd = NULL; AjPStr rnd1 = NULL; AjPStr rnd2 = NULL; AjPStr tmp = NULL; AjPStr fmt = NULL; char option; AjBool fmtok = ajFalse; AjPStr hmmfilename = NULL;
The code below shows the function calls for processing the ACD file. embInitP
processes the ACD file and prompts the user for any required values that are not specified on the command line. The prefix ajAcdGet
family of functions are used to retrieve values from the ACD data definitions and store them in the variables defined earlier:
/* ACD file processing */ embInitP("ehmmbuild",argc,argv,"HMMERNEW"); alignfile = ajAcdGetSeqset("alignfile"); prior = ajAcdGetInfile("prior"); null = ajAcdGetInfile("null"); pam = ajAcdGetInfile("pam"); pamwgt = ajAcdGetFloat("pamwgt"); n = ajAcdGetString("n"); strategy = ajAcdGetList("strategy"); pbswitch = ajAcdGetInt("pbswitch"); archpri = ajAcdGetFloat("archpri"); binary = ajAcdGetBoolean("binary"); fast = ajAcdGetBoolean("fast"); gapmax = ajAcdGetFloat("gapmax"); hand = ajAcdGetBoolean("hand"); idlevel = ajAcdGetFloat("sidlevel"); noeff = ajAcdGetBoolean("noeff"); swentry = ajAcdGetFloat("swentry"); swexit = ajAcdGetFloat("swexit"); verbosity = ajAcdGetBoolean("verbosity"); weighting = ajAcdGetList("weighting"); hmmfile = ajAcdGetOutfile("hmmfile"); o = ajAcdGetOutfile("o"); cfile = ajAcdGetOutfile("cfile");
The start of the application code proper is shown below. First of all there is some housekeeping code. Then there is a block of code to check that the sequence alignment input file is in a format that HMMER can understand. An exception is raised if an unsupported format is specified. This could be replaced in the future with code to reformat the alignment file into an appropriate format. At the time of writing, it was not fully tested whether all alignment formats, including SELEX and STOCKHOLM, could be interconverted without any loss of data or annotation, so the safe option was chosen:
/* MAIN APPLICATION CODE */ /* 1. Housekeeping */ cmd = ajStrNew(); rnd1 = ajStrNew(); rnd2 = ajStrNew(); tmp = ajStrNew(); fmt = ajStrNew(); hmmfilename = ajStrNew(); ajStrAssignC(&hmmfilename, ajFileGetNameC(hmmfile)); /* 2. Ensure alignfile is in format HMMER can understand. These include FASTA, GENBANK,EMBL, GCG, PIR, STOCKHOLM, SELEX, MSF,CLUSTAL and PHYLIP. EMBOSS name definitions are taken from seqInFormatDef in ajseqread.c and seqOutFormat in ajseqwrite.c */ fmtok=ajFalse; ajStrAssignS(&fmt, ajSeqsetGetFormat(alignfile)); if(ajStrMatchC(fmt, "fasta") || ajStrMatchC(fmt, "genbank") || ajStrMatchC(fmt, "embl") || ajStrMatchC(fmt, "gcg") || ajStrMatchC(fmt, "pir") || ajStrMatchC(fmt, "stockholm")|| ajStrMatchC(fmt, "selex") || ajStrMatchC(fmt, "msf") || ajStrMatchC(fmt, "clustal") || ajStrMatchC(fmt, "phylip")) fmtok = ajTrue; /* This could be replaced with code to reformat the file. */ if(!fmtok) ajFatal("Input alignment ('alignfile' ACD option) is not in a format " "HMMER understands. Please use a file in FASTA, GENBANK, " "EMBL, GCG, PIR, STOCKHOLM, SELEX, MSF,CLUSTAL or PHYLIP format.");
The first part of the code for building the command line is shown below. The command line is constructed in a specific order to make updating the wrapper for new releases easier. First the application name is pasted into a string, then the original HMMER options are given in the order they appear in the ACD file. Next the HMMER options that do not have any parallel in the ACD file are given. Finally, new parameters and options that are specific to the EMBASSY wrapper are given:
/* 3. Build hmmbuild command line */ /* Command line is built in this order: i. Application name. ii. HMMER 'options' (in order they appear in ACD file) iii.HMMER 'options' (that don't appear in ACD file) iv. HMMER and new parameters. */ ajFmtPrintS(&cmd, "hmmbuild "); if(prior) ajFmtPrintAppS(&cmd, " --prior %s ", ajFileGetNameC(prior)); if(null) ajFmtPrintS(&cmd, " --null %s ", ajFileGetNameC(null)); if(pam) ajFmtPrintAppS(&cmd, " --pam %s --pamwgt %f ", ajFileGetNameC(pam), pamwgt); ajFmtPrintAppS(&cmd, " -n %S ", n); /* ACD option only allows one selection */ option = ajStrGetCharFirst(strategy[0]); if(option == 'F') ajStrAppendC(&cmd, " -f "); else if(option == 'G') ajStrAppendC(&cmd, " -g "); else if(option == 'S') ajStrAppendC(&cmd, " -s "); /* else go with default ('D' option in ACD file) */ ajFmtPrintAppS(&cmd, " --pbswitch %d ", pbswitch); ajFmtPrintAppS(&cmd, " --archpri %f ", archpri); if(binary) ajStrAppendC(&cmd, " --binary "); if(fast) ajFmtPrintAppS(&cmd, " --fast --gapmax %f ", gapmax); if(hand) ajStrAppendC(&cmd, " --hand "); ajFmtPrintAppS(&cmd, " --idlevel %f ", idlevel); if(noeff) ajStrAppendC(&cmd, " --noeff "); ajFmtPrintAppS(&cmd, " --swentry %f ", swentry); ajFmtPrintAppS(&cmd, " --swexit %f ", swexit); if(verbosity) ajStrAppendC(&cmd, " --verbose ");
The rest of the code for building the command line is below. The only thing to point out is that the append
option is always set. This means that output should always be appended to whatever is given in the specified output file. EMBOSS clears its output files by default though, so for this to work the append:
attribute of the hmmfile
ACD data item must be set to True
:
/* ACD option only allows one selection */ option = ajStrGetCharFirst(weighting[0]); if(option == 'B') ajStrAppendC(&cmd, " --wblosum "); else if(option == 'G') ajStrAppendC(&cmd, " --wgsc "); else if(option == 'K') ajStrAppendC(&cmd, " --wme "); else if(option == 'W') ajStrAppendC(&cmd, " --wpb "); else if(option == 'V') ajStrAppendC(&cmd, " --wvoronoi "); else if(option == 'N') ajStrAppendC(&cmd, " --wnone "); if(o) ajFmtPrintAppS(&cmd, " -o %s ", ajFileGetNameC(o)); if(cfile) ajFmtPrintAppS(&cmd, " --cfile %s ", ajFileGetNameC(cfile)); /* -A (append) always set but file will be wiped by EMBOSS first unless ** append: "Y" is set for "hmmfile" in the ACD file. */ ajStrAppendC(&cmd, " -A -F "); ajFmtPrintAppS(&cmd, " %S %S", hmmfilename, ajSeqsetGetFilename(alignfile));
The code below shows the system call to invoke the hmmbuild application using the command line just constructed. Note that system()
is used here but that should probably be replaced with a call to exec()
for reasons explained earlier. There is also some housekeeping code for memory management to ensure that the application can close cleanly:
/* 4. Close ACD files */ ajSeqsetDel(&alignfile); ajFileClose(&prior); ajFileClose(&null); ajFileClose(&pam); ajFileClose(&hmmfile); ajFileClose(&o); ajFileClose(&cfile); /* 5. Call hmmbuild */ ajFmtPrint("\n%S\n", cmd); system(ajStrGetPtr(cmd)); /* 6. Exit cleanly */ ajStrDel(&n); ajStrDel(&cmd); ajStrDel(&rnd1); ajStrDel(&rnd2); ajStrDel(&tmp); ajStrDel(&fmt); ajStrDel(&hmmfilename); embExit(); return 0; }