We'll now look at another HMMER application, hmmalign. Its basic function is to read an HMM profile and a set of sequences, align the sequences to the profile and output a multiple sequence alignment. It is called as follows:
ehmmalign [ |
The set of sequences may be unaligned or aligned. If aligned the existing alignment is ignored and hmmalign will align them in the way it wants.
hmmalign is covered in basic detail because nearly everything that's been said about hmmbuild applies to all the other HMMER applications.
The -outfile
parameter is new to EMBASSY HMMER. The multiple sequence alignment is always written to outfile
rather than to stdout
.
In contrast to hmmbuild the user may specify a USA for sequence input. This is because any alignment is ignored by HMMER, therefore the wrapper can treat the file as unaligned sequences which can be converted if necessary into a format that will be understood by HMMER. The application will make a temporary local copy of its input sequence data. It's down to the user to ensure that there's enough disc space in the directory it's run in.
A few of the original HMMER options are not supported. Again -h
is redundant. -informat
, -oneline
and -outformat
were provided for the user to specify the format of the input sequence file and the output alignment. None are needed in the wrapper.
More or less any sequence format will be understood, whereas the alignment format can be specified in the ACD file or by using the inbuilt -aformat
command line qualifier.
The ACD file is very simple. It only contains an input and output section.
An excerpt from the input section is shown here. Note that an infile
is used for the HMM file, whereas a seqset
is used for sequence input. As mentioned before, all sequence formats that EMBOSS normally supports are fully supported.
application: ehmmalign [ # EMBOSS wrapper to hmmalign from Sean Eddy's HMMER package # v.2.3.2 documentation: "Align sequences to an HMM profile" groups: "HMM" gui: "yes" batch: "yes" cpu: "medium" embassy: "hmmernew" ] section: input [ information: "Input section" type: "page" ] infile: hmmfile [ parameter: "Y" information: "HMM file" knowntype: "hmm file" help: "File containing a HMM profile" ] seqset: seqfile [ parameter: "Y" type: "gapstopprotein" help: "File containing a (set of) sequence(s)" aligned: "N" ] ... endsection: input
The output section is shown here. The only things to point out are that the output file is handled by an align
data item and that the format of the alignment is set by the aformat:
attribute:
section: output [ information: "Output section" type: "page" ] align: o [ parameter: "Y" help: "Multiple sequence alignment output file." aformat: "fasta" ] boolean: m [ additional: "Y" default: "N" information: "Only show match state alignment symbols." ] boolean: q [ additional: "Y" default: "N" information: "Suppress all output except the alignment." ] endsection: output
The start of the C source code is shown here. The documentation is just the same as it was for hmmbuild.
/* @source ehmmalign application ** ** EMBOSS wrapper to hmmalign from Sean Eddy's HMMER package v.2.3.2 ** Align sequences to an HMM profile. ** ** @author Copyright (C) Jon Ison ** @@ ** ** This program is free software; you can redistribute it and/or ** modify it under the terms of the GNU General Public License ** as published by the Free Software Foundation; either version 2 ** of the License, or (at your option) any later version. ** ** This program is distributed in the hope that it will be useful, ** but WITHOUT ANY WARRANTY; without even the implied warranty of ** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ** GNU General Public License for more details. ** ** You should have received a copy of the GNU General Public License ** along with this program; if not, write to the Free Software ** Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. ******************************************************************************/ #include "emboss.h" /* @prog ehmmalign *********************************************************** ** ** EMBOSS wrapper to hmmalign from Sean Eddy's HMMER package v.2.3.2 ** Align sequences to an HMM profile. ** ******************************************************************************/
This shows the main()
function, the variable declarations, the code to process the ACD file and some housekeeping code.
int main(int argc, char **argv) { /* ACD data item variables */ AjPFile hmmfile = NULL; AjPSeqset seqfile = NULL; AjPFile mapali = NULL; AjPFile withali = NULL; AjPAlign o = NULL; AjBool m = ajFalse; AjBool q = ajFalse; /* Housekeeping variables */ AjPStr cmd = NULL; AjPStr tmp = NULL; AjPStr fmt = NULL; AjBool fmtok = ajFalse; AjPStr rnd = NULL; AjPSeqout rndo = NULL; /* ACD file processing */ embInitP("ehmmalign",argc,argv,"HMMERNEW"); hmmfile = ajAcdGetInfile("hmmfile"); seqfile = ajAcdGetSeqset("seqfile"); mapali = ajAcdGetInfile("mapali"); withali = ajAcdGetInfile("withali"); o = ajAcdGetAlign("o"); m = ajAcdGetBoolean("m"); q = ajAcdGetBoolean("q"); /* MAIN APPLICATION CODE */ /* 1. Housekeeping */ cmd = ajStrNew(); tmp = ajStrNew(); fmt = ajStrNew(); rnd = ajStrNew();
This shows the code required to reformat the input file into a format suitable for HMMER i.e. FASTA. You have to do this because hmmalign only understands FASTA format on input, and besides the sequence may have been specified by a USA which will need transforming into a file of sequences.
We use ajFilenameSetTempname
to set an available random file name, and then ajSeqoutOpenFilename
to initialise a seqout
object with that file name.
The output format is set by using ajSeqoutSetFormatS
. Sequences are written by using ajSeqoutWriteSeq
. Finally the file is closed by using ajSeqoutClose
and the seqout
object is deleted:
/* 2. Re-write seqfile to a temporary file in a format (FASTA) HMMER can understand. We cannot just pass the name of seqfile to HMMER as the name provided might be a USA which HMMER would not understand. */ rnd = ajStrNew(); ajFilenameSetTempname(&rnd); rndo = ajSeqoutNew(); if(!ajSeqoutOpenFilename(rndo, rnd)) ajFatal("Failed to open file '%S'", rnd); ajSeqoutSetFormatC(rndo, "fasta"); ajSeqoutWriteSet(rndo, seqfile); ajSeqoutClose(rndo); ajSeqoutDel(&rndo);
Here's the code for building the command line. Once again the command line is built in a particular order to make maintenance more easy in the future.
The thing to point out here is that EMBOSS supports certain alignment formats that the original HMMER does not, and HMMER supports certain formats that EMBOSS doesn't (or didn't at the time of writing).
If the user-specified format is not supported then an exception is raised and the format is set to Stockholm. In the future this could be replaced by code to reformat the output file as appropriate.
/* 3. Build hmmalign command line */ /* Command line is built in this order: i. Application name. ii. HMMER 'options' (in order they appear in ACD file) iii.HMMER 'options' (that don't appear in ACD file) iv. HMMER and new parameters. */ ajFmtPrintS(&cmd, "hmmalign "); if(mapali) ajFmtPrintAppS(&cmd, " --mapali %s ", ajFileGetNameC(mapali)); if(withali) ajFmtPrintAppS(&cmd, " --withali %s ", ajFileGetNameC(withali)); if(m) ajStrAppendC(&cmd, " -m "); if(q) ajStrAppendC(&cmd, " -q "); /* Ensure output alignment is in user-specified format. */ fmtok=ajTrue; ajStrAssignS(&fmt, ajAlignGetFormat(o)); /* fasta and a2m are identical formats. */ if(ajStrMatchC(fmt, "fasta")) ajStrAssignC(&fmt, "A2M"); else if(ajStrMatchC(fmt, "a2m")) ajStrAssignC(&fmt, "A2M"); else if(ajStrMatchC(fmt, "msf")) ajStrAssignC(&fmt, "MSF"); else if(ajStrMatchC(fmt, "phylip")) ajStrAssignC(&fmt, "PHYLIP"); /* hmmer also supports stockholm, SELEX and Clustal output, EMBOSS does not. ** EMBOSS supports unknown/multiple/simple and srs output, hmmer does not. */ else fmtok = ajFalse; if(!fmtok) { /* This could be replaced with code to reformat the file. */ ajWarn("Specified output alignment format ('o' ACD option) is " "not understood by HMMER. Using stockholm format instead."); ajStrAssignC(&fmt, "Stockholm"); }
This shows the code for calling the hmmalign application. Again the call to system()
should probably be replaced by one to exec()
.
You can see that a temporary variable called rnd
is used for the name of the rewritten sequence input file. The FASTA format has to be specified explicitly by using the -informat
option.
/* rnd is the name of the rewritten seqfile. MUST specify FASTA format explicitly. */ ajFmtPrintAppS(&cmd, " --informat FASTA --outformat %S -o %s %s %S", fmt, ajAlignGetFilename(o), ajFileGetNameC(hmmfile), rnd); /* 4. Close ACD files */ ajFileClose(&hmmfile); ajSeqsetDel(&seqfile); ajFileClose(&mapali); ajFileClose(&withali); ajAlignDel(&o); /* 5. Call hmmalign */ ajFmtPrint("\n%S\n\n", cmd); system(ajStrGetPtr(cmd)); /* 6. Exit cleanly */ ajFmtPrintS(&tmp, "rm %S", rnd); system(ajStrGetPtr(tmp)); ajStrDel(&cmd); ajStrDel(&tmp); ajStrDel(&fmt); ajStrDel(&rnd); embExit(); return 0; }