10.7. HMMER Wrapper: hmmalign

We'll now look at another HMMER application, hmmalign. Its basic function is to read an HMM profile and a set of sequences, align the sequences to the profile and output a multiple sequence alignment. It is called as follows:

ehmmalign [options] hmmfile seqfile outfile

The set of sequences may be unaligned or aligned. If aligned the existing alignment is ignored and hmmalign will align them in the way it wants.

hmmalign is covered in basic detail because nearly everything that's been said about hmmbuild applies to all the other HMMER applications.

The -outfile parameter is new to EMBASSY HMMER. The multiple sequence alignment is always written to outfile rather than to stdout.

In contrast to hmmbuild the user may specify a USA for sequence input. This is because any alignment is ignored by HMMER, therefore the wrapper can treat the file as unaligned sequences which can be converted if necessary into a format that will be understood by HMMER. The application will make a temporary local copy of its input sequence data. It's down to the user to ensure that there's enough disc space in the directory it's run in.

A few of the original HMMER options are not supported. Again -h is redundant. -informat, -oneline and -outformat were provided for the user to specify the format of the input sequence file and the output alignment. None are needed in the wrapper.

More or less any sequence format will be understood, whereas the alignment format can be specified in the ACD file or by using the inbuilt -aformat command line qualifier.

10.7.1. HMMER Wrapper: hmmalign.acd

The ACD file is very simple. It only contains an input and output section. Input Section

An excerpt from the input section is shown here. Note that an infile is used for the HMM file, whereas a seqset is used for sequence input. As mentioned before, all sequence formats that EMBOSS normally supports are fully supported.

application: ehmmalign 
# EMBOSS wrapper to hmmalign from Sean Eddy's HMMER package 
# v.2.3.2
    documentation: "Align sequences to an HMM profile"
    groups: "HMM"
    gui: "yes"
    batch: "yes"
    cpu: "medium"
    embassy: "hmmernew"

section: input
    information: "Input section"
    type: "page"

infile: hmmfile  
    parameter: "Y"
    information: "HMM file"
    knowntype: "hmm file"
    help: "File containing a HMM profile"

seqset: seqfile
    parameter: "Y"
    type: "gapstopprotein"
    help: "File containing a (set of) sequence(s)"
    aligned: "N"
endsection: input Output Section

The output section is shown here. The only things to point out are that the output file is handled by an align data item and that the format of the alignment is set by the aformat: attribute:

section: output 
    information: "Output section"
    type: "page"

align: o
    parameter: "Y"
    help: "Multiple sequence alignment output file."
    aformat: "fasta"

boolean: m
    additional: "Y"
    default: "N"
    information: "Only show match state alignment symbols."

boolean: q
    additional: "Y"
    default: "N"
    information: "Suppress all output except the alignment."

endsection: output

10.7.2. HMMER Wrapper: ehmmalign.c Documentation Header

The start of the C source code is shown here. The documentation is just the same as it was for hmmbuild.

/* @source ehmmalign application
** EMBOSS wrapper to hmmalign from Sean Eddy's HMMER package v.2.3.2
** Align sequences to an HMM profile.
** @author Copyright (C) Jon Ison 
** @@
** This program is free software; you can redistribute it and/or
** modify it under the terms of the GNU General Public License
** as published by the Free Software Foundation; either version 2
** of the License, or (at your option) any later version.
** This program is distributed in the hope that it will be useful,
** but WITHOUT ANY WARRANTY; without even the implied warranty of
** GNU General Public License for more details.
** You should have received a copy of the GNU General Public License
** along with this program; if not, write to the Free Software
** Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.

#include "emboss.h"

/* @prog ehmmalign ***********************************************************
** EMBOSS wrapper to hmmalign from Sean Eddy's HMMER package v.2.3.2
** Align sequences to an HMM profile.
******************************************************************************/ main() Function, Housekeeping and ACD File Processing

This shows the main() function, the variable declarations, the code to process the ACD file and some housekeeping code.

int main(int argc, char **argv)
    /* ACD data item variables */
    AjPFile    hmmfile = NULL;     
    AjPSeqset  seqfile = NULL;     
    AjPFile     mapali = NULL;     
    AjPFile    withali = NULL;     
    AjPAlign         o = NULL;     
    AjBool           m = ajFalse;
    AjBool           q = ajFalse;

    /* Housekeeping variables */
    AjPStr        cmd = NULL;
    AjPStr        tmp = NULL;
    AjPStr        fmt = NULL;
    AjBool      fmtok = ajFalse;
    AjPStr        rnd = NULL;      
    AjPSeqout    rndo = NULL;      

    /* ACD file processing */

    hmmfile = ajAcdGetInfile("hmmfile");
    seqfile = ajAcdGetSeqset("seqfile");
    mapali  = ajAcdGetInfile("mapali");
    withali = ajAcdGetInfile("withali");
    o       = ajAcdGetAlign("o");
    m       = ajAcdGetBoolean("m");
    q       = ajAcdGetBoolean("q"); 

    /* 1. Housekeeping */
    cmd  = ajStrNew();
    tmp  = ajStrNew();
    fmt  = ajStrNew();
    rnd  = ajStrNew(); File Format Handling

This shows the code required to reformat the input file into a format suitable for HMMER i.e. FASTA. You have to do this because hmmalign only understands FASTA format on input, and besides the sequence may have been specified by a USA which will need transforming into a file of sequences.

We use ajFilenameSetTempname to set an available random file name, and then ajSeqoutOpenFilename to initialise a seqout object with that file name.

The output format is set by using ajSeqoutSetFormatS. Sequences are written by using ajSeqoutWriteSeq. Finally the file is closed by using ajSeqoutClose and the seqout object is deleted:

 /* 2. Re-write seqfile to a temporary file in a format (FASTA) HMMER can understand.
       We cannot just pass the name of seqfile to HMMER as the name provided might be a 
       USA which HMMER would not understand. */
    rnd = ajStrNew();
    rndo = ajSeqoutNew();
    if(!ajSeqoutOpenFilename(rndo, rnd))
        ajFatal("Failed to open file '%S'", rnd);
    ajSeqoutSetFormatC(rndo, "fasta");
    ajSeqoutWriteSet(rndo, seqfile);
    ajSeqoutDel(&rndo); Building the Command line

Here's the code for building the command line. Once again the command line is built in a particular order to make maintenance more easy in the future.

The thing to point out here is that EMBOSS supports certain alignment formats that the original HMMER does not, and HMMER supports certain formats that EMBOSS doesn't (or didn't at the time of writing).

If the user-specified format is not supported then an exception is raised and the format is set to Stockholm. In the future this could be replaced by code to reformat the output file as appropriate.

 /* 3. Build hmmalign command line */
    /* Command line is built in this order: 
       i.  Application name.
       ii. HMMER 'options' (in order they appear in ACD file)
       iii.HMMER 'options' (that don't appear in ACD file)
       iv. HMMER and new parameters.
    ajFmtPrintS(&cmd, "hmmalign ");

        ajFmtPrintAppS(&cmd, " --mapali %s ", ajFileGetNameC(mapali));

        ajFmtPrintAppS(&cmd, " --withali %s ", ajFileGetNameC(withali));

        ajStrAppendC(&cmd, " -m ");

        ajStrAppendC(&cmd, " -q ");

    /* Ensure output alignment is in user-specified format. */
    ajStrAssignS(&fmt, ajAlignGetFormat(o));
    /* fasta and a2m are identical formats. */

    if(ajStrMatchC(fmt, "fasta"))
        ajStrAssignC(&fmt, "A2M");
    else if(ajStrMatchC(fmt, "a2m"))
        ajStrAssignC(&fmt, "A2M");
    else if(ajStrMatchC(fmt, "msf"))
        ajStrAssignC(&fmt, "MSF");
    else if(ajStrMatchC(fmt, "phylip"))
        ajStrAssignC(&fmt, "PHYLIP");
    /* hmmer also supports stockholm, SELEX and Clustal output, EMBOSS does not.
    ** EMBOSS supports unknown/multiple/simple and srs output, hmmer does not.
        fmtok = ajFalse;

        /* This could be replaced with code to reformat the file. */
        ajWarn("Specified output alignment format ('o' ACD option) is "
                "not understood by HMMER.  Using stockholm format instead.");
        ajStrAssignC(&fmt, "Stockholm");
    } Invoking the Application

This shows the code for calling the hmmalign application. Again the call to system() should probably be replaced by one to exec().

You can see that a temporary variable called rnd is used for the name of the rewritten sequence input file. The FASTA format has to be specified explicitly by using the -informat option.

 /* rnd is the name of the rewritten seqfile.  MUST specify FASTA format explicitly. */
    ajFmtPrintAppS(&cmd, " --informat FASTA --outformat %S  -o %s %s %S", 
    /* 4. Close ACD files */

    /* 5. Call hmmalign */
    ajFmtPrint("\n%S\n\n", cmd);

    /* 6. Exit cleanly */
    ajFmtPrintS(&tmp, "rm %S", rnd);


    return 0;