6.6. Handling Sequence Patterns

6.6.1. Introduction

Regular expressions and other patterns are used to identify motifs in molecular sequences.

6.6.2. AJAX Library Files

AJAX library files for handling patterns are listed in the table (Table 6.7, “AJAX Library Files for Handling Patterns”). Library file documentation, including a complete description of datatypes and functions with usage notes, is available at:

http://emboss.open-bio.org/rel/dev/libs/

Table 6.7. AJAX Library Files for Handling Patterns

Library File Documentation    Description
ajreg                         Regular expression handling
ajpat                         Pattern handling

ajreg.h/c

Defines the regular expression object (AjPRegexp) and functions for handling regular expressions.

ajpat.h/c

Defines the sequence pattern list object (AjPPatlistSeq), the general pattern list object (AjPPatlistRegex) and functions for handling lists of sequence patterns and regular expression patterns.

The library files also contain static data structures and functions for handling sequence patterns at a low level.

You are unlikely to need the static data structures and functions unless you plan to implement code to extend the functionality of the libraries themselves.

In addition to the above library files, EMBOSS includes a library of functions to support regular expressions whose syntax and semantics are as close as possible to those of the Perl 5 language. The library files are:

pcre_config.h  
pcre_internal.h  
pcre.h           
pcre.c             
pcreposix.h  
pcreposix.c  
pcre_printint.c
pcre_chartables.c  
pcre_get.c     
pcre_study.c

They are not described here and you should see the online library documentation for further information.
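
As a rough illustration of how a regular expression locates a motif in a sequence, the sketch below uses the POSIX <regex.h> interface found on most Unix systems. It does not use the bundled PCRE files or the AjPRegexp object, and POSIX syntax lacks some Perl 5 features. The helper name find_motif is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of EMBOSS): return the 0-based offset of
 * the first match of the POSIX extended regular expression 'pattern' in
 * 'seq', or -1 if there is no match or the pattern does not compile. */
long find_motif(const char *seq, const char *pattern)
{
    regex_t re;
    regmatch_t m[1];
    long pos = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* m[0] receives the span of the whole match */
    if (regexec(&re, seq, 1, m, 0) == 0)
        pos = (long) m[0].rm_so;

    regfree(&re);
    return pos;
}
```

For example, find_motif("MKNGSAT", "N[^P][ST][^P]") returns 2, the offset of the N that starts the motif.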

6.6.3. ACD Datatypes

There are two datatypes for handling pattern input:

regexp

A regular expression pattern.

pattern

A sequence pattern.

6.6.4. ACD Data Definition

Typical ACD definitions are shown below.

6.6.4.1. regexp

Regular expression input:

regexp: patterns 
[
    information: "Regular expression patterns"
    upper: "Y"
    minlength: "3"
    maxlength: "10"
]

6.6.4.2. pattern

Pattern input:

pattern: seqpatterns 
[
    information: "Sequence patterns"
    type: "nucleotide"
    pmismatch: "0"
]

6.6.4.3. Parameter Name

All data definitions for pattern or regular expression input should have the standard parameter name pattern (see ).

6.6.4.4. Common Attributes

Attributes that are typically specified are summarised below. These include various datatype-specific attributes (Section A.5, “Datatype-specific Attributes”).

minlength: Specifies the minimum length of a regular expression. This is used to ensure an expression has been defined.

maxlength: Specifies the maximum length of a regular expression.

maxsize: Specifies the maximum number of patterns that will be read from a file.

upper: Sets the case of a regular expression to uppercase. lower: is also available (only one should be given).

type: Defines a regular expression or pattern to be nucleotide or protein, which allows pattern matching to use sequence ambiguity codes. By default the type is string, which uses exact character matching.

pmismatch: Sets the number of mismatches allowed in a pattern. Mismatches are not supported by regular expression algorithms, so this attribute is only available for sequence patterns.

pname: Sets a pattern name to be used in output. Multiple input patterns have numbers appended to this name.

pformat: Defines the pattern or regular expression input format. The default is to use the string as the pattern, or (as for sequence input) to read a string in the form @filename as a file of patterns. If the format is defined as 'fasta', each pattern in the file has a FASTA-style identifier line with an optional mismatch=nn term to set the number of mismatches for that pattern (mismatches are not applicable to regular expressions). The default format is simply one pattern per line with a name automatically defined using the -pname qualifier.
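
To make the effect of the type and pmismatch attributes concrete, the following is a minimal sketch in plain C of matching a nucleotide pattern with IUPAC ambiguity codes while tolerating a fixed number of mismatches. It is not the algorithm used by the AJAX library. The helper names iupac_mask and find_with_mismatches are hypothetical, and the scan is a simple sliding window chosen for clarity rather than speed.

```c
#include <ctype.h>
#include <string.h>

/* Bit flags for the four bases. */
enum { BASE_A = 1, BASE_C = 2, BASE_G = 4, BASE_T = 8 };

/* Map an IUPAC nucleotide code to the set of bases it stands for.
 * Unknown characters map to the empty set and so never match. */
static unsigned iupac_mask(char c)
{
    switch (toupper((unsigned char) c)) {
    case 'A': return BASE_A;
    case 'C': return BASE_C;
    case 'G': return BASE_G;
    case 'T': case 'U': return BASE_T;
    case 'R': return BASE_A | BASE_G;
    case 'Y': return BASE_C | BASE_T;
    case 'S': return BASE_C | BASE_G;
    case 'W': return BASE_A | BASE_T;
    case 'K': return BASE_G | BASE_T;
    case 'M': return BASE_A | BASE_C;
    case 'B': return BASE_C | BASE_G | BASE_T;
    case 'D': return BASE_A | BASE_G | BASE_T;
    case 'H': return BASE_A | BASE_C | BASE_T;
    case 'V': return BASE_A | BASE_C | BASE_G;
    case 'N': return BASE_A | BASE_C | BASE_G | BASE_T;
    default:  return 0;
    }
}

/* Hypothetical helper (not part of EMBOSS): return the 0-based offset of
 * the first window of 'seq' that matches 'pat' with at most 'maxmismatch'
 * mismatching positions, or -1 if no window qualifies. Two codes match
 * when the sets of bases they stand for overlap. */
long find_with_mismatches(const char *seq, const char *pat,
                          unsigned maxmismatch)
{
    size_t slen = strlen(seq);
    size_t plen = strlen(pat);
    size_t i, j;

    if (plen == 0 || plen > slen)
        return -1;

    for (i = 0; i + plen <= slen; i++) {
        unsigned mismatches = 0;

        /* stop scanning this window once the budget is exceeded */
        for (j = 0; j < plen && mismatches <= maxmismatch; j++)
            if ((iupac_mask(seq[i + j]) & iupac_mask(pat[j])) == 0)
                mismatches++;

        if (mismatches <= maxmismatch)
            return (long) i;
    }
    return -1;
}
```

With the pattern GAYC, the sequence TTGATCAA matches exactly at offset 2 because Y stands for C or T. The sequence TTGAACAA matches only when one mismatch is allowed.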

6.6.5. AJAX Datatypes

The basic AJAX datatype for handling patterns is:

AjPRegexp

Regular expression.

Two AJAX datatypes are provided for handling lists of patterns, including input patterns defined in the ACD file. These are lists of individual patterns and regular expressions using the datatypes below:

AjPPatlistSeq

Sequence pattern list object holding a list of sequence patterns and associated information (for pattern ACD datatype).

AjPPatlistRegex

General pattern list object holding a list of general patterns and associated information (for regexp ACD datatype).

The following AJAX datatypes are provided for handling regular expressions at a low level beyond that provided by the static data structures and functions. You are unlikely to need to use these directly:

AjPPatComp

All required data for compiling and searching.

AjPPatternSeq

Definition of feature pattern.

AjPPatternRegex

Holds definition of feature pattern.

6.6.6. ACD File Handling

Datatypes and functions for handling pattern input via the ACD file are shown below (Table 6.8, “Datatypes and Functions for Pattern Input”).

Table 6.8. Datatypes and Functions for Pattern Input

                        To read a regular expression    To read a sequence pattern
ACD datatype            regexp                          pattern
AJAX datatype           AjPPatlistRegex                 AjPPatlistSeq
To retrieve from ACD    ajAcdGetRegexp                  ajAcdGetPattern

Your application code will call embInit to process the ACD file and command line (see Section 6.3, “Handling ACD Files”). All values from the ACD file are read into memory and files are opened as necessary. You have a handle on the files and memory through the ajAcdGet* family of functions which return pointers to appropriate objects.

6.6.6.1. Input Pattern Retrieval

To retrieve an input pattern, an object pointer is declared and then initialised using the appropriate ajAcdGet* function.

6.6.6.1.1. regexp

    AjPPatlistRegex patterns = NULL;

    patterns = ajAcdGetRegexp("patterns");

6.6.6.1.2. pattern

    AjPPatlistSeq seqpatterns = NULL;

    seqpatterns = ajAcdGetPattern("seqpatterns");

6.6.6.2. Alternative ACD Retrieval Functions

In cases where just a single regular expression is required, ajAcdGetRegexpSingle is used to return a single AjPRegexp object:

AjPRegexp     ajAcdGetRegexpSingle (const char *token);

It is used as follows:

    AjPRegexp regexp=NULL;

    regexp = ajAcdGetRegexpSingle("patterns");

The memory for any other patterns or regular expressions is automatically cleared by a final call to embExit.

6.6.6.3. Processing Command line Options and ACD Attributes

Currently there are no functions for this.

6.6.6.4. Memory Management

It is your responsibility to close any files and free up memory at the end of the program. You must call the default destructor function (see below) on any objects returned by calls to ajAcdGetRegexp or ajAcdGetPattern.

6.6.7. Pattern Object Memory Management

6.6.7.1. Default Object Construction

To use a pattern object that is not defined in the ACD file you must first instantiate the appropriate object pointer. The default constructor functions are:

/* Create a new regular expression pattern list. */
AjPPatlistRegex ajPatlistRegexNew (void);

/* Create a new sequence pattern list. */
AjPPatlistSeq   ajPatlistSeqNew (void);

/* Create a new compiled pattern object. */
AjPPatComp      ajPatCompNew (void);

All constructors return the address of a new object. The pointers do not need to be initialised to NULL but it is good practice to do so:

    AjPPatlistRegex   expression = NULL;
    AjPPatlistSeq     pattern    = NULL;

    expression = ajPatlistRegexNew(); 
    pattern    = ajPatlistSeqNew();
    /* The objects are instantiated and ready for use */

6.6.7.2. Default Object Destruction

You must free the memory for an object before the pointer is re-used and also once you are finished with it. The default destructor functions are:

/* Delete a regular expression pattern list. */
void ajPatlistRegexDel (AjPPatlistRegex* pthys);

/* Delete a sequence pattern list. */
void ajPatlistSeqDel (AjPPatlistSeq* pthys);

Destructors are also available for individual sequence patterns and compiled pattern objects:

void ajPatternSeqDel (AjPPatternSeq* pthys);
void ajPatCompDel (AjPPatComp* pthys);

They are used as follows:

    AjPPatlistSeq plist = NULL;
    AjPPatlistRegex rlist = NULL;
    AjPRegexp patexp = NULL;
    AjPSeqall seqall = NULL;
    AjPSeq seq = NULL;

    plist   = ajAcdGetPattern("pattern");
    rlist  = ajAcdGetRegexp("regexp");

    seqall = ajAcdGetSeqall("sequence");

    while(ajSeqallNext(seqall, &seq))
    {


......


    }
    
    ajPatlistSeqDel(&plist);
    ajPatlistRegexDel(&rlist);
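
The destructor prototypes take the address of the caller's pointer rather than the pointer itself. The sketch below shows that constructor and destructor idiom in plain C with a stand-in type. DemoPatlist, demo_patlist_new and demo_patlist_del are hypothetical names. It is an assumption here, not confirmed by this section, that the AJAX destructors reset the pointer to NULL in the same way.

```c
#include <stdlib.h>

/* Stand-in object used only for illustration; NOT an AJAX type. */
typedef struct DemoPatlist {
    unsigned size;  /* number of patterns held */
} DemoPatlist;

/* Constructor: returns the address of a new, zeroed object
 * (NULL if the allocation fails). */
DemoPatlist *demo_patlist_new(void)
{
    return calloc(1, sizeof(DemoPatlist));
}

/* Destructor: takes the address of the caller's pointer so that it can
 * free the object and then reset the pointer, which guards against
 * accidental re-use. Calling it on an already-cleared pointer is a
 * harmless no-op. */
void demo_patlist_del(DemoPatlist **pthys)
{
    if (pthys == NULL || *pthys == NULL)
        return;

    free(*pthys);
    *pthys = NULL;
}
```

Because the pointer is cleared, a second call such as demo_patlist_del(&list) after the first one does nothing, so cleanup code can be written without tracking which objects have already been freed.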

6.6.7.3. Alternative Object Construction and Loading

These functions are used internally in processing lists of patterns and regular expressions. They are available for developers to use when working with lists.

/* Constructor for a sequence pattern list object (adds a pattern to a supplied list). */
AjPPatternSeq   ajPatternSeqNewList (AjPPatlistSeq plist, const AjPStr name,
                                     const AjPStr pat, ajuint mismatch);
AjPPatternRegex  ajPatternRegexNewList (AjPPatlistRegex plist,
                                        const AjPStr name,
                                        const AjPStr pat);

/* Constructor for a sequence pattern list object (parameter is true for a protein pattern) */
AjPPatlistSeq  ajPatlistSeqNewType (AjBool Protein);

/* Constructor for a pattern list object with a specified type */
AjPPatlistRegex  ajPatlistRegexNewType (ajuint type);

/* Removes current pattern from pattern list. */
void  ajPatlistSeqRemoveCurrent (AjPPatlistSeq thys);
void  ajPatlistRegexRemoveCurrent (AjPPatlistRegex thys);

/* Adds pattern into patternlist */
void  ajPatlistAddRegex (AjPPatlistRegex thys, AjPPatternRegex pat);
void  ajPatlistAddSeq (AjPPatlistSeq thys, AjPPatternSeq pat);

6.6.8. Read Functions

These functions are used in ACD processing to convert a pattern qualifier value (a pattern or a filename reference) into a pattern or regular expression list. They can be used to define patterns from strings or files:

/* Parses a file of regular expressions into a pattern list object. */
AjPPatlistRegex  ajPatlistRegexRead(const AjPStr patspec,
                                    const AjPStr patname,
                                    const AjPStr fmt,
                                    ajuint type, 
                                    AjBool upper, AjBool lower);

/* Parses a file into pattern list object. */
AjPPatlistSeq  ajPatlistSeqRead(const AjPStr patspec,
                                const AjPStr patname,
                                const AjPStr fmt,
                                AjBool protein, 
                                ajuint mismatches);

/* Resets the pattern list iteration. */
void  ajPatlistSeqRewind (AjPPatlistSeq thys);
void  ajPatlistRegexRewind (AjPPatlistRegex thys);
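
The two conventions a pattern specification can carry, the @filename form and the optional mismatch=nn term on a FASTA-style identifier line, can be sketched in plain C as follows. This is only an illustration of the conventions described for the pformat attribute, not the parsing code used by ajPatlistSeqRead. The helper names patspec_filename and header_mismatch are hypothetical, and the exact layout of the identifier line is an assumption.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of EMBOSS): if 'spec' has the form
 * @filename, return a pointer to the filename part; otherwise return
 * NULL, meaning the string itself is the pattern. */
const char *patspec_filename(const char *spec)
{
    if (spec != NULL && spec[0] == '@' && spec[1] != '\0')
        return spec + 1;
    return NULL;
}

/* Hypothetical helper (not part of EMBOSS): extract nn from a
 * "mismatch=nn" term in an identifier line, or return 'fallback' when
 * the term is absent or is not followed by a digit. */
unsigned header_mismatch(const char *line, unsigned fallback)
{
    static const char key[] = "mismatch=";
    const char *p = strstr(line, key);

    if (p == NULL)
        return fallback;

    p += sizeof key - 1;            /* skip past "mismatch=" */
    if (!isdigit((unsigned char) *p))
        return fallback;

    return (unsigned) strtoul(p, NULL, 10);
}
```

In this sketch the fallback argument plays the role of a default mismatch count that applies to any pattern whose identifier line does not set its own value.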

6.6.9. Getting Elements of Objects

Applications work at the level of lists of patterns and regular expressions. For these, functions are available to return the number of items in the list, and to return the next item.

/* Gets next available pattern from list. */
AjBool  ajPatlistSeqGetNext (AjPPatlistSeq thys,
                             AjPPatternSeq* pattern);

/* Gets number of patterns from list. */
ajuint  ajPatlistSeqGetSize (const AjPPatlistSeq plist);

/* Gets number of patterns from list. */
ajuint  ajPatlistRegexGetSize (const AjPPatlistRegex plist);

/* Gets next available pattern from list. */
AjBool  ajPatlistRegexGetNext (AjPPatlistRegex thys,
                               AjPPatternRegex* pattern);
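
The get-next and rewind prototypes suggest a list that keeps its own iteration cursor. The sketch below models that protocol in plain C with a stand-in type, so that the calling pattern (loop until get-next returns false, rewind before a second pass) can be seen in isolation. DemoList and the demo_list_* functions are hypothetical and are not the AJAX implementation. The behaviour at the end of the list is an assumption.

```c
#include <stddef.h>

#define DEMO_MAX 8

/* Stand-in pattern list with an internal cursor; NOT the AJAX type. */
typedef struct DemoList {
    const char *items[DEMO_MAX];
    size_t size;    /* number of stored patterns             */
    size_t cursor;  /* index of the next pattern to hand out */
} DemoList;

/* Append a pattern; returns 0 if the list is full, 1 otherwise. */
int demo_list_add(DemoList *list, const char *pat)
{
    if (list->size >= DEMO_MAX)
        return 0;
    list->items[list->size++] = pat;
    return 1;
}

/* Hand out the next pattern through 'pattern'; returns 0 once every
 * pattern has been visited, leaving 'pattern' untouched. */
int demo_list_get_next(DemoList *list, const char **pattern)
{
    if (list->cursor >= list->size)
        return 0;
    *pattern = list->items[list->cursor++];
    return 1;
}

/* Reset the cursor so that iteration starts again from the first item. */
void demo_list_rewind(DemoList *list)
{
    list->cursor = 0;
}

/* Count the patterns visited by one complete pass over the list. */
size_t demo_list_count_pass(DemoList *list)
{
    const char *pat = NULL;
    size_t n = 0;

    demo_list_rewind(list);
    while (demo_list_get_next(list, &pat))
        n++;
    return n;
}
```

When the same list is applied to several input sequences in turn, the rewind call before each pass is what allows every sequence to see all of the patterns.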

There are further functions to return elements of each item retrieved from a list:

/* Returns the name of the pattern. */
const AjPStr  ajPatternSeqGetName (const AjPPatternSeq thys);

/* Returns pattern in string format. */
const AjPStr  ajPatternSeqGetPattern (const AjPPatternSeq thys);

/* Returns pointer to compiled pattern. */
AjPPatComp  ajPatternSeqGetCompiled (const AjPPatternSeq thys);

/* Returns true if the pattern is for a protein sequence. */
AjBool  ajPatternSeqGetProtein (const AjPPatternSeq thys);

/* Returns the mismatch of the pattern. */
ajuint  ajPatternSeqGetMismatch (const AjPPatternSeq thys);

/* Returns the name of the pattern */
const AjPStr  ajPatternRegexGetName (const AjPPatternRegex thys);

/* Returns pattern in string format. */
const AjPStr  ajPatternRegexGetPattern (const AjPPatternRegex thys);

/* Returns pointer to compiled regular expression. */
AjPRegexp  ajPatternRegexGetCompiled (const AjPPatternRegex thys);

/* Returns the type of the pattern. */
ajuint  ajPatternRegexGetType (const AjPPatternRegex thys);

6.6.10. Setting Elements of Objects

There are no elements that should be set within the pattern list objects.

6.6.11. Debugging Functions

These functions report the internals of pattern objects to the debug file when -debug is set on the command line.

void  ajPatternSeqDebug (const AjPPatternSeq pat);
void  ajPatternRegexDebug (const AjPPatternRegex pat);

6.6.12. Miscellaneous Functions

These functions copy a list of patterns to a string for use in debugging:

ajuint  ajPatlistRegexDoc (AjPPatlistRegex thys, AjPStr* pdoc);

ajuint  ajPatlistSeqDoc (AjPPatlistSeq thys, AjPStr* pdoc);

There is also a function to return the type associated with a named type of regular expression:

ajuint  ajPatternRegexType (const AjPStr type);