Regular expressions and other patterns are used to identify motifs in molecular sequences.
AJAX library files for handling patterns are listed in the table (Table 6.7, “AJAX Library Files for Handling Patterns”). Library file documentation, including a complete description of datatypes and functions with usage notes, is available at:
http://emboss.open-bio.org/rel/dev/libs/
Library File Documentation | Description |
---|---|
ajreg | Regular expression handling |
ajpat | Pattern handling |
ajreg.h/c. Defines the regular expression object (AjPRegexp) and functions for handling regular expressions.
ajpat.h/c. Defines the sequence pattern list object (AjPPatlistSeq), the general pattern list object (AjPPatlistRegex), and functions for handling lists of regular expression patterns.
These files also contain static data structures and functions for handling sequence patterns at a low level.
You are unlikely to need the static data structures and functions unless you plan to implement code to extend the functionality of the libraries themselves.
In addition to the above library files, EMBOSS includes a library of functions to support regular expressions whose syntax and semantics are as close as possible to those of the Perl 5 language. The library files are:
pcre_config.h pcre_internal.h pcre.h pcre.c pcreposix.h pcreposix.c pcre_printint.c pcre_chartables.c pcre_get.c pcre_study.c
They are not described here; see the online library documentation for further information.
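As a rough illustration of regular expression motif searching outside of AJAX, the sketch below uses the POSIX regcomp/regexec interface from the system <regex.h>, which is the style of interface that pcreposix.h provides on top of PCRE. It is not EMBOSS code: the helper name find_motif is hypothetical, and POSIX extended syntax covers only part of the Perl 5 syntax that PCRE supports.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of AJAX or PCRE): finds the first match of
 * a POSIX extended regular expression in seq.
 * Returns the 0-based start offset and stores the match length in *len,
 * or returns -1 if there is no match or the expression does not compile. */
long find_motif(const char *expr, const char *seq, size_t *len)
{
    regex_t re;
    regmatch_t m[1];
    long start = -1;

    assert(expr != NULL && seq != NULL && len != NULL);

    /* Compile the expression; a non-zero result means a syntax error. */
    if (regcomp(&re, expr, REG_EXTENDED) != 0)
        return -1;

    /* m[0] receives the offsets of the whole match. */
    if (regexec(&re, seq, 1, m, 0) == 0) {
        start = (long) m[0].rm_so;
        *len = (size_t) (m[0].rm_eo - m[0].rm_so);
    }

    regfree(&re);
    return start;
}
```

For example, find_motif("N[^P][ST][^P]", "MKNGSAP", &len) reports a match at offset 2 with a length of 4, while the same expression finds nothing in "MKNPSAP".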
There are two ACD datatypes for handling pattern input: regexp for regular expressions and pattern for sequence patterns.
Typical ACD definitions are shown below.
Regular expression input:
regexp: patterns [
  information: "Regular expression patterns"
  upper: "Y"
  minlength: "3"
  maxlength: "10"
]
Pattern input:
pattern: seqpatterns [
  information: "Sequence patterns"
  type: "nucleotide"
  pmismatch: "0"
]
All data definitions for pattern or regular expression input should have the standard parameter name pattern.
Attributes that are typically specified are summarised below. These include various datatype-specific attributes (Section A.5, “Datatype-specific Attributes”).
minlength:
Specifies the minimum length of a regular expression. This is used to ensure an expression has been defined.
maxlength:
Specifies the maximum length of a regular expression.
maxsize:
Specifies the maximum number of patterns that will be read from a file.
upper:
Converts a regular expression to uppercase. lower: is also available (only one of the two should be given).
type:
Defines a regular expression or pattern to be nucleotide or protein, which allows pattern matching to use sequence ambiguity codes. By default the type is string, which uses exact character matching.
pmismatch:
Sets the number of mismatches allowed in a pattern. Regular expression algorithms do not support mismatches, so this attribute is only available for sequence patterns.
pname:
Sets a pattern name to be used in output. Multiple input patterns have numbers appended to this name.
pformat:
Defines the pattern or regular expression input format. The default is to use the string as the pattern or (as for sequence input) to read a string of the form @filename as a file of patterns. If the format is defined as 'fasta', the file has a FASTA-style identifier line for each pattern, with an optional mismatch=nn term to set the number of mismatches for that pattern (mismatches are not applicable to regular expressions). The default format is simply one pattern per line, with a name automatically defined using the -pname qualifier.
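To make the effect of type: and pmismatch: more concrete, here is a minimal standalone sketch of nucleotide pattern matching with IUPAC ambiguity codes and a mismatch allowance. It is not the AJAX implementation: the function names iupac_mask, base_matches and find_pattern are hypothetical, and the search is a simple brute-force scan chosen for clarity.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Bit mask of the bases (A=1, C=2, G=4, T=8) represented by an IUPAC
 * nucleotide code; returns 0 for characters that are not IUPAC codes. */
unsigned iupac_mask(char code)
{
    switch (toupper((unsigned char) code)) {
    case 'A': return 1;
    case 'C': return 2;
    case 'G': return 4;
    case 'T': case 'U': return 8;
    case 'R': return 1 | 4;         /* A or G */
    case 'Y': return 2 | 8;         /* C or T */
    case 'S': return 2 | 4;         /* C or G */
    case 'W': return 1 | 8;         /* A or T */
    case 'K': return 4 | 8;         /* G or T */
    case 'M': return 1 | 2;         /* A or C */
    case 'B': return 2 | 4 | 8;     /* not A */
    case 'D': return 1 | 4 | 8;     /* not C */
    case 'H': return 1 | 2 | 8;     /* not G */
    case 'V': return 1 | 2 | 4;     /* not T */
    case 'N': return 15;            /* any base */
    default:  return 0;
    }
}

/* A pattern code matches a sequence base if their base sets overlap. */
int base_matches(char pat, char base)
{
    return (iupac_mask(pat) & iupac_mask(base)) != 0;
}

/* Returns the 0-based offset of the first window of seq that matches pat
 * with at most max_mismatch mismatching positions, or -1 if none does. */
long find_pattern(const char *pat, const char *seq, unsigned max_mismatch)
{
    size_t plen, slen, i, j;

    assert(pat != NULL && seq != NULL);
    plen = strlen(pat);
    slen = strlen(seq);

    if (plen == 0 || plen > slen)
        return -1;

    for (i = 0; i + plen <= slen; i++) {
        unsigned miss = 0;

        /* Stop counting as soon as the allowance is exceeded. */
        for (j = 0; j < plen && miss <= max_mismatch; j++)
            if (!base_matches(pat[j], seq[i + j]))
                miss++;

        if (miss <= max_mismatch)
            return (long) i;
    }

    return -1;
}
```

With these definitions, find_pattern("GAATTC", "CCGACTTCGG", 0) finds nothing, whereas allowing one mismatch matches the window starting at offset 2. The ambiguity code N in "GANTC" matches any base, which is the behaviour a plain string comparison would not give.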
The basic AJAX datatype for handling patterns is:
AjPRegexp
Regular expression.
Two AJAX datatypes are provided for handling lists of patterns, including input patterns defined in the ACD file. These are lists of individual patterns and regular expressions, using the datatypes below:
AjPPatlistSeq
Sequence pattern list object holding a list of sequence patterns and associated information (for the pattern ACD datatype).
AjPPatlistRegex
General pattern list object holding a list of general patterns and associated information (for the regexp ACD datatype).
The following AJAX datatypes are provided for handling regular expressions at a low level beyond that provided by the static data structures and functions. You are unlikely to need to use these directly:
AjPPatComp
All required data for compiling and searching.
AjPPatternSeq
Definition of feature pattern.
AjPPatternRegex
Holds definition of feature pattern.
Datatypes and functions for handling pattern input via the ACD file are shown below (Table 6.8, “Datatypes and Functions for Pattern Input”).
 | To read a regular expression | To read a sequence pattern |
---|---|---|
ACD datatype | regexp | pattern |
AJAX datatype | AjPPatlistRegex | AjPPatlistSeq |
To retrieve from ACD | ajAcdGetRegexp | ajAcdGetPattern |
Your application code will call embInit to process the ACD file and command line (see Section 6.3, “Handling ACD Files”). All values from the ACD file are read into memory and files are opened as necessary. You have a handle on the files and memory through the ajAcdGet* family of functions, which return pointers to appropriate objects.
To retrieve an input pattern, an object pointer is declared and then initialised using the appropriate ajAcdGet* function.
In cases where just a single regular expression is required, ajAcdGetRegexpSingle is used to return a single AjPRegexp object:
AjPRegexp ajAcdGetRegexpSingle (const char *token);
It is used as follows:
AjPRegexp regexp = NULL;
regexp = ajAcdGetRegexpSingle("patterns");
The memory for any other patterns or regular expressions is automatically cleared by a final call to embExit.
Currently there are no functions for this.
To use a pattern object that is not defined in the ACD file you must first instantiate the appropriate object pointer. The default constructor functions are:
/* Create a new regular expression pattern list. */
AjPPatlistRegex ajPatlistRegexNew (void);
/* Create a new sequence pattern list. */
AjPPatlistSeq ajPatlistSeqNew (void);
AjPPatComp ajPatCompNew (void);
All constructors return the address of a new object. The pointers do not need to be initialised to NULL but it is good practice to do so:
AjPPatlistRegex expression = NULL;
AjPPatlistSeq pattern = NULL;
expression = ajPatlistRegexNew();
pattern = ajPatlistSeqNew();
/* The objects are instantiated and ready for use */
You must free the memory for an object before the pointer is re-used and also once you are finished with it. The default destructor functions are:
/* Delete a regular expression pattern list. */
void ajPatlistRegexDel (AjPPatlistRegex* pthys);
/* Delete a sequence pattern list. */
void ajPatlistSeqDel (AjPPatlistSeq* pthys);
Destructors for the lower-level objects follow the same form:
void ajPatternSeqDel (AjPPatternSeq* pthys);
void ajPatCompDel (AjPPatComp* pthys);
The list destructors are used as follows:
AjPPatlistSeq plist = NULL;
AjPPatlistRegex rlist = NULL;
AjPRegexp patexp = NULL;
AjPSeqall seqall = NULL;
AjPSeq seq = NULL;
plist = ajAcdGetPattern("pattern");
rlist = ajAcdGetRegexp("regexp");
seqall = ajAcdGetSeqall("sequence");
while(ajSeqallNext(seqall, &seq))
{
    ......
}
ajPatlistSeqDel(&plist);
ajPatlistRegexDel(&rlist);
These functions are used internally in processing lists of patterns and regular expressions. They are available for developers to use when working with lists.
/* Constructor for a sequence pattern list object (adds a pattern to a supplied list). */
AjPPatternSeq ajPatternSeqNewList (AjPPatlistSeq plist, const AjPStr name, const AjPStr pat, ajuint mismatch);
AjPPatternRegex ajPatternRegexNewList (AjPPatlistRegex plist, const AjPStr name, const AjPStr pat);
/* Constructor for a sequence pattern list object (parameter is true for a protein pattern) */
AjPPatlistSeq ajPatlistSeqNewType (AjBool Protein);
/* Constructor for a pattern list object with a specified type */
AjPPatlistRegex ajPatlistRegexNewType (ajuint type);
/* Removes current pattern from pattern list. */
void ajPatlistSeqRemoveCurrent (AjPPatlistSeq thys);
void ajPatlistRegexRemoveCurrent (AjPPatlistRegex thys);
/* Adds pattern into patternlist */
void ajPatlistAddRegex (AjPPatlistRegex thys, AjPPatternRegex pat);
void ajPatlistAddSeq (AjPPatlistSeq thys, AjPPatternSeq pat);
These functions are used in ACD processing to convert a pattern qualifier value (a pattern or a filename reference) into a pattern or regular expression list. They can be used to define patterns from strings or files:
/* Parses a file of regular expressions into a pattern list object. */
AjPPatlistRegex ajPatlistRegexRead(const AjPStr patspec, const AjPStr patname, const AjPStr fmt, ajuint type, AjBool upper, AjBool lower);
/* Parses a file into pattern list object. */
AjPPatlistSeq ajPatlistSeqRead(const AjPStr patspec, const AjPStr patname, const AjPStr fmt, AjBool protein, ajuint mismatches);
/* Resets the pattern list iteration. */
void ajPatlistSeqRewind (AjPPatlistSeq thys);
void ajPatlistRegexRewind (AjPPatlistRegex thys);
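The exact file layout accepted by these read functions is not reproduced here. As a sketch of the idea behind a FASTA-style pattern file, the hypothetical helper below extracts a pattern name and an optional mismatch=nn term from an identifier line. The function name parse_pattern_header, the fixed-size name buffer, and the assumption that the term follows the name after whitespace are all illustrative choices, not AJAX behaviour.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parser for a header line such as ">site1 mismatch=2".
 * Copies the identifier into name (at most namelen-1 characters) and stores
 * the mismatch count in *mismatch (0 when no mismatch= term is present).
 * Returns 1 on success, 0 if the line is not a '>' header with a name. */
int parse_pattern_header(const char *line, char *name, size_t namelen,
                         unsigned *mismatch)
{
    const char *id;
    const char *term;
    size_t idlen, n;

    assert(line != NULL && name != NULL && namelen > 0 && mismatch != NULL);

    if (line[0] != '>')
        return 0;

    id = line + 1;
    idlen = strcspn(id, " \t\r\n");     /* identifier ends at whitespace */
    if (idlen == 0)
        return 0;

    n = (idlen < namelen) ? idlen : namelen - 1;   /* truncate long names */
    memcpy(name, id, n);
    name[n] = '\0';

    /* Look for the optional term only after the identifier itself. */
    *mismatch = 0;
    term = strstr(id + idlen, "mismatch=");
    if (term != NULL)
        *mismatch = (unsigned) strtoul(term + strlen("mismatch="), NULL, 10);

    return 1;
}
```

Called on ">site1 mismatch=2" this yields the name "site1" and a mismatch count of 2; a header without the term leaves the count at 0, and a line that does not start with '>' is rejected.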
Applications work at the level of lists of patterns and regular expressions. For these, functions are available to return the number of items in the list, and to return the next item.
/* Gets next available pattern from list. */
AjBool ajPatlistSeqGetNext (AjPPatlistSeq thys, AjPPatternSeq* pattern);
/* Gets number of patterns from list. */
ajuint ajPatlistSeqGetSize (const AjPPatlistSeq plist);
/* Gets number of patterns from list. */
ajuint ajPatlistRegexGetSize (const AjPPatlistRegex plist);
/* Gets next available pattern from list. */
AjBool ajPatlistRegexGetNext (AjPPatlistRegex thys, AjPPatternRegex* pattern);
There are further functions to return elements of each item retrieved from a list:
/* Returns the name of the pattern. */
const AjPStr ajPatternSeqGetName (const AjPPatternSeq thys);
/* Returns pattern in string format. */
const AjPStr ajPatternSeqGetPattern (const AjPPatternSeq thys);
/* Returns void pointer to compiled pattern. */
AjPPatComp ajPatternSeqGetCompiled (const AjPPatternSeq thys);
/* Returns true if the pattern is for a protein sequence. */
AjBool ajPatternSeqGetProtein (const AjPPatternSeq thys);
/* Returns the mismatch of the pattern. */
ajuint ajPatternSeqGetMismatch (const AjPPatternSeq thys);
/* Returns the name of the pattern */
const AjPStr ajPatternRegexGetName (const AjPPatternRegex thys);
/* Returns pattern in string format. */
const AjPStr ajPatternRegexGetPattern (const AjPPatternRegex thys);
/* Returns void pointer to compiled pattern. */
AjPRegexp ajPatternRegexGetCompiled (const AjPPatternRegex thys);
/* Returns the type of the pattern. */
ajuint ajPatternRegexGetType (const AjPPatternRegex thys);
There are no elements that should be set within the pattern list objects.
These functions report the internals of pattern objects to the debug file when -debug is set on the command line.
void ajPatternSeqDebug (const AjPPatternSeq pat);
void ajPatternRegexDebug (const AjPPatternRegex pat);
These functions copy a list of patterns to a string for use in debugging:
ajuint ajPatlistRegexDoc (AjPPatlistRegex thys, AjPStr* pdoc);
ajuint ajPatlistSeqDoc (AjPPatlistSeq thys, AjPStr* pdoc);
There is also a function to return the type associated with a named type of regular expression:
ajuint ajPatternRegexType (const AjPStr type);