6.5. Handling Strings

6.5.1. Introduction

Efficient and flexible string handling is fundamental to molecular sequence manipulation. Accordingly, string handling is the best developed area in the AJAX library. The functionality, which is spread over several library files, is comprehensive and includes:

  • String construction

  • String referencing and dereferencing where a handle on, but not a copy of, a string is required

  • Assignment functions to assign a value to a string

  • Functions to combine two strings or parts of a string. The types of operation include appending, insertion and pasting (overwriting character positions)

  • Cut functions to remove substrings, regions or characters from a target string

  • Substitutions of characters or substrings of a string with other characters/substrings

  • Query functions to test the properties of a string

  • Retrieval of characters and string properties (such as length)

  • Conversion functions to convert a string to some other datatype

  • String formatting

  • String comparison functions

  • Search functions to find substrings or characters in strings

  • String parsing functions to parse text tokens from strings

  • String iteration, which allows you to step through a string a single character at a time

  • String tokenisation

  • Formatting and printing. Conversion characters are defined for all the EMBOSS fundamental datatypes (Section 5.1, “Basic Datatypes”) and are an extension of the basic C conversion codes

For convenience, most functions implemented for an AJAX string parameter have a corresponding function with a C-type (char *) string parameter. A string may be defined in the ACD file and retrieved from the C source code by a call to ajAcdGetString. More typically though, strings are created directly in the code.

In contrast to standard C-type (char *) strings, the AJAX string object (AjPStr) is dynamic; memory is (re)allocated as needed so that you never run out of space when calling string functions. AJAX strings are reference counted: the object keeps track of how many references (pointers) to the string exist in the code. The string itself is not freed until all references to it have been deleted.

The string object definition is shown below:

typedef struct AjSStr
{
    ajuint Res;
    ajuint Len;
    char *Ptr;
    ajuint Use;
    ajint Padding;
} AjOStr;
#define AjPStr AjOStr*
typedef AjPStr* AjPPStr;

Ptr holds the character string and Len is its length. In contrast to C-type strings, the character string may or may not be NULL terminated. The library functions for printing AjPStr objects use the length field (Len) to determine how many characters to print and will not stop at the first NULL if there is one.

Res is the reserved dynamic memory associated with the object and is always at least equal to Len but is often more. It is used for handling dynamic reallocation of string memory. Use is the string reference counter mentioned above. Finally, the Padding element pads the string to an alignment boundary (to mollify strict compilers).

The string object and the internals of string memory management are described in greater detail elsewhere (Section 5.5, “Programming with Objects”).
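As a rough illustration of how a reserved size (Res) and a length (Len) can work together, the sketch below grows a buffer on demand when text is appended. It is written in plain C and is not the AJAX implementation: the DemoStr type, the demoStrAppendC function and the doubling growth policy are all assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a dynamic string object (not AjOStr itself). */
typedef struct DemoStr
{
    size_t Res;   /* Reserved bytes; kept larger than Len in this sketch */
    size_t Len;   /* Number of characters currently stored               */
    char  *Ptr;   /* Character data                                      */
} DemoStr;

/* Append txt to s, reallocating only when the reserved size is too small.
** Returns 1 on success, 0 if memory could not be allocated. */
static int demoStrAppendC(DemoStr *s, const char *txt)
{
    size_t add;
    size_t need;
    size_t newres;
    char  *tmp;

    assert(s != NULL && txt != NULL);

    add  = strlen(txt);
    need = s->Len + add + 1;          /* +1 for the terminating '\0' */

    if(need > s->Res)
    {
        /* Assumed policy: start at 16 bytes and double until it fits */
        newres = s->Res ? s->Res : 16;

        while(newres < need)
            newres *= 2;

        tmp = realloc(s->Ptr, newres);

        if(!tmp)
            return 0;

        s->Ptr = tmp;
        s->Res = newres;
    }

    memcpy(s->Ptr + s->Len, txt, add + 1);
    s->Len += add;

    return 1;
}
```

A caller would start from a zeroed DemoStr, append as often as needed, and free Ptr at the end. The point is that the caller never sizes the buffer itself.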

6.5.2. AJAX Library Files

AJAX library files for handling strings are listed in the table (Table 6.5, “AJAX Library Files for Handling Strings”). Library file documentation, including a complete description of datatypes and functions, is available at:

http://emboss.open-bio.org/rel/dev/libs/

Table 6.5. AJAX Library Files for Handling Strings

Library File Documentation    Description
ajstr                         General string handling
ajfmt                         String formatting functions

ajstr.h/c

Most of the functions you will ever need for general string handling. They define the basic string object (AjPStr), string iteration object (AjIStr) and string token parser object (AjPStrTok) for use with the functions.

ajfmt.h/c

Functions for string formatting. The functions are similar to the C functions printf, fprintf etc, but the set of conversion specifiers and other functionality is extended. They also contain a static data structure and functions for handling formatting at a low level (Section 6.5.23, “Handling String Formatting”).

You are unlikely to need the static data structures and functions unless you plan to extend the string library.

6.5.3. ACD Datatypes

The ACD datatype for handling string input is:

string

String.

6.5.4. ACD Data Definition

A typical ACD definition for string input:

string: delimiter 
[
    default: "|"
    information: "Delimiter of records in text output file"
    knowntype: "output delimiter"
]

6.5.4.1. Parameter Name

A standard parameter name (Section A.1.3, “Parameter Naming Conventions”) might be used, depending on the specific use-case of the data definition.

6.5.4.2. Common Attributes

Attributes that are typically specified are summarised below. They are datatype-specific (Section A.5, “Datatype-specific Attributes”) unless they are indicated as being global attributes (Section A.4, “Global Attributes”).

default: A global attribute that specifies a default value.

information: A global attribute that specifies the user-prompt and is used in the application documentation.

knowntype: This global attribute should always be specified for string inputs. If the output is not one of the standard EMBOSS known types (Section 4.3.5.3.1, “Application Data Known Types File (knowntypes.standard)”) then ApplicationName output is the recommended value.

6.5.5. AJAX Datatypes

For handling strings, including those defined in the ACD file (string ACD datatype), use:

AjPStr

String.

Two further datatypes are used for string-related operations:

AjIStr

String iteration object.

AjPStrTok

String token parser object.

6.5.6. ACD File Handling

Datatypes and functions for handling string input via the ACD file are shown below (Table 6.6, “Datatypes and Functions for String Input”).

Table 6.6. Datatypes and Functions for String Input

                        To read a string
ACD datatype            string
AJAX datatype           AjPStr
To retrieve from ACD    ajAcdGetString

6.5.6.1. Input String Retrieval

To retrieve an input string an object pointer is declared and then initialised using ajAcdGetString:

    AjPStr delimiter = NULL;

    delimiter = ajAcdGetString("delimiter");

6.5.6.2. Processing Command line Options and ACD Attribute

Currently there are no functions for this.

6.5.7. String Object Memory Management

6.5.7.1. Default Object Construction

To use a string object that is not defined in the ACD file you must first instantiate the appropriate object pointer. The default string construction function is:

/* Create a string object.        */
AjPStr  ajStrNew (void);

All constructors return the address of a new object. The pointers do not need to be initialised to NULL but it is good practice to do so:

    AjPStr       str = NULL;

    str    = ajStrNew();

    /* The object is instantiated and ready for use */

6.5.7.2. Default Object Destruction

You must free the memory for an object once you are finished with it. The default string destructor function is:

/* Delete a string object.        */
void  ajStrDel (AjPStr *Pstr);

It is the responsibility of the calling function to destroy any objects it creates:

    AjPStr str = NULL;

    str = ajStrNew();

    /* Do something with the instantiated object */

    ajStrDel(&str);

    /* The memory is freed and the pointer reset to NULL, ready for re-use. */

    str = ajStrNew();

    /* Do something else with the new object.  The pointer variable is reallocated. */

    ajStrDel(&str);

    /* Done with the object so the memory is freed. */

6.5.7.3. Alternative Object Construction and Loading

A variety of alternative string constructor functions are available. A string can be constructed from an existing string object (AjPStr) or C-type (char *) string, with an optional reserved size:

/* Construct from C-type string */
AjPStr  ajStrNewC (const char *txt);                                  

/* Construct from C-type string with reserved size */    
AjPStr  ajStrNewResC (const char *txt, ajuint size);                  

/* Construct from C-type string with reserved size and explicit string length */
AjPStr  ajStrNewResLenC (const char *txt, ajuint size, ajuint len);   

/* Construct with reserved size */ 
AjPStr  ajStrNewRes(ajuint size);                                     

/* Construct from string object */ 
AjPStr  ajStrNewS (const AjPStr str);                                 

/* Construct from string object with reserved size */
AjPStr  ajStrNewResS (const AjPStr str, ajuint size);

ajStrNewResLenC is identical to ajStrNewResC except that the string length is passed to ajStrNewResLenC for speed.

They are all used in the same way as the default constructor, i.e. they return a pointer to the new object.

6.5.8. String Referencing and Dereferencing Functions

There is a string referencing function:

/* Reference an existing string */
AjPStr  ajStrNewRef (AjPStr str);

In contrast to the other constructor functions ajStrNewRef does not create a new object but instead returns a pointer to the string passed in and increases its reference count.

There is a string dereferencing function:

/* Dereference an existing string */
AjBool  ajStrDelStatic (AjPStr* Pstr);

ajStrDelStatic will set the string pointer to NULL and decrement the use count of the string to which it refers. In contrast to the default destructor function, strings with a use count of 1 are not freed to avoid freeing and reallocating memory when they are reused. Memory reserved for the string is never deleted by this function and can be reused.
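The idea of reference counting can be sketched in plain C as follows. This is a simplified model and not the AJAX code: the RefStr type and the refStrNewC, refStrNewRef and refStrDel functions are hypothetical names, and the sketch frees the data as soon as the count reaches zero.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted string (illustration only). */
typedef struct RefStr
{
    char        *Ptr;   /* Character data                  */
    unsigned int Use;   /* Number of handles on the string */
} RefStr;

/* Create a new string with a use count of 1. */
static RefStr *refStrNewC(const char *txt)
{
    RefStr *s;
    size_t  n;

    assert(txt != NULL);

    n = strlen(txt) + 1;
    s = malloc(sizeof *s);

    if(!s)
        return NULL;

    s->Ptr = malloc(n);

    if(!s->Ptr)
    {
        free(s);
        return NULL;
    }

    memcpy(s->Ptr, txt, n);
    s->Use = 1;

    return s;
}

/* Take another handle on the same string: no copy, count goes up. */
static RefStr *refStrNewRef(RefStr *s)
{
    assert(s != NULL);
    s->Use++;

    return s;
}

/* Drop one handle; the data is freed only when no handles remain. */
static void refStrDel(RefStr **Ps)
{
    RefStr *s;

    if(!Ps || !*Ps)
        return;

    s   = *Ps;
    *Ps = NULL;

    if(--s->Use == 0)
    {
        free(s->Ptr);
        free(s);
    }
}
```

In this model two handles obtained through refStrNewRef compare equal as pointers, and deleting one of them leaves the text intact for the other.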

6.5.9. String Assignment Functions

The string assignment functions will assign a value to a string. A string can be assigned from a character, an existing string object (AjPStr) or C-type (char *) string, or a substring of an appropriate datatype. Some function variants allow optional reserved sizes to be specified:

/* Assign from character     */    
AjBool  ajStrAssignK (AjPStr* Pstr, char chr);                                    

/* Assign from C-type string */
AjBool  ajStrAssignC (AjPStr* Pstr, const char* txt);                             

/* Assign from string object */
AjBool  ajStrAssignS (AjPStr* Pstr, const AjPStr str);                            

/* Assign from C-type string up to a given length */
AjBool  ajStrAssignLenC (AjPStr* Pstr, const char* txt, ajuint ilen);             

/* Assign using a pointer only. The reference count is incremented */
AjBool  ajStrAssignRef (AjPStr* Pstr, AjPStr refstr);         

/* Assign from C-type string with reserved size */
AjBool  ajStrAssignResC (AjPStr* Pstr, ajuint size, const char* txt);             

/* Assign from string object with reserved size */
AjBool  ajStrAssignResS (AjPStr* Pstr, ajuint i, const AjPStr str);               

/* Assign from substring of C-type string */
AjBool  ajStrAssignSubC (AjPStr* Pstr, const char* txt,  ajint pos1, ajint pos2); 

/* Assign from substring of string object */
AjBool  ajStrAssignSubS (AjPStr* Pstr, const AjPStr str, ajint pos1, ajint pos2); 

ajStrAssignLenC is identical to ajStrAssignC except that the source string is only copied up to a specified length.

Memory for the string is allocated to NULL target pointers if necessary, although to keep the calling code intuitive we strongly recommend that a string object is first instantiated by calling ajStrNew before any of these functions are used.

For example, in the following code it is clear you are dealing with two separate strings:

    AjPStr str     = NULL;
    AjPStr strcopy = NULL;

    str     = ajStrNewC("A string");
    strcopy = ajStrNew();

    if(!ajStrAssignS(&strcopy, str))
        ajFatal("String not assigned");

    ajStrDel(&str);
    ajStrDel(&strcopy);

Whereas the following code is perfectly valid but is less clear:

    AjPStr str     = NULL;
    AjPStr strcopy = NULL;

    str = ajStrNewC("A string");

    if(!ajStrAssignS(&strcopy, str))
        ajFatal("String not assigned");

    ajStrDel(&str);
    ajStrDel(&strcopy);

6.5.10. String Combination Functions

The string combination functions will combine two strings together. They fall into a variety of classes described below.

6.5.10.1. String append functions

The string append functions will append a source string to a target string. An individual character or multiple characters, an existing string object (AjPStr) or C-type (char *) string, or a substring of either of the latter can be appended:

/* Append a C-type string */
AjBool  ajStrAppendC (AjPStr* Pstr, const char* txt);                               

/* Append a single character */
AjBool  ajStrAppendK (AjPStr* Pstr, char chr);                                      

/* Append a string object */
AjBool  ajStrAppendS (AjPStr* Pstr, const AjPStr str);                              

/* Append multiples of a single character */
AjBool  ajStrAppendCountK (AjPStr* Pstr, char chr, ajuint num);                     

/* Append a C-type string up to a given length */
AjBool  ajStrAppendLenC (AjPStr* Pstr, const char* txt, ajuint len);                

/* Append a substring of a string object */
AjBool  ajStrAppendSubS (AjPStr* Pstr, const AjPStr str, ajint pos1, ajint pos2);   

ajStrAppendLenC is identical to ajStrAppendC except that a region from the source string up to a specified length is appended.

6.5.10.2. String Join Functions

The string join functions are similar to the append functions except that they cut the source and target strings at specified positions before appending:

/* Cut down string at pos1 and add string2 from position pos2. */
AjBool  ajStrJoinC (AjPStr* Pstr, ajint pos1, const char* txt, ajint pos2);
AjBool  ajStrJoinS (AjPStr* Pstr, ajint pos1,  const AjPStr str, ajint pos2);

6.5.10.3. String Insert Functions

The string insert functions will insert a character, an existing string object (AjPStr) or C-type (char *) string into a string:

/* Insert a C-type string */
AjBool  ajStrInsertC (AjPStr* pthis, ajint pos, const char* str);   

/* Insert a character     */
AjBool  ajStrInsertK (AjPStr* pthis, ajint begin, char insert);     

/* Insert a string        */
AjBool  ajStrInsertS  (AjPStr* pthis, ajint pos, const AjPStr str); 

6.5.10.4. String Paste Functions

The string paste functions will overwrite the target string with the source string (or character) at a specified point (pos), using (optionally) up to a specified number of characters from the source string:

/* Paste string  */
AjBool  ajStrPasteS( AjPStr* Pstr, ajint pos, const AjPStr str);                

/* Paste specified number of characters    */
AjBool  ajStrPasteCountK(AjPStr* Pstr, ajint pos, char chr, ajuint num);        

/* Paste portion of C-type string */
AjBool  ajStrPasteMaxC (AjPStr* Pstr, ajint pos, const char* txt, ajuint n);    

/* Paste portion of string object */
AjBool  ajStrPasteMaxS( AjPStr* Pstr, ajint pos, const AjPStr str, ajuint n);   

In addition there is a string masking function which will replace all characters in the target string with a mask character over a specified range:

/* Replace all characters in a region with mask characters */
AjBool  ajStrMaskRange(AjPStr* str, ajint begin, ajint end, char maskchar);          
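To make the masking operation concrete, here is a plain C sketch that overwrites a region of a C-type string with a mask character. The function name demoMaskRange is hypothetical, and the sketch assumes zero-based, inclusive start and end positions. The AJAX function may treat positions differently, for example by accepting negative values, so check the library documentation before relying on this.

```c
#include <assert.h>
#include <string.h>

/* Overwrite s[begin..end] (inclusive, zero-based) with maskchar.
** Returns 1 if the range was valid and masked, 0 otherwise. */
static int demoMaskRange(char *s, size_t begin, size_t end, char maskchar)
{
    size_t len;
    size_t i;

    assert(s != NULL);

    len = strlen(s);

    if(begin > end || end >= len)
        return 0;                 /* Range falls outside the string */

    for(i = begin; i <= end; i++)
        s[i] = maskchar;

    return 1;
}
```

Applied to "ACGTACGT" with the range 2 to 5 and the mask character 'N', the sketch yields "ACNNNNGT"; the string length is unchanged.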

6.5.11. String Cut Functions

The string cut functions will remove regions or individual characters from a target string. A selection of the available functions, grouped by functional category, is described below. All the functions return ajTrue if the operation was performed successfully or ajFalse otherwise.

6.5.11.1. Simple cut functions

A number of characters can be removed from the start, end or interior of a string using:

/* Removes a number of characters from the start of a string. */
AjBool  ajStrCutStart(AjPStr* Pstr, ajuint len);                   

/* Removes a number of characters from the end of a string. */
AjBool  ajStrCutEnd(AjPStr* Pstr, ajuint len);                     

/* Removes a region from a string. */
AjBool  ajStrCutRange(AjPStr* Pstr, ajint pos1, ajint pos2);       

6.5.11.2. Removing characters from a string

Functions to remove characters from a string include:

/* Removes non-sequence characters (all but alphabetic characters and asterisk) */
AjBool  ajStrRemoveGap(AjPStr* thys);                              

/* Removes HTML mark-up from a string. */
AjBool  ajStrRemoveHtml(AjPStr* pthis);                            

/* Removes last character from a string if it is a newline character. */
AjBool  ajStrRemoveLastNewline(AjPStr* Pstr);                      

/* Removes all of a given set of characters from a string. */
AjBool  ajStrRemoveSetC(AjPStr* Pstr, const char *txt);            

/* Removes all whitespace characters from a string. */
AjBool  ajStrRemoveWhite(AjPStr* Pstr);                            

/* Removes excess whitespace characters from a string. */
AjBool  ajStrRemoveWhiteExcess(AjPStr* Pstr);                      

/* Removes excess space characters from a string. */
AjBool  ajStrRemoveWhiteSpaces(AjPStr* Pstr);                      

/* Removes all characters after the first wildcard character (if found). */
AjBool  ajStrRemoveWild(AjPStr* Pstr); 

ajStrRemoveWhiteExcess and ajStrRemoveWhiteSpaces both remove the leading/trailing whitespace from a string and replace multiple spaces with a single space. Additionally, ajStrRemoveWhiteSpaces converts tabs to spaces but leaves newline characters unchanged.
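The effect of removing excess whitespace can be sketched in plain C as below. The function demoRemoveWhiteExcess is a hypothetical name and a simplification: it treats every character for which isspace is true as whitespace and collapses each run to a single space, which may differ in detail from the AJAX functions.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Strip leading/trailing whitespace and collapse internal runs of
** whitespace to one space, editing the string in place. */
static void demoRemoveWhiteExcess(char *s)
{
    char *in;
    char *out;
    int   pending = 0;   /* 1 if a separator is owed before the next word */

    assert(s != NULL);

    in  = s;
    out = s;

    while(*in)
    {
        if(isspace((unsigned char) *in))
        {
            /* Only remember the gap if something was already written */
            pending = (out != s);
        }
        else
        {
            if(pending)
                *out++ = ' ';

            pending = 0;
            *out++  = *in;
        }

        in++;
    }

    *out = '\0';          /* Trailing whitespace is never written back */
}
```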

6.5.11.3. Retaining characters in a string

Functions are available to keep only a defined region of a string, or only those characters that belong to a defined set; everything else is removed. The character sets can be provided either as a string object (AjPStr) or C-type (char *) string:

/* Trim sequence down to a defined range */
AjBool  ajStrKeepRange(AjPStr* Pstr, ajint pos1, ajint pos2);      

/* Removes all characters that are not in a given set. */
AjBool  ajStrKeepSetC(AjPStr* Pstr, const char* txt);              

/* Removes all characters that are not in a given set. */
AjBool  ajStrKeepSetS(AjPStr* Pstr, const AjPStr str);             

/* Removes all characters that are not alphabetic. */
AjBool  ajStrKeepSetAlpha(AjPStr* Pstr);                           

/* Removes all characters that are not alphabetic and are not in a given set. */
AjBool  ajStrKeepSetAlphaC(AjPStr* Pstr, const char* txt); 

6.5.11.4. String trimming functions

The string trim functions below will remove region(s) of a given character composition (provided in the string txt) from the start and/or end of a string:

/* Remove from start of a string */
AjBool  ajStrTrimStartC (AjPStr* Pstr, const char* txt);            

/* Remove from end of a string */
AjBool  ajStrTrimEndC (AjPStr* Pstr, const char* txt);              

/* Remove from start and end of a string */
AjBool  ajStrTrimC (AjPStr* pthis, const char* txt); 

All characters will be removed from the start and/or end up to the first character that is not in the set provided.
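The trimming rule just described can be sketched in plain C using the standard strspn and strchr functions. The name demoTrimC is hypothetical and the code is an illustration of the behaviour, not the library source.

```c
#include <assert.h>
#include <string.h>

/* Remove from both ends of s every character that occurs in set,
** stopping at the first character (from each end) that is not in it. */
static void demoTrimC(char *s, const char *set)
{
    size_t start;
    size_t len;

    assert(s != NULL && set != NULL);

    start = strspn(s, set);            /* Length of the leading run   */
    len   = strlen(s + start);

    while(len > 0 && strchr(set, s[start + len - 1]))
        len--;                         /* Shrink past the trailing run */

    memmove(s, s + start, len);
    s[len] = '\0';
}
```

Characters from the set that lie in the interior of the string are untouched: trimming "xxAxBxx" with the set "x" gives "AxB".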

Similar functions are provided to remove regions composed of white space characters only from the start and end of a string.

/* Remove from start and end of a string. */
AjBool  ajStrTrimWhite (AjPStr* Pstr);         

/* Remove from start of a string. */
AjBool  ajStrTrimWhiteStart (AjPStr* Pstr);    

/* Remove from end of a string. */
AjBool  ajStrTrimWhiteEnd (AjPStr* Pstr); 

There are also two truncate functions which remove characters from the end of a string reducing it to a defined length (ajStrTruncateLen) or cut the end off a string at a defined position (ajStrTruncatePos):

AjBool  ajStrTruncateLen (AjPStr* Pstr, ajuint len);
AjBool  ajStrTruncatePos (AjPStr* Pstr, ajint pos);

6.5.12. String Substitution Functions

The string substitution functions will perform substitutions of characters or substrings of a string with other characters/substrings.

Functions with the prefix ajStrExchange will replace all occurrences in a string of one substring (or character) with another string (or character). Variants of the function support string objects (AjPStr) and C-type (char *) strings for the target and replacement substrings:

/* C-type string target and replacement.    */
AjBool  ajStrExchangeCC (AjPStr* Pstr, const char* txt, const char* txtnew);         

/* C-type string target, string replacement */
AjBool  ajStrExchangeCS (AjPStr* Pstr, const char* txt, const AjPStr strnew);        

/* Character target and replacement         */
AjBool  ajStrExchangeKK (AjPStr* Pstr, char chr, char chrnew);                       

/* String target, C-type string replacement */
AjBool  ajStrExchangeSC (AjPStr* Pstr, const AjPStr str, const char* txtnew);        

/* String target and replacement            */
AjBool  ajStrExchangeSS (AjPStr* Pstr, const AjPStr str, const AjPStr strnew); 

Functions with the prefix ajStrExchangeSet are similar except that they replace all occurrences in a string of one set of characters with another character or set of characters. Variants of the function use string objects (AjPStr) and C-type (char *) strings to define the sets:

/* C-type string target and replacement sets   */
AjBool  ajStrExchangeSetCC (AjPStr* Pstr, const char* txt,const char* newc);         

/* String target and replacement sets          */
AjBool  ajStrExchangeSetSS (AjPStr* Pstr, const AjPStr str,const AjPStr strnew);     

/* Replace C-type target with single character */
AjBool  ajStrExchangeSetRestCK (AjPStr* Pstr, const char* txt, char chr);            

/* Replace string target with single character */
AjBool  ajStrExchangeSetRestSK (AjPStr* Pstr, const AjPStr str, char chr); 
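One way to picture a set-for-set exchange is a positional mapping, in which the n-th character of the old set is replaced by the n-th character of the new set. The plain C sketch below works on that assumption; demoExchangeSetCC is a hypothetical name, and the AJAX functions may resolve mismatched set lengths or report success differently.

```c
#include <assert.h>
#include <string.h>

/* Replace each character of s found in oldset with the character at the
** same index in newset. Returns 1 if anything changed, 0 otherwise
** (including when the two sets differ in length). */
static int demoExchangeSetCC(char *s, const char *oldset, const char *newset)
{
    const char *hit;
    int changed = 0;

    assert(s != NULL && oldset != NULL && newset != NULL);

    if(strlen(oldset) != strlen(newset))
        return 0;

    for(; *s; s++)
    {
        hit = strchr(oldset, *s);

        if(hit)
        {
            *s = newset[hit - oldset];   /* Same index in the new set */
            changed = 1;
        }
    }

    return changed;
}
```

For example, mapping the set "Uu" to "Tt" converts the RNA text "AUGcuu" to "ATGctt" while preserving case.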

6.5.13. String Query Functions

The string query functions test the properties of a string.

All functions with the prefix ajStrIs return ajTrue if some basic test of a string is satisfied. The following functions illustrate the scope of the query tests that can be performed but you should see the online documentation for a full list:

/* Alphanumeric characters only. */
AjBool  ajStrIsAlnum (const AjPStr str);   

/* Alphabetic characters only. */
AjBool  ajStrIsAlpha (const AjPStr str);   

/* Represents Boolean value. */
AjBool  ajStrIsBool (const AjPStr str);    

/* Represents integer value. */
AjBool  ajStrIsInt (const AjPStr str);     

/* Represents float value. */
AjBool  ajStrIsFloat (const AjPStr str);   

/* No uppercase alphabetic characters. */
AjBool  ajStrIsLower (const AjPStr str);   

/* Decimal digits only. */
AjBool  ajStrIsNum (const AjPStr str);     

/* Uppercase alphabetic characters only. */
AjBool  ajStrIsUpper (const AjPStr str);   

6.5.14. String Properties and Character Retrieval Functions

For convenience, macros are provided to retrieve the properties of a string including its length, the C-type (char *) string, the usage count and the current reserved size. These macros all return an element of the string C-data structure:

#define   MAJSTRGETLEN(str) str->Len   /* String length         */
#define   MAJSTRGETPTR(str) str->Ptr   /* String char * pointer */
#define   MAJSTRGETRES(str) str->Res   /* Reserved length       */
#define   MAJSTRGETUSE(str) str->Use   /* Usage count           */

Functions are available to return individual characters from a string. These include:

/* Get first character */
char  ajStrGetCharFirst (const AjPStr str);            

/* Get last character */
char  ajStrGetCharLast (const AjPStr str);             

/* Get character from specified position */
char  ajStrGetCharPos (const AjPStr str, ajint pos);   

6.5.15. String Conversion Functions

A string may be converted to some other datatype using one of the following functions:

AjBool  ajStrToBool (const AjPStr str, AjBool* Pval);     /* Convert to boolean          */
AjBool  ajStrToDouble (const AjPStr str, double* Pval);   /* Convert to double           */
AjBool  ajStrToFloat (const AjPStr str, float* Pval);     /* Convert to float            */
AjBool  ajStrToHex (const AjPStr str, ajint* Pval);       /* Convert hex text to integer */
AjBool  ajStrToInt (const AjPStr str, ajint* Pval);       /* Convert to integer          */
AjBool  ajStrToLong (const AjPStr thys, ajlong* result);  /* Convert to long             */
AjBool  ajStrToUint (const AjPStr str, ajuint* Pval);     /* Convert to unsigned integer */

In all cases, the functions return ajTrue if the conversion was performed successfully. They take the address of a variable of the appropriate type. For example, to convert a string to an integer value:

    ajint val = 0;
    AjPStr str = NULL;

    str = ajStrNewC("10");

    if(!ajStrToInt(str, &val))
        ajFatal("This error message will not be printed.");

    ajStrDel(&str);
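The same calling pattern, a success flag as the return value and the result written through a pointer, can be sketched for C-type strings with the standard strtol function. The name demoStrToInt and the choice to reject trailing characters are assumptions of this example, not a description of how the AJAX functions validate input.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Convert decimal text to int. Returns 1 and sets *Pval on success;
** returns 0 and leaves *Pval untouched if txt is not a valid int. */
static int demoStrToInt(const char *txt, int *Pval)
{
    char *end = NULL;
    long  val;

    assert(txt != NULL && Pval != NULL);

    errno = 0;
    val   = strtol(txt, &end, 10);

    if(end == txt || *end != '\0')
        return 0;                      /* Empty or trailing junk */

    if(errno == ERANGE || val < INT_MIN || val > INT_MAX)
        return 0;                      /* Out of range for int   */

    *Pval = (int) val;

    return 1;
}
```

Because the destination is only written on success, the caller can safely initialise it to a default value beforehand.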

Conversely, the C datatypes can be converted to an EMBOSS string using the following:

AjBool  ajStrFromBool (AjPStr* Pstr, AjBool val);                          /* Convert from boolean                     */
AjBool  ajStrFromDouble (AjPStr* Pstr, double val, ajint precision);       /* Convert from double                      */
AjBool  ajStrFromDoubleExp (AjPStr* Pstr, double val, ajint precision);    /* Convert from double in exponential form. */
AjBool  ajStrFromFloat (AjPStr* Pstr, float val, ajint precision);         /* Convert from float                       */
AjBool  ajStrFromInt (AjPStr* Pstr, ajint val);                            /* Convert from integer                     */
AjBool  ajStrFromLong (AjPStr* Pstr, ajlong val);                          /* Convert from long                        */
AjBool  ajStrFromUint (AjPStr* Pstr, ajuint val);                          /* Convert from unsigned integer            */  

Again, these functions return ajTrue if the conversion was performed successfully, and take the address of a string. For example, to convert an integer to a string:

    ajint val = 0;
    AjPStr str = NULL;

    str = ajStrNew();
    val = 100;

    if(!ajStrFromInt(&str, val))
        ajFatal("This error message will not be printed.");

    ajStrDel(&str);

6.5.16. String Formatting Functions

Functions to reformat a string have the prefix ajStrFmt. For example, a string or region of a string can be converted to upper or lower case by using:

/* Convert to lower-case        */
AjBool  ajStrFmtLower (AjPStr* Pstr);                                        

/* Convert region to lower-case */
AjBool  ajStrFmtLowerSub (AjPStr* Pstr, ajint pos1, ajint pos2);  

/* Convert to upper-case        */
AjBool  ajStrFmtUpper (AjPStr* Pstr);                             

/* Convert region to upper-case */
AjBool  ajStrFmtUpperSub (AjPStr* Pstr, ajint pos1, ajint pos2);  

The address of the string to be reformatted is passed and ajTrue is returned if the reformatting was successful. You should see the online documentation for other formatting functions.

6.5.17. String Comparison Functions

EMBOSS provides comprehensive string comparison functions.

Functions with the prefix ajStrMatch compare one string with another. The functions perform case-sensitive and case-insensitive comparisons with or without wildcard characters. Variants that take a C-type (char *) string as the second argument are also available; of these, only ajStrMatchC is shown:

/* Simple string to C-type string comparison */
AjBool  ajStrMatchC (const AjPStr thys, const char* txt);          

/* Simple string to string comparison */
AjBool  ajStrMatchS (const AjPStr thys, const AjPStr str);         

/* Case-insensitive string to string comparison */
AjBool  ajStrMatchCaseS (const AjPStr thys, const AjPStr str);         

/* String to string comparison with wildcards */
AjBool  ajStrMatchWildS (const AjPStr thys, const AjPStr wild);        

/* Case-insensitive string to string comparison with wildcards */
AjBool  ajStrMatchWildCaseS (const AjPStr thys, const AjPStr wild);    

The following functions will compare the first two words in a string:

/* String to C-type string comparison with wildcards. */   
AjBool  ajStrMatchWildWordC (const AjPStr str, const char* text);       

/* String to string comparison with wildcards.*/
AjBool  ajStrMatchWildWordS (const AjPStr str, const AjPStr text);      

/* Case-insensitive string to C-type string comparison with wildcards.*/
AjBool  ajStrMatchWildWordCaseC (const AjPStr str, const char* text);   

/* Case-insensitive string to string comparison  with wildcards.*/
AjBool  ajStrMatchWildWordCaseS (const AjPStr str, const AjPStr text);  

Functions with the prefix ajStrPrefix or the prefix ajStrSuffix will compare the start or end of a string to the given prefix or suffix respectively. Variants that take a C-type (char *) string as the second argument are available but not shown:

/* Prefix comparison */
AjBool  ajStrPrefixS(const AjPStr str, const AjPStr str2);              

/* Case-insensitive prefix comparison */
AjBool  ajStrPrefixCaseS (const AjPStr str, const AjPStr pref);         

/* Suffix comparison */
AjBool  ajStrSuffixS (const AjPStr thys, const AjPStr suff);            

/* Case-insensitive suffix comparison */
AjBool  ajStrSuffixCaseS (const AjPStr str, const AjPStr pref);  

6.5.18. String Search Functions

String search functions have the prefix ajStrFind and are used to find substrings or characters within strings:

/* Find a string */
ajint  ajStrFindS (const AjPStr str, const AjPStr str2);           

/* Find a character */
ajint  ajStrFindAnyK(const AjPStr str, char chr);                  

/* Find any character in a set */
ajint  ajStrFindAnyS (const AjPStr str, const AjPStr str2);        

/* Find a string (case-insensitive) */
ajint  ajStrFindCaseS (const AjPStr str, const AjPStr str2);       

/* Find any character not in a set */
ajint  ajStrFindRestS (const AjPStr str, const AjPStr str2);       

/* Find any character not in a set (case-insensitive) */
ajint  ajStrFindRestCaseS (const AjPStr str, const AjPStr str2);   

/* Find last occurrence of a string */
ajint  ajStrFindlastS (const AjPStr str, const AjPStr str2); 

These functions return the position of the start of the search text in the string, or -1 if the text was not found.

ajStrFindAnyS, ajStrFindRestS, ajStrFindRestCaseS use a set of characters provided as a string (str2).
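The three kinds of search (a substring, any character in a set, any character not in a set) and the "position or -1" return convention can be sketched for C-type strings with the standard strstr, strpbrk and strspn functions. The demoFind names are hypothetical; the sketch is only meant to show what each kind of search returns.

```c
#include <assert.h>
#include <string.h>

/* Position of the first occurrence of sub in s, or -1. */
static int demoFindC(const char *s, const char *sub)
{
    const char *hit;

    assert(s != NULL && sub != NULL);
    hit = strstr(s, sub);

    return hit ? (int) (hit - s) : -1;
}

/* Position of the first character of s that IS in set, or -1. */
static int demoFindAnyC(const char *s, const char *set)
{
    const char *hit;

    assert(s != NULL && set != NULL);
    hit = strpbrk(s, set);

    return hit ? (int) (hit - s) : -1;
}

/* Position of the first character of s that is NOT in set, or -1. */
static int demoFindRestC(const char *s, const char *set)
{
    size_t n;

    assert(s != NULL && set != NULL);
    n = strspn(s, set);            /* Length of the leading in-set run */

    return s[n] ? (int) n : -1;
}
```

A "rest" search is handy for validation: searching a nucleotide string for the first character outside "ACGT" returns -1 when the string is clean.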

6.5.19. String Parsing Functions

Functions for parsing text tokens from strings have the prefix ajStrExtract or the prefix ajStrParse.

To extract the first word (Pword) and the remainder of the string (Prest) from an input string (str) use either of:

/* Remove first word (with no leading spaces) from a string */
AjBool  ajStrExtractFirst (const AjPStr str, AjPStr* Prest, AjPStr* Pword);  

/* Remove first word from a string, skipping spaces */
AjBool  ajStrExtractWord (const AjPStr str, AjPStr* Prest, AjPStr* Pword);

ajStrExtractWord will skip any leading whitespace whereas ajStrExtractFirst will return ajFalse if the input string starts with a space. Like most of the string functions they will allocate memory for the strings if necessary, although it is cleaner to allocate the strings manually. In the example below, ajStrExtractFirst will return ajFalse and the printed strings will be empty, whereas ajStrExtractWord will extract the first word (First) and the rest of the string ( word in this string is 'First') successfully:

    AjPStr inputstring = NULL;
    AjPStr word        = NULL;
    AjPStr rest        = NULL;

    inputstring = ajStrNewC("  First word in this string is 'First'");
    word        = ajStrNew();
    rest        = ajStrNew();

    ajStrExtractFirst(inputstring, &rest, &word);
    ajFmtPrint("word: %S\n", word);    /* Empty */
    ajFmtPrint("rest: %S\n", rest);    /* Empty */

    ajStrExtractWord(inputstring, &rest, &word);
    ajFmtPrint("word: %S\n", word);    /* First */
    ajFmtPrint("rest: %S\n", rest);    /*  word in this string is 'First' */

    ajStrDel(&inputstring);
    ajStrDel(&word);
    ajStrDel(&rest);
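The word-extraction behaviour can be modelled for C-type strings with strspn and strcspn, as in the sketch below. The function demoExtractWord is a hypothetical name, it treats space, tab and newline as whitespace, and it returns the remainder as a pointer into the input rather than as a new string; these choices are assumptions of the example.

```c
#include <assert.h>
#include <string.h>

/* Skip leading whitespace, copy the first word into word (capacity size)
** and point *Prest at the text following it. Returns 1 on success, 0 if
** there is no word or it does not fit. */
static int demoExtractWord(const char *txt, char *word, size_t size,
                           const char **Prest)
{
    size_t skip;
    size_t len;

    assert(txt != NULL && word != NULL && Prest != NULL && size > 0);

    skip = strspn(txt, " \t\n");            /* Leading whitespace  */
    len  = strcspn(txt + skip, " \t\n");    /* Length of the word  */

    if(len == 0 || len >= size)
        return 0;

    memcpy(word, txt + skip, len);
    word[len] = '\0';
    *Prest    = txt + skip + len;           /* Begins at the gap   */

    return 1;
}
```

As in the AJAX example above, the remainder keeps the whitespace that separated it from the first word.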

There is a function to split a newline-separated multi-line string into an array of strings:

ajuint     ajStrParseSplit(const AjPStr str, AjPStr **PPstr);

The function allocates memory for an array of strings (which must be freed later) and returns the number of array elements created:

    AjPStr  inputstring = NULL;
    AjPStr *array       = NULL;
    ajuint  dim;
    ajuint  x;

    inputstring = ajStrNewC("First line\nSecond line\nThird line\n");

    dim = ajStrParseSplit(inputstring, &array);

    for(x=0; x<dim; x++)
        ajFmtPrint("array[%u]: %S\n", x, array[x]);

    ajStrDel(&inputstring);

    for(x=0; x<dim; x++)
        ajStrDel(&array[x]);

    AJFREE(array);
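
The ownership pattern shown above (the callee allocates the array and each element, the caller frees each element and then the array) can be sketched in standard C. The names split_lines and free_lines are hypothetical, and the handling of a final line without a trailing newline is an assumption of this sketch rather than documented ajStrParseSplit behaviour:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: frees n strings and then the array itself */
void free_lines(char **lines, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        free(lines[i]);
    free(lines);
}

/* Hypothetical helper, not part of AJAX.
 * Splits text at newline characters into a newly allocated array of newly
 * allocated strings. Stores the array in *plines and returns the number of
 * elements, or 0 (with *plines set to NULL) if there are no lines or an
 * allocation fails. */
size_t split_lines(const char *text, char ***plines)
{
    size_t count = 0;
    size_t i = 0;
    const char *p;
    char **lines;

    *plines = NULL;

    /* First pass: count lines; a final line without '\n' also counts */
    for (p = text; *p; p++)
        if (*p == '\n' || p[1] == '\0')
            count++;

    if (count == 0)
        return 0;

    lines = calloc(count, sizeof *lines);
    if (!lines)
        return 0;

    /* Second pass: copy each line without its newline */
    for (p = text; *p; ) {
        size_t len = strcspn(p, "\n");

        lines[i] = malloc(len + 1);
        if (!lines[i]) {
            free_lines(lines, i);
            return 0;
        }
        memcpy(lines[i], p, len);
        lines[i][len] = '\0';
        i++;

        p += len;
        if (*p == '\n')
            p++;
    }

    *plines = lines;
    return count;
}
```

As in the AJAX example, the caller is responsible for releasing everything that was returned, here with a single call to free_lines(lines, n).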

6.5.20. String Iteration

String iteration allows you to step through a string a single character at a time. The AJAX datatype for this is:

AjIStr

String iteration object.

To iterate through a string you must first instantiate the string iteration object. Two constructors are provided for forward (start to end) or reverse (end to start) iteration:

/* Constructor for forward iteration */
AjIStr  ajStrIterNew (const AjPStr thys);      

/* Constructor for reverse iteration */
AjIStr  ajStrIterNewBack (const AjPStr thys);   

To iterate through a string use either of the following functions. They return NULL if iteration cannot continue:

AjIStr  ajStrIterNext (AjIStr iter);
AjIStr  ajStrIterNextBack (AjIStr iter);

To retrieve the character or the remainder of the string at the current position use:

/* Retrieve the character */
char  ajStrIterGetK (const AjIStr iter);    

/* Retrieve the remainder of the string */
const char*  ajStrIterGetC (const AjIStr iter);   

To change the character at the current position use:

void  ajStrIterPutK (AjIStr iter, char chr);

The following functions return ajTrue once iteration is complete, that is, when no characters remain in the direction of travel. They are typically used to control the iteration loop:

/* Test if forward iteration is complete */
AjBool  ajStrIterDone (const AjIStr iter);

/* Test if reverse iteration is complete */
AjBool  ajStrIterDoneBack (const AjIStr iter);

A string iterator can be reset so that it points to the start or end of the string:

/* Reset forward iteration to start of string */
void  ajStrIterBegin (AjIStr iter);   

/* Reset reverse iteration to end of string */
void  ajStrIterEnd(AjIStr iter);  

Once you are done, you must free the string iteration object:

/* Destructor for iteration object */
void  ajStrIterDel (AjIStr *iter); 

The example code below uses the iteration functions to step through a string, replacing each dash character in the original string with a full stop and copying all other characters to a new string:

    AjPStr   str  = NULL;
    AjPStr   seq  = NULL;
    AjIStr   iter = NULL;
    char     chr;


    str = ajStrNewC("--AALIY---TIWLASL--");
    seq = ajStrNew();

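    iter = ajStrIterNew(str);

    /* ajStrIterNew positions the iterator on the first character, so
       process the current character before advancing; ajStrIterDone
       is true once the iterator has moved past the last character */
    while(!ajStrIterDone(iter))
    {
        if((chr = ajStrIterGetK(iter)) == '-')
            ajStrIterPutK(iter, '.');
        else
            ajStrAppendK(&seq, chr);

        ajStrIterNext(iter);
    }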

    ajFmtPrint("str: %S\n", str);
    ajFmtPrint("seq: %S\n", seq);

    ajStrDel(&str);
    ajStrDel(&seq);
    ajStrIterDel(&iter);
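
The same test-then-advance loop can be sketched in plain C with a small pointer-based iterator. The CharIter type and the iter_* and replace_gaps functions are hypothetical names for illustration; the real AjIStr layout is not described in this section and is not assumed here:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical iterator over a writable, NUL-terminated buffer */
typedef struct {
    char *ptr;   /* current position */
    char *end;   /* one past the last character */
} CharIter;

/* Position the iterator on the first character of buf */
void iter_init(CharIter *it, char *buf)
{
    it->ptr = buf;
    it->end = buf + strlen(buf);
}

/* Non-zero once every character has been visited */
int iter_done(const CharIter *it)
{
    return it->ptr >= it->end;
}

/* Advance by one character, never moving past the end */
void iter_next(CharIter *it)
{
    if (it->ptr < it->end)
        it->ptr++;
}

/* Replace every dash in buf with a full stop and copy every other
 * character to out, which must have room for strlen(buf) + 1 bytes.
 * Returns the number of dashes replaced. */
size_t replace_gaps(char *buf, char *out)
{
    CharIter it;
    size_t replaced = 0;

    iter_init(&it, buf);

    while (!iter_done(&it)) {
        if (*it.ptr == '-') {
            *it.ptr = '.';
            replaced++;
        } else {
            *out++ = *it.ptr;
        }
        iter_next(&it);
    }
    *out = '\0';

    return replaced;
}
```

Testing for completion before reading the current character means the first character is processed and an empty string is handled without a special case.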

6.5.21. String Tokenisation

There is a dedicated AJAX datatype for string tokenisation:

AjPStrTok

String tokenisation object.

String tokenisation functions have the prefix ajStrToken and are used to delimit a string into text tokens and extract them.

To tokenise a string you must first instantiate the string tokenisation object. The constructors provided take the delimiter characters either as a C-type (char *) string or as an EMBOSS string (AjPStr) and return a pointer to a string tokenisation object:

AjPStrTok  ajStrTokenNewC (const AjPStr str, const char* txtdelim);    
AjPStrTok  ajStrTokenNewS (const AjPStr str, const AjPStr strdelim);

The string to be tokenised and a delimiter string can also be set for an existing string tokenisation object:

/* Specify string to be tokenised only */
AjBool  ajStrTokenAssign (AjPStrTok*  Ptoken, const AjPStr str);                         

/* Specify string and delimiters */
AjBool  ajStrTokenAssignC (AjPStrTok* Ptoken, const AjPStr str, const char* txtdelim);   

/* Specify string and delimiters */
AjBool  ajStrTokenAssignS (AjPStrTok* Ptoken, const AjPStr str, const AjPStr strdelim);  

These functions will allocate the string tokenisation object if necessary and can therefore be used as an alternative to the constructor functions. It is, however, much clearer if they are used only to update an object that has been created using the standard constructors.

To parse individual tokens from a string call:

AjBool  ajStrTokenNextFind (AjPStrTok* Ptoken, AjPStr* Pstr);
AjBool  ajStrTokenNextFindC (AjPStrTok* Ptoken, const char* strdelim, AjPStr* Pstr);

ajStrTokenNextFindC will update the string tokenisation object with the string of delimiters (strdelim).

To return the remainder of a string that has been partially parsed call:

AjBool  ajStrTokenRestParse (AjPStrTok* Ptoken, AjPStr* Pstr);

If you want the delimiter to be treated as a string rather than individual characters, in other words to tokenise the string using another string, use the following to parse individual tokens from the string:

AjBool  ajStrTokenNextParse (AjPStrTok* Ptoken, AjPStr* Pstr);
AjBool  ajStrTokenNextParseC (AjPStrTok* Ptoken, const char* txtdelim, AjPStr* Pstr);
AjBool  ajStrTokenNextParseS (AjPStrTok* Ptoken, const AjPStr strdelim, AjPStr* Pstr);

These functions return ajTrue if the token was parsed successfully or ajFalse otherwise. ajStrTokenNextParseC and ajStrTokenNextParseS will update the string tokenisation object with the string of delimiters (txtdelim or strdelim) provided. Note that these functions can return ajTrue but write an empty token (Pstr) in cases where the delimiter has been changed since the previous call.

The string tokenisation object can be reset (all strings cleared) so that it is ready for re-use by calling:

void  ajStrTokenReset (AjPStrTok* Ptoken);

Once you are done you must free the string tokenisation object:

void  ajStrTokenDel (AjPStrTok* Ptoken);

The example code below uses the string tokenisation object and its functions to retrieve individual lines from a string that contains multiple newline characters:

    AjPStr    inputstring = NULL;
    AjPStr    token       = NULL;
    AjPStrTok tokens      = NULL;

    inputstring = ajStrNewC("First line\nSecond line\nThird line\n");
    tokens = ajStrTokenNewC(inputstring, "\n");

    while(ajStrTokenNextFind(&tokens, &token))
        ajFmtPrint("token: %S\n", token);

    ajStrTokenDel(&tokens);
    ajStrDel(&token);
    ajStrDel(&inputstring);
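
This section distinguishes between a delimiter treated as a set of individual characters and a delimiter treated as a whole string. The distinction can be illustrated in standard C with strcspn and strstr. The two helpers below are hypothetical and say nothing about how the AJAX tokeniser is implemented; they only show how the two interpretations split the same input differently:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers, not part of AJAX. Each returns the length of the
 * first token in s and, through *next, where scanning should resume. */

/* Any single character in delims ends the token */
size_t token_by_charset(const char *s, const char *delims, const char **next)
{
    size_t len = strcspn(s, delims);

    /* Skip the one delimiter character that ended the token, if any */
    *next = s[len] ? s + len + 1 : s + len;

    return len;
}

/* Only the complete delimiter string ends the token */
size_t token_by_string(const char *s, const char *delim, const char **next)
{
    /* An empty delimiter never matches */
    const char *hit = *delim ? strstr(s, delim) : NULL;
    size_t len;

    if (!hit) {
        /* Delimiter not found: the token is the rest of the string */
        len = strlen(s);
        *next = s + len;
        return len;
    }

    len = (size_t)(hit - s);
    *next = hit + strlen(delim);

    return len;
}
```

For the input "id=>value=more" and the delimiter "=>", both helpers return "id" as the first token. The character-set version then resumes at ">value=more", because '=' alone ended the token, whereas the whole-string version resumes at "value=more" and treats the later lone '=' as ordinary text.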

A string (str or txt) may be tokenised with either whitespace characters or a specified set of delimiters given as a text string (txtdelim):

/* Tokenise C-type string by set of delimiters */
AjPStr  ajCharParseC (const char* txt, const char* delim);      

/* Tokenise by set of delimiters */
const AjPStr  ajStrParseC (const AjPStr str, const char* txtdelim);  

/* Tokenise by whitespace */
const AjPStr  ajStrParseWhite (const AjPStr str);

These functions use the C strtok function and return tokens from the string. The first time the function is called it is passed the string to be parsed. For subsequent calls on the same string it is passed NULL as the first argument. A pointer to the token is returned or NULL when all tokens have been parsed:

    AjPStr       inputstring = NULL;
    const AjPStr token       = NULL;


    inputstring = ajStrNewC("  First word in this string is 'First'");

    token = ajStrParseWhite(inputstring);

    /* Prints 'First' to the screen */
    ajFmtPrint("token: %S\n", token);

    /* Prints the rest of the words, one word at a time */
    while((token = ajStrParseWhite(NULL)))
        ajFmtPrint("token: %S\n", token);

    ajStrDel(&inputstring);
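
The calling convention described above is inherited from the C strtok function itself. A minimal sketch of that convention in standard C follows; split_white is a hypothetical helper, not an AJAX function:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper.
 * Splits buf at whitespace using strtok and stores up to max token
 * pointers in tokens. buf is modified: strtok overwrites the delimiter
 * that follows each token with '\0'. Returns the number of tokens stored. */
size_t split_white(char *buf, char **tokens, size_t max)
{
    size_t n = 0;
    char *tok;

    /* The first call passes the buffer; later calls pass NULL to continue */
    for (tok = strtok(buf, " \t\n"); tok && n < max;
         tok = strtok(NULL, " \t\n"))
        tokens[n++] = tok;

    return n;
}
```

Because strtok keeps its position in hidden static state between calls, two strings cannot be tokenised in an interleaved fashion with it. It seems reasonable to assume that the same caution applies to functions built on strtok, so finish parsing one string before starting the next.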

To count the tokens in a string in which the tokens are delimited by either whitespace characters or a specified set of delimiters (strdelim) use:

/* Count tokens delimited by whitespace */
ajuint  ajStrParseCount (const AjPStr line);                           

/* Count tokens delimited by set of delimiters */
ajuint  ajStrParseCountS (const AjPStr line, const AjPStr strdelim); 
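
Counting tokens does not require modifying or copying the string. A plain-C sketch of the idea, using strspn and strcspn, is shown below. The function count_tokens is hypothetical, and it assumes that a run of consecutive delimiters separates tokens only once, which may or may not match the AJAX functions exactly:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of AJAX.
 * Counts the tokens in s that are separated by runs of any of the
 * characters in delims. The string is not modified. */
size_t count_tokens(const char *s, const char *delims)
{
    size_t count = 0;

    for (;;) {
        s += strspn(s, delims);     /* skip leading delimiters */
        if (*s == '\0')
            break;
        count++;
        s += strcspn(s, delims);    /* skip the token itself */
    }

    return count;
}
```

Counting first is useful when the number of tokens determines how much memory to allocate, or when a line must be validated before it is parsed.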

6.5.22. Handling C-type (char *) Strings

For convenience, several groups of functions are provided for handling C-type (char *) strings. They all have the prefix ajChar to distinguish them from the other functions.

6.5.22.1. C-type string constructor and destructor functions

In the same way as for EMBOSS strings, a C-type string can be created from a starting C-type (char *) string or a string object, with or without a reserved size:

/* Create a string from a C-type string     */
char*   ajCharNewC (const char* txt);                                   

/* Create a string from a string object     */
char*   ajCharNewS (const AjPStr thys);                                 

/* Create an empty string of reserved size. */
char*   ajCharNewRes(ajuint size);                                      

/* Create a string with reserved size from a C-type string  */
char*   ajCharNewResC(const char* txt, ajuint size);                    

/* Create a string with reserved size from a string object  */ 
char*   ajCharNewResS(const AjPStr str, ajuint size);                   

/* Create a string from a C-type string with specified length */
char*   ajCharNewResLenC(const char* txt, ajuint size, ajuint len);  

In all cases a pointer to the allocated memory is returned, which must be freed once you are done with it. To delete a C-type string call:

void  ajCharDel (char** Ptxt);

For example:

    char *string = NULL;

    string = ajCharNewC("This is a text string");

    ajCharDel(&string);
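
Note that the destructor takes the address of the pointer rather than the pointer itself. A common reason for this design in C is that the destructor can then reset the caller's pointer, so that a stale pointer is not used or freed twice. The sketch below illustrates the pattern with hypothetical functions text_new and text_del; that ajCharDel behaves in exactly this way is an assumption:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical constructor: returns a heap-allocated copy of txt,
 * or NULL if memory cannot be allocated */
char *text_new(const char *txt)
{
    size_t len = strlen(txt);
    char *copy = malloc(len + 1);

    if (copy)
        memcpy(copy, txt, len + 1);

    return copy;
}

/* Hypothetical destructor: frees the string and resets the caller's
 * pointer to NULL. Calling it again on the same pointer is harmless. */
void text_del(char **ptxt)
{
    if (!ptxt || !*ptxt)
        return;

    free(*ptxt);
    *ptxt = NULL;
}
```

With this pattern, a second call such as text_del(&string) after the string has already been deleted does nothing, because the pointer was set to NULL by the first call.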

6.5.22.2. C-type string comparison functions

Most of the string comparison functions (Section 6.5.17, “String Comparison Functions”) are available for C-type strings too:

AjBool  ajCharMatchC (const char* txt1, const char* txt2);
AjBool  ajCharMatchCaseC (const char* txt1, const char* txt2);
AjBool  ajCharMatchWildC (const char* txt1, const char* txt2);
AjBool  ajCharMatchWildCaseC (const char* txt1, const char* txt2);
AjBool  ajCharMatchWildNextC (const char* txt1, const char* txt2);
AjBool  ajCharMatchWildWordC (const char* str, const char* txt);
AjBool  ajCharMatchWildNextCaseC (const char* txt1, const char* txt2);
AjBool  ajCharMatchWildWordCaseC (const char* str, const char* txt);
AjBool  ajCharPrefixC (const char* txt, const char* pref);
AjBool  ajCharPrefixCaseC (const char* txt, const char* pref);
AjBool  ajCharSuffixC (const char* txt, const char* suff);
AjBool  ajCharSuffixCaseC (const char* txt, const char* suff);

Variants of these functions that take a string object as the second argument are available (not shown) and have the suffix S rather than C.

6.5.23. Handling String Formatting

Functions for formatting and printing a string are defined in the library files ajfmt.h and ajfmt.c. Conversion characters are defined for all the EMBOSS fundamental datatypes (Section 5.1, “Basic Datatypes”) and are an extension of the basic C conversion codes. They are:

  • An AJAX string object AjPStr is printed by %S.

  • A C-type (char*) string is printed by %s. If the pointer is NULL then <null> is printed.

  • The filename of an AJAX file object AjPFile is printed by %F.

  • A boolean variable AjBool is printed by %b (for output as "Y/N") or %B ("Yes/No").

  • The date and time (AjPDate) is printed by %D.

Functions with the prefix ajFmtScan are equivalent to the C *scan* functions:

ajint  ajFmtScanS (const AjPStr str, const char* fmt, ...);
ajint  ajFmtScanC (const char* txt, const char* fmt, ...); 
ajint  ajFmtScanF (AjPFile thys, const char* fmt, ...);    

Functions with the prefix ajFmtPrint are equivalent to the C *print* functions:

/* format and emit the "..." arguments according to fmt; writes to stdout */
void  ajFmtPrint (const char *fmt, ...);

/* format and emit the "..." arguments according to fmt; writes to a file object */
void  ajFmtPrintF (AjPFile file,
                   const char *fmt, ...);

/* format and emit the "..." arguments according to fmt; writes to a C FILE stream */
void  ajFmtPrintFp (FILE *stream,
                    const char *fmt, ...);

/* formats the "..." arguments into a buffer with a maximum size according to fmt */
ajint  ajFmtPrintCL (char *buf, ajint size,
                     const char *fmt, ...);

/* Block and print a string. String is split at given delimiters */
void  ajFmtPrintSplit(AjPFile outf, const AjPStr str,
                      const char *prefix, ajint len,
                      const char *delim);

/* Formats the "..." arguments into an AjPStr according to fmt */
AjPStr  ajFmtPrintS (AjPStr *pthis, const char *fmt, ...);

/* Formats the "..." arguments and appends to an AjPStr according to fmt */
AjPStr  ajFmtPrintAppS (AjPStr *pthis, const char *fmt, ...);
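
Formatting into a caller-supplied buffer with a maximum size, as ajFmtPrintCL does, has a direct counterpart in the standard C snprintf function. The sketch below uses snprintf only; format_flag is a hypothetical helper that imitates by hand two of the conventions listed above (a NULL C-type string shown as <null>, and a boolean shown as Yes/No). The return value follows snprintf, and nothing is assumed here about what ajFmtPrintCL returns:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper.
 * Formats "label: Yes" or "label: No" into buf, writing at most size
 * bytes including the terminating '\0'. Returns the length that the
 * complete result needs, so a return value >= size means the output
 * was truncated. */
int format_flag(char *buf, size_t size, const char *label, int flag)
{
    return snprintf(buf, size, "%s: %s",
                    label ? label : "<null>",
                    flag ? "Yes" : "No");
}
```

Checking the return value against the buffer size is the usual way to detect truncation; when the length of the result cannot be predicted, formatting into a dynamic string (as ajFmtPrintS does with an AjPStr) avoids the problem altogether.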