Efficient and flexible string handling is fundamental to molecular sequence manipulation. Accordingly, string handling is the best developed area in the AJAX library. The functionality, which is spread over several library files, is comprehensive and includes:
String construction
String referencing and dereferencing where a handle on, but not a copy of, a string is required
Assignment functions to assign a value to a string
Functions to combine two strings or parts of a string. The types of operation include appending, insertion and pasting (overwriting character positions)
Cut functions to remove substrings, regions or characters from a target string
Substitutions of characters or substrings of a string with other characters/substrings
Query functions to test the properties of a string
Retrieval of characters and string properties (such as length)
Conversion functions to convert a string to some other datatype
String formatting
String comparison functions
Search functions to find substrings or characters in strings
String parsing functions to parse text tokens from strings
String iteration, which allows you to step through a string a single character at a time
String tokenisation
Formatting and printing. Conversion characters are defined for all the EMBOSS fundamental datatypes (Section 5.1, “Basic Datatypes”) and are an extension of the basic C conversion codes
For convenience, most functions implemented for an AJAX string parameter have a corresponding function with a C-type (char *) string parameter. A string may be defined in the ACD file and retrieved from the C source code by a call to ajAcdGetString. More typically though, strings are created directly in the code.
In contrast to standard C-type (char *) strings, the AJAX string object (AjPStr) is dynamic; memory is (re)allocated as needed so that you never run out of space when calling string functions. AJAX strings are reference counted: the object keeps track of how many references (pointers) to the string there are in the code. It is not until all references to a string have been deleted that the string itself is freed.
The string object definition is shown below:
typedef struct AjSStr
{
    ajuint Res;
    ajuint Len;
    char *Ptr;
    ajuint Use;
    ajint Padding;
} AjOStr;

#define AjPStr AjOStr*
typedef AjPStr* AjPPStr;
Ptr holds the character string and Len is its length. In contrast to C-type strings, the character string may or may not be NULL terminated. The library functions for printing AjPStr objects use the length field (Len) to decide how many characters to print and won't stop at the first NULL if there is one.
Res is the reserved dynamic memory associated with the object and is always at least equal to Len but is often more. It is used for handling dynamic reallocation of string memory. Use is the string reference counter mentioned above. Finally, the Padding element pads the string to an alignment boundary (to mollify strict compilers).
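The role of Len and Res can be illustrated with a small standalone sketch. It does not use the AJAX library: DemoStr, demoStrNewLen, demoStrWrite and demoStrDel are hypothetical names invented for this illustration, and the real AjOStr handling may differ in detail. The point shown is only that output driven by a stored length does not stop at an embedded terminator.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a length-tracked string; NOT the AJAX AjOStr. */
typedef struct DemoStr
{
    size_t Res;   /* bytes reserved in Ptr (always >= Len) */
    size_t Len;   /* bytes in use; may include embedded '\0' */
    char *Ptr;    /* character data */
} DemoStr;

/* Create from a buffer of known length, reserving at least 'res' bytes. */
DemoStr *demoStrNewLen(const char *txt, size_t len, size_t res)
{
    DemoStr *s = malloc(sizeof *s);

    assert(s != NULL);
    s->Res = (res > len + 1) ? res : len + 1;
    s->Ptr = malloc(s->Res);
    assert(s->Ptr != NULL);
    memcpy(s->Ptr, txt, len);
    s->Ptr[len] = '\0';
    s->Len = len;

    return s;
}

/* Write exactly Len bytes, so an embedded '\0' does not end the output. */
size_t demoStrWrite(const DemoStr *s, FILE *out)
{
    return fwrite(s->Ptr, 1, s->Len, out);
}

/* Free the object and reset the caller's pointer to NULL. */
void demoStrDel(DemoStr **ps)
{
    if (ps && *ps)
    {
        free((*ps)->Ptr);
        free(*ps);
        *ps = NULL;
    }
}
```

For a five-byte buffer holding an embedded terminator, strlen on Ptr would stop at that terminator, whereas demoStrWrite emits all five bytes because it relies on Len.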
The string object and the internals of string memory management are described in greater detail elsewhere (Section 5.5, “Programming with Objects”).
AJAX library files for handling strings are listed in the table (Table 6.5, “AJAX Library Files for Handling Strings”). Library file documentation, including a complete description of datatypes and functions, is available at:
http://emboss.open-bio.org/rel/dev/libs/
Library File Documentation | Description |
---|---|
ajstr | General string handling |
ajfmt | String formatting functions |
ajstr.h/c. Most of the functions you will ever need for general string handling. They define the basic string object (AjPStr), string iteration object (AjIStr) and string token parser object (AjPStrTok) for use with the functions.
ajfmt.h/c. Functions for string formatting. The functions are similar to the C functions printf, fprintf etc., but the set of conversion specifiers and other functionality is extended. They also contain a static data structure and functions for handling formatting at a low level (Section 6.5.23, “Handling String Formatting”).
You are unlikely to need the static data structures and functions unless you plan to extend the string library.
A typical ACD definition for string input:
string: delimiter [
  default: "|"
  information: "Delimiter of records in text output file"
  knowntype: "output delimiter"
]
A standard parameter name (Section A.1.3, “Parameter Naming Conventions”) might be used, depending on the specific use-case of the data definition.
Attributes that are typically specified are summarised below. They are datatype-specific (Section A.5, “Datatype-specific Attributes”) unless they are indicated as being global attributes (Section A.4, “Global Attributes”).
default:
A global attribute that specifies a default value.
information:
A global attribute that specifies the user-prompt and is used in the application documentation.
knowntype:
This global attribute should always be specified for string inputs. If the output is not of any of the standard EMBOSS known types (Section 4.3.5.3.1, “Application Data Known Types File (knowntypes.standard)”) then ApplicationName output is the recommended value.
For handling strings, including those defined in the ACD file (string ACD datatype), use:
AjPStr
String.
Two further datatypes are for string-related operations: AjIStr (string iteration object) and AjPStrTok (string tokenisation object).
Datatypes and functions for handling string input via the ACD file are shown below (Table 6.6, “Datatypes and Functions for String Input”).
To read a string | |
---|---|
ACD datatype | string |
AJAX datatype | AjPStr |
To retrieve from ACD | ajAcdGetString |
To retrieve an input string an object pointer is declared and then initialised using ajAcdGetString:
AjPStr delimiter = NULL;

delimiter = ajAcdGetString("delimiter");
To use a string object that is not defined in the ACD file you must first instantiate the appropriate object pointer. The default string construction function is:
/* Create a string object. */
AjPStr ajStrNew (void);
All constructors return the address of a new object. The pointers do not need to be initialised to NULL but it is good practice to do so:
AjPStr str = NULL;

str = ajStrNew();
/* The object is instantiated and ready for use */
You must free the memory for an object once you are finished with it. The default string destructor function is:
/* Delete a string object. */
void ajStrDel (AjPStr *Pstr);
It is the responsibility of the calling function to destroy any objects it creates:
AjPStr str = NULL;

str = ajStrNew();
/* Do something with the instantiated object */
ajStrDel(&str);
/* The memory is freed and the pointer reset to NULL, ready for re-use. */

str = ajStrNew();
/* Do something else with the new object. The pointer variable is reallocated. */
ajStrDel(&str);
/* Done with the object so the memory is freed. */
A variety of alternative string constructor functions are available. A string can be constructed from an existing string object (AjPStr) or C-type (char *) string, with an optional reserved size:
/* Construct from C-type string */
AjPStr ajStrNewC (const char *txt);
/* Construct from C-type string with reserved size */
AjPStr ajStrNewResC (const char *txt, ajuint size);
/* Construct from C-type string with reserved size and explicit length */
AjPStr ajStrNewResLenC (const char *txt, ajuint size, ajuint len);
/* Construct with reserved size */
AjPStr ajStrNewRes(ajuint size);
/* Construct from string object */
AjPStr ajStrNewS (const AjPStr str);
/* Construct from string object with reserved size */
AjPStr ajStrNewResS (const AjPStr str, ajuint size);
ajStrNewResLenC is identical to ajStrNewResC except that the string length is passed to ajStrNewResLenC for speed.
They are all used in the same way as the default constructor, i.e. they return a pointer to the new object.
There is a string referencing function:
/* Reference an existing string */
AjPStr ajStrNewRef (AjPStr str);
In contrast to the other constructor functions, ajStrNewRef does not create a new object but instead returns a pointer to the string passed in and increases its reference count.
There is a string dereferencing function:
/* Dereference an existing string */
AjBool ajStrDelStatic (AjPStr* Pstr);
ajStrDelStatic will set the string pointer to NULL and decrement the use count of the string to which it refers. In contrast to the default destructor function, strings with a use count of 1 are not freed to avoid freeing and reallocating memory when they are reused. Memory reserved for the string is never deleted by this function and can be reused.
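The referencing behaviour described above can be modelled in a few lines of standalone C. This is a teaching sketch under stated assumptions, not the AJAX implementation: RcStr, rcStrNewC, rcStrNewRef and rcStrDel are hypothetical names, and the sketch frees the object as soon as the count reaches zero, which mirrors the default destructor rather than ajStrDelStatic.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted string; a teaching model, NOT AjOStr. */
typedef struct RcStr
{
    char *Ptr;      /* character data */
    unsigned Use;   /* number of live references */
} RcStr;

/* Create a new object holding a copy of txt, with one reference. */
RcStr *rcStrNewC(const char *txt)
{
    RcStr *s = malloc(sizeof *s);

    assert(s != NULL && txt != NULL);
    s->Ptr = malloc(strlen(txt) + 1);
    assert(s->Ptr != NULL);
    strcpy(s->Ptr, txt);
    s->Use = 1;

    return s;
}

/* Hand out another reference to the same object: no copy is made. */
RcStr *rcStrNewRef(RcStr *s)
{
    assert(s != NULL);
    s->Use++;

    return s;
}

/* Drop one reference and reset the caller's pointer to NULL.
   The memory is freed only when the last reference goes.
   Returns 1 if the object was freed, 0 otherwise. */
int rcStrDel(RcStr **ps)
{
    int freed = 0;

    if (!ps || !*ps)
        return 0;

    if (--(*ps)->Use == 0)
    {
        free((*ps)->Ptr);
        free(*ps);
        freed = 1;
    }
    *ps = NULL;

    return freed;
}
```

With two references to one object, the first call to rcStrDel only decrements the count; the character data stays valid until the second reference is also deleted.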
The string assignment functions will assign a value to a string. A string can be assigned from a character, an existing string object (AjPStr) or C-type (char *) string, or a substring of an appropriate datatype. Some function variants allow optional reserved sizes to be specified:
/* Assign from character */
AjBool ajStrAssignK (AjPStr* Pstr, char chr);
/* Assign from C-type string */
AjBool ajStrAssignC (AjPStr* Pstr, const char* txt);
/* Assign from string object */
AjBool ajStrAssignS (AjPStr* Pstr, const AjPStr str);
/* Assign from C-type string up to a given length */
AjBool ajStrAssignLenC (AjPStr* Pstr, const char* txt, ajuint ilen);
/* Assign using a pointer only. The reference count is incremented */
AjBool ajStrAssignRef (AjPStr* Pstr, AjPStr refstr);
/* Assign from C-type string with reserved size */
AjBool ajStrAssignResC (AjPStr* Pstr, ajuint size, const char* txt);
/* Assign from string object with reserved size */
AjBool ajStrAssignResS (AjPStr* Pstr, ajuint i, const AjPStr str);
/* Assign from substring of C-type string */
AjBool ajStrAssignSubC (AjPStr* Pstr, const char* txt, ajint pos1, ajint pos2);
/* Assign from substring of string object */
AjBool ajStrAssignSubS (AjPStr* Pstr, const AjPStr str, ajint pos1, ajint pos2);
ajStrAssignLenC is identical to ajStrAssignC except that the source string is only copied up to a specified length.
Memory for the string is allocated to NULL target pointers if necessary, although to keep the calling code intuitive we strongly recommend that a string object is first instantiated by calling ajStrNew before any of these functions are used.
For example, in the following code it is clear you are dealing with two separate strings:
AjPStr str = NULL;
AjPStr strcopy = NULL;

str = ajStrNewC("A string");
strcopy = ajStrNew();

/* The source is a string object, so the S variant is used */
if(!ajStrAssignS(&strcopy, str))
    ajFatal("String not assigned");

ajStrDel(&str);
ajStrDel(&strcopy);
Whereas the following code is perfectly valid but is less clear:
AjPStr str = NULL;
AjPStr strcopy = NULL;

str = ajStrNewC("A string");

/* strcopy is NULL here and is allocated by the assignment function */
if(!ajStrAssignS(&strcopy, str))
    ajFatal("String not assigned");

ajStrDel(&str);
ajStrDel(&strcopy);
The string combination functions will combine two strings together. They fall into a variety of classes described below.
The string append functions will append a source string to a target string. An individual character or multiple characters, an existing string object (AjPStr) or C-type (char *) string, or a substring of either of the latter can be appended:
/* Append a C-type string */
AjBool ajStrAppendC (AjPStr* Pstr, const char* txt);
/* Append a single character */
AjBool ajStrAppendK (AjPStr* Pstr, char chr);
/* Append a string object */
AjBool ajStrAppendS (AjPStr* Pstr, const AjPStr str);
/* Append multiples of a single character */
AjBool ajStrAppendCountK (AjPStr* Pstr, char chr, ajuint num);
/* Append a C-type string up to a given length */
AjBool ajStrAppendLenC (AjPStr* Pstr, const char* txt, ajuint len);
/* Append a substring of a string object */
AjBool ajStrAppendSubS (AjPStr* Pstr, const AjPStr str, ajint pos1, ajint pos2);
ajStrAppendLenC is identical to ajStrAppendC except that a region from the source string up to a specified length is appended.
The string join functions are similar to the append functions except that they cut the source and target strings at specified positions before appending:
/* Cut down string at pos1 and add string2 from position pos2. */
AjBool ajStrJoinC (AjPStr* Pstr, ajint pos1, const char* txt, ajint pos2);
AjBool ajStrJoinS (AjPStr* Pstr, ajint pos1, const AjPStr str, ajint pos2);
The string insert functions will insert a character, an existing string object (AjPStr) or C-type (char *) string into a string:
/* Insert a C-type string */
AjBool ajStrInsertC (AjPStr* pthis, ajint pos, const char* str);
/* Insert a character */
AjBool ajStrInsertK (AjPStr* pthis, ajint begin, char insert);
/* Insert a string */
AjBool ajStrInsertS (AjPStr* pthis, ajint pos, const AjPStr str);
The string paste functions will overwrite the target string with the source string (or character) at a specified point (pos), using (optionally) up to a specified number of characters from the source string:
/* Paste string */
AjBool ajStrPasteS (AjPStr* Pstr, ajint pos, const AjPStr str);
/* Paste specified number of characters */
AjBool ajStrPasteCountK (AjPStr* Pstr, ajint pos, char chr, ajuint num);
/* Paste portion of C-type string */
AjBool ajStrPasteMaxC (AjPStr* Pstr, ajint pos, const char* txt, ajuint n);
/* Paste portion of string object */
AjBool ajStrPasteMaxS (AjPStr* Pstr, ajint pos, const AjPStr str, ajuint n);
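To make the difference between pasting (overwriting) and inserting concrete, here is a minimal standalone sketch of a paste operation on plain C strings. The function pasteMax is hypothetical and is not part of AJAX; in particular, letting the result grow when the pasted text runs past the end of the target is an assumption of this sketch, not a statement about how the library behaves.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Overwrite dst from position pos with up to n characters of src.
   Returns a newly allocated string (caller frees), or NULL if pos lies
   beyond the end of dst or allocation fails. The result is longer than
   dst only if the pasted text overruns the end. */
char *pasteMax(const char *dst, size_t pos, const char *src, size_t n)
{
    size_t dlen;
    size_t slen;
    size_t newlen;
    char *out;

    assert(dst != NULL && src != NULL);
    dlen = strlen(dst);
    slen = strlen(src);

    if (n < slen)
        slen = n;              /* use at most n characters of src */
    if (pos > dlen)
        return NULL;

    newlen = (pos + slen > dlen) ? pos + slen : dlen;
    out = malloc(newlen + 1);
    if (!out)
        return NULL;

    memcpy(out, dst, dlen);         /* start from the target ... */
    memcpy(out + pos, src, slen);   /* ... and overwrite in place */
    out[newlen] = '\0';

    return out;
}
```

Pasting two characters of "xyz" into "AAAAAA" at position 2 yields "AAxyAA": the length is unchanged, whereas an insertion at the same point would have produced an eight-character string.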
In addition there is a string masking function which will replace all characters in the target string with a mask character over a specified range:
/* Replace all characters in a region with mask characters */
AjBool ajStrMaskRange(AjPStr* str, ajint begin, ajint end, char maskchar);
The string cut functions will remove regions or individual characters from a target string. A selection of the available functions in various functional categories is described below. All the functions return ajTrue if the operation was performed successfully or ajFalse otherwise.
A number of characters can be removed from the start, end or interior of a string using:
/* Removes a number of characters from the start of a string. */
AjBool ajStrCutStart(AjPStr* Pstr, ajuint len);
/* Removes a number of characters from the end of a string. */
AjBool ajStrCutEnd(AjPStr* Pstr, ajuint len);
/* Removes a region from a string. */
AjBool ajStrCutRange(AjPStr* Pstr, ajint pos1, ajint pos2);
Functions to remove characters from a string include:
/* Removes non-sequence characters (all but alphabetic characters and asterisk) */
AjBool ajStrRemoveGap(AjPStr* thys);
/* Removes HTML mark-up from a string. */
AjBool ajStrRemoveHtml(AjPStr* pthis);
/* Removes last character from a string if it is a newline character. */
AjBool ajStrRemoveLastNewline(AjPStr* Pstr);
/* Removes all of a given set of characters from a string. */
AjBool ajStrRemoveSetC(AjPStr* Pstr, const char *txt);
/* Removes all whitespace characters from a string. */
AjBool ajStrRemoveWhite(AjPStr* Pstr);
/* Removes excess whitespace characters from a string. */
AjBool ajStrRemoveWhiteExcess(AjPStr* Pstr);
/* Removes excess space characters from a string. */
AjBool ajStrRemoveWhiteSpaces(AjPStr* Pstr);
/* Removes all characters after the first wildcard character (if found). */
AjBool ajStrRemoveWild(AjPStr* Pstr);
ajStrRemoveWhiteExcess and ajStrRemoveWhiteSpaces both remove the leading/trailing whitespace from a string and replace multiple spaces with a single space. Additionally, ajStrRemoveWhiteSpaces converts tabs to spaces but leaves newline characters unchanged.
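The "remove excess whitespace" behaviour described above (strip the ends, collapse internal runs to one space) can be sketched on a plain C string as follows. The helper collapseWhite is hypothetical and standalone; it treats every character for which isspace is true alike, so it does not reproduce the tab and newline distinction between the two AJAX functions.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical in-place helper: strip leading/trailing whitespace and
   collapse each internal run of whitespace to a single space. */
void collapseWhite(char *s)
{
    char *in = s;
    char *out = s;
    int pending = 0;   /* a whitespace run was seen since the last word */

    assert(s != NULL);

    for (; *in; in++)
    {
        if (isspace((unsigned char) *in))
        {
            pending = 1;
            continue;
        }

        if (pending && out != s)
            *out++ = ' ';   /* one space between words, none at the start */

        pending = 0;
        *out++ = *in;
    }

    *out = '\0';            /* a trailing run is simply dropped */
}
```

Because the output never gets ahead of the input, the string can be rewritten in place without a second buffer.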
Functions are available to keep only a defined region of a string, or to remove all characters in a string other than those in a defined set. The character sets can be provided either as a string object (AjPStr) or C-type (char *) string:
/* Trim sequence down to a defined range */
AjBool ajStrKeepRange(AjPStr* Pstr, ajint pos1, ajint pos2);
/* Removes all characters that are not in a given set. */
AjBool ajStrKeepSetC(AjPStr* Pstr, const char* txt);
/* Removes all characters that are not in a given set. */
AjBool ajStrKeepSetS(AjPStr* Pstr, const AjPStr str);
/* Removes all characters that are not alphabetic. */
AjBool ajStrKeepSetAlpha(AjPStr* Pstr);
/* Removes all characters that are not alphabetic and are not in a given set. */
AjBool ajStrKeepSetAlphaC(AjPStr* Pstr, const char* txt);
The string trim functions below will remove region(s) of a given character composition (provided in the string txt) from the start and/or end of a string:
/* Remove from start of a string */
AjBool ajStrTrimStartC (AjPStr* Pstr, const char* txt);
/* Remove from end of a string */
AjBool ajStrTrimEndC (AjPStr* Pstr, const char* txt);
/* Remove from start and end of a string */
AjBool ajStrTrimC (AjPStr* pthis, const char* txt);
All characters will be removed from the start and/or end up to the first character that is not in the set provided.
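The rule "remove characters up to the first one not in the set" maps directly onto the standard C functions strspn and strchr. The sketch below is standalone and illustrative only; trimSet is a hypothetical name and nothing here is taken from the AJAX source.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-place trim: drop leading and trailing characters that
   occur in 'set', stopping at the first character not in the set. */
void trimSet(char *s, const char *set)
{
    size_t start;
    size_t len;

    assert(s != NULL && set != NULL);

    start = strspn(s, set);          /* length of the leading run in set */
    len = strlen(s + start);

    while (len > 0 && strchr(set, s[start + len - 1]) != NULL)
        len--;                       /* shrink past the trailing run */

    memmove(s, s + start, len);
    s[len] = '\0';
}
```

Characters from the set that occur in the interior of the string are left alone, which is what distinguishes trimming from the remove and keep functions described earlier.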
Similar functions are provided to remove regions composed of white space characters only from the start and end of a string.
/* Remove from start and end of a string. */
AjBool ajStrTrimWhite (AjPStr* Pstr);
/* Remove from start of a string. */
AjBool ajStrTrimWhiteStart (AjPStr* Pstr);
/* Remove from end of a string. */
AjBool ajStrTrimWhiteEnd (AjPStr* Pstr);
There are also two truncate functions which remove characters from the end of a string, reducing it to a defined length (ajStrTruncateLen) or cutting the end off a string at a defined position (ajStrTruncatePos):
AjBool ajStrTruncateLen (AjPStr* Pstr, ajuint len);
AjBool ajStrTruncatePos (AjPStr* Pstr, ajint pos);
The string substitution functions will perform substitutions of characters or substrings of a string with other characters/substrings.
Functions with the prefix ajStrExchange will replace all occurrences in a string of one substring (or character) with another string (or character). Variants of the function support string objects (AjPStr) and C-type (char *) strings for the target and replacement substrings:
/* C-type string target and replacement. */
AjBool ajStrExchangeCC (AjPStr* Pstr, const char* txt, const char* txtnew);
/* C-type string target, string replacement */
AjBool ajStrExchangeCS (AjPStr* Pstr, const char* txt, const AjPStr strnew);
/* Character target and replacement */
AjBool ajStrExchangeKK (AjPStr* Pstr, char chr, char chrnew);
/* String target, C-type string replacement */
AjBool ajStrExchangeSC (AjPStr* Pstr, const AjPStr str, const char* txtnew);
/* String target and replacement */
AjBool ajStrExchangeSS (AjPStr* Pstr, const AjPStr str, const AjPStr strnew);
Functions with the prefix ajStrExchangeSet are similar except that they replace all occurrences in a string of one set of characters with another character or set of characters. Variants of the function use string objects (AjPStr) and C-type (char *) strings to define the sets:
/* C-type string target and replacement sets */
AjBool ajStrExchangeSetCC (AjPStr* Pstr, const char* txt, const char* newc);
/* String target and replacement sets */
AjBool ajStrExchangeSetSS (AjPStr* Pstr, const AjPStr str, const AjPStr strnew);
/* Replace characters not in a C-type string set with a single character */
AjBool ajStrExchangeSetRestCK (AjPStr* Pstr, const char* txt, char chr);
/* Replace characters not in a string set with a single character */
AjBool ajStrExchangeSetRestSK (AjPStr* Pstr, const AjPStr str, char chr);
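Set-based exchange differs from substring exchange in that each character is mapped individually. A minimal standalone sketch of the idea, in the spirit of the Unix tr utility, is given below. The name exchangeSet and the requirement that the replacement set be at least as long as the target set are assumptions of this illustration, not documented AJAX behaviour.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-place helper: each character of s found at index i in
   'from' is replaced by to[i]. Assumes strlen(to) >= strlen(from). */
void exchangeSet(char *s, const char *from, const char *to)
{
    assert(s != NULL && from != NULL && to != NULL);
    assert(strlen(to) >= strlen(from));

    for (; *s; s++)
    {
        const char *hit = strchr(from, *s);

        if (hit != NULL)
            *s = to[hit - from];   /* map by position within the set */
    }
}
```

Mapping the set "Uu" onto "Tt", for example, converts an RNA string to DNA while preserving case, which a single substring exchange could not do in one pass.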
The string query functions test the properties of a string.
All functions with the prefix ajStrIs return ajTrue if some basic test of a string is satisfied. The following functions illustrate the scope of the query tests that can be performed but you should see the online documentation for a full list:
/* Alphanumeric characters only. */
AjBool ajStrIsAlnum (const AjPStr str);
/* Alphabetic characters only. */
AjBool ajStrIsAlpha (const AjPStr str);
/* Represents Boolean value. */
AjBool ajStrIsBool (const AjPStr str);
/* Represents integer value. */
AjBool ajStrIsInt (const AjPStr str);
/* Represents float value. */
AjBool ajStrIsFloat (const AjPStr str);
/* No uppercase alphabetic characters. */
AjBool ajStrIsLower (const AjPStr str);
/* Decimal digits only. */
AjBool ajStrIsNum (const AjPStr str);
/* Uppercase alphabetic characters only. */
AjBool ajStrIsUpper (const AjPStr str);
For convenience, macros are provided to retrieve the properties of a string including its length, the C-type (char *) string, the usage count and the current reserved size. These macros all return an element of the string C data structure:
#define MAJSTRGETLEN(str) str->Len /* String length */
#define MAJSTRGETPTR(str) str->Ptr /* String char * pointer */
#define MAJSTRGETRES(str) str->Res /* Reserved length */
#define MAJSTRGETUSE(str) str->Use /* Usage count */
Functions are available to return individual characters from a string. These include:
/* Get first character */
char ajStrGetCharFirst (const AjPStr str);
/* Get last character */
char ajStrGetCharLast (const AjPStr str);
/* Get character from specified position */
char ajStrGetCharPos (const AjPStr str, ajint pos);
A string may be converted to some other datatype using one of the following functions:
AjBool ajStrToBool (const AjPStr str, AjBool* Pval); /* Convert to boolean */
AjBool ajStrToDouble (const AjPStr str, double* Pval); /* Convert to double */
AjBool ajStrToFloat (const AjPStr str, float* Pval); /* Convert to float */
AjBool ajStrToHex (const AjPStr str, ajint* Pval); /* Convert hexadecimal string to integer */
AjBool ajStrToInt (const AjPStr str, ajint* Pval); /* Convert to integer */
AjBool ajStrToLong (const AjPStr thys, ajlong* result); /* Convert to long */
AjBool ajStrToUint (const AjPStr str, ajuint* Pval); /* Convert to unsigned integer */
In all cases, the functions return ajTrue if the conversion was performed successfully. They take the address of a variable of the appropriate type. For example, to convert a string to an integer value:
ajint val = 0;
AjPStr str = NULL;

str = ajStrNewC("10");

if(!ajStrToInt(str, &val))
    ajFatal("This error message will not be printed.");

ajStrDel(&str);
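The calling convention used by these conversion functions (a status return value plus an output parameter) can be reproduced with the standard C library. The sketch below is standalone; strToInt is a hypothetical name, and its decision to reject trailing non-numeric text and out-of-range values is a design choice of this example rather than a description of what ajStrToInt does.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical converter: returns 1 and writes *pval on success;
   returns 0 and leaves *pval untouched if txt is not a whole integer
   that fits in an int. */
int strToInt(const char *txt, int *pval)
{
    char *end = NULL;
    long v;

    assert(txt != NULL && pval != NULL);

    errno = 0;
    v = strtol(txt, &end, 10);

    if (end == txt || *end != '\0')    /* no digits, or trailing junk */
        return 0;
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;

    *pval = (int) v;

    return 1;
}
```

Returning the status separately from the value means a legitimate result of zero can be told apart from a failed conversion, which is not possible with atoi.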
Conversely, the C datatypes can be converted to an EMBOSS string using the following:
AjBool ajStrFromBool (AjPStr* Pstr, AjBool val); /* Convert from boolean */
AjBool ajStrFromDouble (AjPStr* Pstr, double val, ajint precision); /* Convert from double */
AjBool ajStrFromDoubleExp (AjPStr* Pstr, double val, ajint precision); /* Convert from double in exponential form. */
AjBool ajStrFromFloat (AjPStr* Pstr, float val, ajint precision); /* Convert from float */
AjBool ajStrFromInt (AjPStr* Pstr, ajint val); /* Convert from integer */
AjBool ajStrFromLong (AjPStr* Pstr, ajlong val); /* Convert from long */
AjBool ajStrFromUint (AjPStr* Pstr, ajuint val); /* Convert from unsigned integer */
Again, these functions return ajTrue if the conversion was performed successfully, and take the address of a string. For example, to convert an integer to a string:
ajint val = 0;
AjPStr str = NULL;

str = ajStrNew();
val = 100;

if(!ajStrFromInt(&str, val))
    ajFatal("This error message will not be printed.");

ajStrDel(&str);
Functions to reformat a string have the prefix ajStrFmt. For example, a string or region of a string can be converted to upper or lower case by using:
/* Convert to lower-case */
AjBool ajStrFmtLower (AjPStr* Pstr);
/* Convert region to lower-case */
AjBool ajStrFmtLowerSub (AjPStr* Pstr, ajint pos1, ajint pos2);
/* Convert to upper-case */
AjBool ajStrFmtUpper (AjPStr* Pstr);
/* Convert region to upper-case */
AjBool ajStrFmtUpperSub (AjPStr* Pstr, ajint pos1, ajint pos2);
The address of the string to be reformatted is passed and ajTrue is returned if the reformatting was successful. You should see the online documentation for other formatting functions.
EMBOSS provides comprehensive string comparison functions.
Functions with the prefix ajStrMatch compare one string with another. The functions perform case-sensitive and case-insensitive comparisons with or without wildcard characters. Variants that take a C-type (char *) string as the second argument are available but, apart from ajStrMatchC, are not shown:
/* Simple string to C-type string comparison */
AjBool ajStrMatchC (const AjPStr thys, const char* txt);
/* Simple string to string comparison */
AjBool ajStrMatchS (const AjPStr thys, const AjPStr str);
/* Case-insensitive string to string comparison */
AjBool ajStrMatchCaseS (const AjPStr thys, const AjPStr str);
/* String to string comparison with wildcards */
AjBool ajStrMatchWildS (const AjPStr thys, const AjPStr wild);
/* Case-insensitive string to string comparison with wildcards */
AjBool ajStrMatchWildCaseS (const AjPStr thys, const AjPStr wild);
The following functions will compare the first two words in a string:
/* String to C-type string comparison with wildcards. */
AjBool ajStrMatchWildWordC (const AjPStr str, const char* text);
/* String to string comparison with wildcards. */
AjBool ajStrMatchWildWordS (const AjPStr str, const AjPStr text);
/* Case-insensitive string to C-type string comparison with wildcards. */
AjBool ajStrMatchWildWordCaseC (const AjPStr str, const char* text);
/* Case-insensitive string to string comparison with wildcards. */
AjBool ajStrMatchWildWordCaseS (const AjPStr str, const AjPStr text);
Functions with the prefix ajStrPrefix or the prefix ajStrSuffix will compare the start or end of a string to the given prefix or suffix respectively. Variants that take a C-type (char *) string as the second argument are available but not shown:
/* Prefix comparison */
AjBool ajStrPrefixS (const AjPStr str, const AjPStr str2);
/* Case-insensitive prefix comparison */
AjBool ajStrPrefixCaseS (const AjPStr str, const AjPStr pref);
/* Suffix comparison */
AjBool ajStrSuffixS (const AjPStr thys, const AjPStr suff);
/* Case-insensitive suffix comparison */
AjBool ajStrSuffixCaseS (const AjPStr str, const AjPStr pref);
String search functions have the prefix ajStrFind and are used to find substrings or characters within strings:
/* Find a string */
ajint ajStrFindS (const AjPStr str, const AjPStr str2);
/* Find a character */
ajint ajStrFindAnyK (const AjPStr str, char chr);
/* Find any character in a set */
ajint ajStrFindAnyS (const AjPStr str, const AjPStr str2);
/* Find a string (case-insensitive) */
ajint ajStrFindCaseS (const AjPStr str, const AjPStr str2);
/* Find any character not in a set */
ajint ajStrFindRestS (const AjPStr str, const AjPStr str2);
/* Find any character not in a set (case-insensitive) */
ajint ajStrFindRestCaseS (const AjPStr str, const AjPStr str2);
/* Find last occurrence of a string */
ajint ajStrFindlastS (const AjPStr str, const AjPStr str2);
These functions return the position of the start of the search text in the string, or -1 if the text was not found.
ajStrFindAnyS, ajStrFindRestS and ajStrFindRestCaseS use a set of characters provided as a string (str2).
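The three kinds of search described above (a substring, any character in a set, any character not in a set) and the "position or -1" return convention can be sketched with the standard C functions strstr, strcspn and strspn. The functions findStr, findAny and findRest below are hypothetical standalone helpers, not AJAX calls.

```c
#include <assert.h>
#include <string.h>

/* Position of the first occurrence of 'sub' in 's', or -1 if absent. */
int findStr(const char *s, const char *sub)
{
    const char *hit;

    assert(s != NULL && sub != NULL);
    hit = strstr(s, sub);

    return hit ? (int) (hit - s) : -1;
}

/* Position of the first character of 's' that is in 'set', or -1. */
int findAny(const char *s, const char *set)
{
    size_t i;

    assert(s != NULL && set != NULL);
    i = strcspn(s, set);    /* length of the prefix with no set members */

    return s[i] ? (int) i : -1;
}

/* Position of the first character of 's' that is NOT in 'set', or -1. */
int findRest(const char *s, const char *set)
{
    size_t i;

    assert(s != NULL && set != NULL);
    i = strspn(s, set);     /* length of the prefix made only of set members */

    return s[i] ? (int) i : -1;
}
```

A "rest" search is a convenient validity check: searching a nucleotide string for the first character outside "ACGT" returns -1 exactly when the string is clean.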
Functions for parsing text tokens from strings have the prefix ajStrExtract or the prefix ajStrParse.
To extract the first word (Pword) and the remainder of the string (Prest) from an input string (str) use either of:
/* Remove first word (with no leading spaces) from a string */
AjBool ajStrExtractFirst (const AjPStr str, AjPStr* Prest, AjPStr* Pword);
/* Remove first word from a string, skipping spaces */
AjBool ajStrExtractWord (const AjPStr str, AjPStr* Prest, AjPStr* Pword);
ajStrExtractWord will skip any leading whitespace whereas ajStrExtractFirst will return ajFalse if the input string starts with a space. Like most of the string functions they will allocate memory for the strings if necessary, although it is cleaner to allocate the strings manually. In the example below, ajStrExtractFirst will return ajFalse and the printed strings will be empty, whereas ajStrExtractWord will extract the first word (First) and the rest of the string (word in this string is 'First') successfully:
AjPStr inputstring = NULL;
AjPStr word = NULL;
AjPStr rest = NULL;

inputstring = ajStrNewC(" First word in this string is 'First'");
word = ajStrNew();
rest = ajStrNew();

ajStrExtractFirst(inputstring, &rest, &word);
ajFmtPrint("word: %S\n", word); /* Empty */
ajFmtPrint("rest: %S\n", rest); /* Empty */

ajStrExtractWord(inputstring, &rest, &word);
ajFmtPrint("word: %S\n", word); /* First */
ajFmtPrint("rest: %S\n", rest); /* word in this string is 'First' */

ajStrDel(&inputstring);
ajStrDel(&word);
ajStrDel(&rest);
There is a function to split a newline-separated multi-line string into an array of strings:
ajuint ajStrParseSplit(const AjPStr str, AjPStr **PPstr);
The function allocates memory for an array of strings (which must be freed later) and returns the number of array elements created:
AjPStr inputstring = NULL;
AjPStr *array = NULL;
ajuint dim; /* Unsigned, to match the return type of ajStrParseSplit */
ajuint x;

inputstring = ajStrNewC("First line\nSecond line\nThird line\n");

dim = ajStrParseSplit(inputstring, &array);

for(x=0; x<dim; x++)
    ajFmtPrint("array[%u]: %S\n", x, array[x]);

ajStrDel(&inputstring);

/* Free each string, then the array itself */
for(x=0; x<dim; x++)
    ajStrDel(&array[x]);

AJFREE(array);
String iteration allows you to step through a string a single character at a time. The AJAX datatype for this is:
AjIStr
String iteration object.
To iterate through a string you must first instantiate the string iteration object. Two constructors are provided for forward (start to end) or reverse (end to start) iteration:
/* Constructor for forward iteration */
AjIStr ajStrIterNew (const AjPStr thys);
/* Constructor for reverse iteration */
AjIStr ajStrIterNewBack (const AjPStr thys);
To iterate through a string use either of the following functions. They return NULL if iteration cannot continue:
AjIStr ajStrIterNext (AjIStr iter);
AjIStr ajStrIterNextBack (AjIStr iter);
To retrieve the character or the remainder of the string at the current position use:
/* Retrieve the character */
char ajStrIterGetK (const AjIStr iter);
/* Retrieve the remainder of the string */
const char* ajStrIterGetC (const AjIStr iter);
To change the character at the current position use:
void ajStrIterPutK (AjIStr iter, char chr);
The following functions return ajTrue once iteration is complete (that is, when it cannot continue) and are also used to control iteration:
/* Test if forward iteration is complete */
AjBool ajStrIterDone (const AjIStr iter);
/* Test if reverse iteration is complete */
AjBool ajStrIterDoneBack (const AjIStr iter);
A string iterator can be reset so that it points to the start or end of the string:
/* Reset forward iteration to start of string */
void ajStrIterBegin (AjIStr iter);
/* Reset reverse iteration to end of string */
void ajStrIterEnd (AjIStr iter);
Once you are done, you must free the string iteration object:
/* Destructor for iteration object */
void ajStrIterDel (AjIStr *iter);
The example code below uses the iteration functions to iterate through a string, replacing all dash characters with full stops and writing a new string:
AjPStr str = NULL;
AjPStr seq = NULL;
AjIStr iter = NULL;
char chr;

str = ajStrNewC("--AALIY---TIWLASL--");
seq = ajStrNew();
iter = ajStrIterNew(str);

while(ajStrIterNext(iter))
{
    if((chr = ajStrIterGetK(iter)) == '-')
        ajStrIterPutK(iter, '.');
    else
        ajStrAppendK(&seq, chr);
}

ajFmtPrint("str: %S\n", str);
ajFmtPrint("seq: %S\n", seq);

ajStrDel(&str);
ajStrDel(&seq);
ajStrIterDel(&iter);
There is a dedicated AJAX datatype for string tokenisation:
AjPStrTok
String tokenisation object.
String tokenisation functions have the prefix ajStrToken and are used to delimit a string into text tokens and extract them.
To tokenise a string you must first instantiate the string tokenisation object. The constructors provided take the delimiter characters either as a C-type (char *) string or as an EMBOSS string (AjPStr) and return a pointer to a string tokenisation object:
AjPStrTok ajStrTokenNewC (const AjPStr str, const char* txtdelim);
AjPStrTok ajStrTokenNewS (const AjPStr str, const AjPStr strdelim);
The string to be tokenised and a delimiter string can also be set for an existing string tokenisation object:
/* Specify string to be tokenised only */
AjBool ajStrTokenAssign (AjPStrTok* Ptoken, const AjPStr str);
/* Specify string and delimiters */
AjBool ajStrTokenAssignC (AjPStrTok* Ptoken, const AjPStr str, const char* txtdelim);
/* Specify string and delimiters */
AjBool ajStrTokenAssignS (AjPStrTok* Ptoken, const AjPStr str, const AjPStr strdelim);
These functions will allocate the string tokenisation object if necessary and can therefore be used as an alternative to the constructor function. It is, however, much clearer if they are only used to update an object that has been created using the standard constructors.
To parse individual tokens from a string call:
AjBool ajStrTokenNextFind (AjPStrTok* Ptoken, AjPStr* Pstr);
AjBool ajStrTokenNextFindC (AjPStrTok* Ptoken, const char* strdelim, AjPStr* Pstr);
ajStrTokenNextFindC will update the string tokenisation object with the string of delimiters (strdelim).
To return the remainder of a string that's been partially parsed call:
AjBool ajStrTokenRestParse (AjPStrTok* Ptoken, AjPStr* Pstr);
If you want the delimiter to be treated as a string rather than individual characters, in other words to tokenise the string using another string, use the following to parse individual tokens from the string:
AjBool ajStrTokenNextParse (AjPStrTok* Ptoken, AjPStr* Pstr);
AjBool ajStrTokenNextParseC (AjPStrTok* Ptoken, const char* txtdelim, AjPStr* Pstr);
AjBool ajStrTokenNextParseS (AjPStrTok* Ptoken, const AjPStr strdelim, AjPStr* Pstr);
These functions return ajTrue if the token was parsed successfully or ajFalse otherwise. ajStrTokenNextParseC and ajStrTokenNextParseS will update the string tokenisation object with the string of delimiters (txtdelim or strdelim) provided. Note that these functions can return ajTrue but write an empty token (Pstr) in cases where the delimiter has been changed since the previous call.
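The distinction between a delimiter treated as a set of characters and a delimiter treated as a single string can be illustrated in standard C, independently of AJAX. The sketch below is not the library implementation: copy_token, next_by_set and next_by_string are hypothetical helpers written only to show the two behaviours, using strspn/strcspn for the character-set case and strstr for the whole-string case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy at most outsize-1 characters of a token and terminate the copy. */
void copy_token(char *out, size_t outsize, const char *start, size_t len)
{
    size_t n = (len < outsize) ? len : outsize - 1;

    memcpy(out, start, n);
    out[n] = '\0';
}

/* Hypothetical helper: any single character in delims ends a token.
 * Runs of delimiter characters are skipped. Returns 1 if a token was read. */
int next_by_set(const char **cursor, const char *delims,
                char *out, size_t outsize)
{
    const char *p;
    size_t len;

    assert(cursor && *cursor && delims && out && outsize > 0);

    p = *cursor + strspn(*cursor, delims);  /* skip leading delimiters */
    if (*p == '\0')
        return 0;

    len = strcspn(p, delims);               /* token ends at next delimiter */
    copy_token(out, outsize, p, len);
    *cursor = p + len;
    return 1;
}

/* Hypothetical helper: the whole of delim must match to end a token.
 * Returns 1 if a token was read. */
int next_by_string(const char **cursor, const char *delim,
                   char *out, size_t outsize)
{
    const char *p;
    const char *hit;
    size_t len;

    assert(cursor && *cursor && delim && *delim && out && outsize > 0);

    p = *cursor;
    if (*p == '\0')
        return 0;

    hit = strstr(p, delim);                 /* find the complete delimiter */
    len = hit ? (size_t)(hit - p) : strlen(p);
    copy_token(out, outsize, p, len);
    *cursor = hit ? hit + strlen(delim) : p + len;
    return 1;
}
```

For the input "a::b:c" and the delimiter "::", next_by_set yields the tokens a, b and c, whereas next_by_string yields a and b:c.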
The string tokenisation object can be reset (all strings cleared) so that it is ready for re-use by calling:
void ajStrTokenReset (AjPStrTok* Ptoken);
Once you are done you must free the string tokenisation object:
void ajStrTokenDel (AjPStrTok* Ptoken);
The example code below uses the string tokenisation object and its functions to retrieve individual lines from a string that contains multiple newline characters:
AjPStr inputstring = NULL;
AjPStr token = NULL;
AjPStrTok tokens = NULL;
inputstring = ajStrNewC("First line\nSecond line\nThird line\n");
tokens = ajStrTokenNewC(inputstring, "\n");
while(ajStrTokenNextFind(&tokens, &token))
    ajFmtPrint("token: %S\n", token);
ajStrTokenDel(&tokens);
/* The token string is allocated by the parse call and must also be freed */
ajStrDel(&token);
ajStrDel(&inputstring);
A string (str or txt) may be tokenised with either whitespace characters or a specified set of delimiters given as a text string (txtdelim):
/* Tokenise C-type string by set of delimiters */
AjPStr ajCharParseC (const char* txt, const char* delim);
/* Tokenise by set of delimiters */
const AjPStr ajStrParseC (const AjPStr str, const char* txtdelim);
/* Tokenise by whitespace */
const AjPStr ajStrParseWhite (const AjPStr str);
These functions use the C strtok function and return tokens from the string. The first time the function is called it is passed the string to be parsed. For subsequent calls on the same string it is passed NULL as the first argument. A pointer to the token is returned or NULL when all tokens have been parsed:
AjPStr inputstring = NULL;
const AjPStr token = NULL;
inputstring = ajStrNewC(" First word in this string is 'First'");
token = ajStrParseWhite(inputstring);
/* Prints 'First' to the screen */
ajFmtPrint("token: %S\n", token);
/* Prints the rest of the words, one word at a time */
while((token = ajStrParseWhite(NULL)))
    ajFmtPrint("token: %S\n", token);
ajStrDel(&inputstring);
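The calling convention described above is inherited from the C strtok function, and it may help to see it in plain C. The following is a minimal sketch, not AJAX code; split_tokens is a hypothetical helper that collects pointers to the tokens found in a writable buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split a writable buffer on any character in delims
 * and store up to max token pointers. Returns the number of tokens stored.
 * strtok overwrites delimiters in the buffer with '\0'. */
size_t split_tokens(char *buffer, const char *delims,
                    char **tokens, size_t max)
{
    size_t n = 0;
    char *tok;

    assert(buffer && delims && tokens);

    /* The first call names the buffer; later calls pass NULL to continue. */
    for (tok = strtok(buffer, delims);
         tok != NULL && n < max;
         tok = strtok(NULL, delims))
        tokens[n++] = tok;

    return n;
}
```

Because strtok keeps its position in hidden static state, only one string can be parsed at a time with this pattern. Where two strings must be parsed in an interleaved fashion, a string tokenisation object, which carries its own state, is likely to be the safer choice.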
To count the tokens in a string in which the tokens are delimited by either whitespace characters or a specified set of delimiters (strdelim) use:
/* Count tokens delimited by whitespace */
ajuint ajStrParseCount (const AjPStr line);
/* Count tokens delimited by set of delimiters */
ajuint ajStrParseCountS (const AjPStr line, const AjPStr strdelim);
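Counting tokens does not require the string to be modified. The sketch below shows one way to do it in standard C; count_tokens is a hypothetical helper, and the assumption that a run of adjacent delimiters separates only one pair of tokens is a choice made for this example rather than a statement about the AJAX functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count tokens separated by runs of characters
 * taken from delims. The input string is not modified. */
size_t count_tokens(const char *line, const char *delims)
{
    size_t count = 0;

    assert(line && delims);

    while (*line != '\0') {
        line += strspn(line, delims);   /* skip any delimiters */
        if (*line == '\0')
            break;
        count++;
        line += strcspn(line, delims);  /* skip over the token itself */
    }

    return count;
}
```

Passing " \t\n" as the delimiter set gives a whitespace count, while a set such as ",;" counts fields in a simple delimited record.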
For convenience, several groups of functions are provided for handling C-type (char *) strings. They all have the prefix ajChar to distinguish them from the other functions.
In the same way as for EMBOSS strings, a C-type string can be created from an existing C-type (char *) string or a string object, with or without a reserved size:
/* Create a string from a C-type string */
char* ajCharNewC (const char* txt);
/* Create a string from a string object */
char* ajCharNewS (const AjPStr thys);
/* Create an empty string of reserved size. */
char* ajCharNewRes(ajuint size);
/* Create a string with reserved size from a C-type string */
char* ajCharNewResC(const char* txt, ajuint size);
/* Create a string with reserved size from a string object */
char* ajCharNewResS(const AjPStr str, ajuint size);
/* Create a string from a C-type string with specified length */
char* ajCharNewResLenC(const char* txt, ajuint size, ajuint len);
In all cases a pointer to the allocated memory is returned, which must be freed once you are done with it. To delete a C-type string call:
void ajCharDel (char** Ptxt);
For example:
char *string = NULL;
string = ajCharNewC("This is a text string");
ajCharDel(&string);
Most of the string comparison functions (Section 6.5.17, “String Comparison Functions”) are available for C-type strings too:
AjBool ajCharMatchC (const char* txt1, const char* txt2);
AjBool ajCharMatchCaseC (const char* txt1, const char* txt2);
AjBool ajCharMatchWildC (const char* txt1, const char* txt2);
AjBool ajCharMatchWildCaseC (const char* txt1, const char* txt2);
AjBool ajCharMatchWildNextC (const char* txt1, const char* txt2);
AjBool ajCharMatchWildWordC (const char* str, const char* txt);
AjBool ajCharMatchWildNextCaseC (const char* txt1, const char* txt2);
AjBool ajCharMatchWildWordCaseC (const char* str, const char* txt);
AjBool ajCharPrefixC (const char* txt, const char* pref);
AjBool ajCharPrefixCaseC (const char* txt, const char* pref);
AjBool ajCharSuffixC (const char* txt, const char* suff);
AjBool ajCharSuffixCaseC (const char* txt, const char* suff);
Variants of these functions that take a string object as the second argument are available (not shown) and have the suffix S rather than C.
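To make the idea of prefix and suffix tests on C-type strings concrete, the sketch below implements two such tests in standard C. It is illustrative only and makes no claim about how the AJAX functions are written; has_prefix_nocase and has_suffix are hypothetical names, and treating an empty prefix or suffix as a match is an assumption of this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 1 if txt starts with pref, ignoring case. */
int has_prefix_nocase(const char *txt, const char *pref)
{
    assert(txt && pref);

    while (*pref != '\0') {
        /* Cast to unsigned char before calling tolower, as C requires.
         * A txt shorter than pref fails here on its terminating '\0'. */
        if (tolower((unsigned char)*txt) != tolower((unsigned char)*pref))
            return 0;
        txt++;
        pref++;
    }

    return 1;
}

/* Hypothetical helper: 1 if txt ends with suff (case-sensitive). */
int has_suffix(const char *txt, const char *suff)
{
    size_t tlen;
    size_t slen;

    assert(txt && suff);

    tlen = strlen(txt);
    slen = strlen(suff);
    if (slen > tlen)
        return 0;

    return strcmp(txt + (tlen - slen), suff) == 0;
}
```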
Functions for formatting and printing a string are defined in the library file ajfmt.h/c. Conversion characters are defined for all the EMBOSS fundamental datatypes (Section 5.1, “Basic Datatypes”) and are an extension of the basic C conversion codes. They are:
An AJAX string object AjPStr is printed by %S.
A C-type (char*) string is printed by %s. If the pointer is NULL then <null> is printed.
The filename of an AJAX file object AjPFile is printed by %F.
A boolean variable AjBool is printed by %b (for output as "Y/N") or %B ("Yes/No").
The date and time (AjPDate) is printed by %D.
Functions with the prefix ajFmtScan are equivalent to the C *scan* functions:
ajint ajFmtScanS (const AjPStr str, const char* fmt, ...);
ajint ajFmtScanC (const char* txt, const char* fmt, ...);
ajint ajFmtScanF (AjPFile thys, const char* fmt, ...);
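Since these functions are described as equivalents of the C scan family, the usual C idiom of checking the number of converted items is a reasonable model for how they might be used. The sketch below uses sscanf from the standard library rather than AJAX; the Region type and parse_region are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record: a named region with start and end positions. */
typedef struct {
    char name[32];
    int  start;
    int  end;
} Region;

/* Hypothetical helper: parse "name start end" from a line.
 * Returns 1 only if all three fields were converted. */
int parse_region(const char *line, Region *region)
{
    assert(line && region);

    /* %31s bounds the write into name[32]; sscanf returns the number
     * of items successfully converted. */
    return sscanf(line, "%31s %d %d",
                  region->name, &region->start, &region->end) == 3;
}
```

Comparing the return value with the number of expected conversions catches short or malformed lines before any of the parsed fields are used.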
Functions with the prefix ajFmtPrint are equivalent to the C *print* functions:
/* format and emit the "..." arguments according to fmt; writes to stdout */
void ajFmtPrint (const char *fmt, ...);
/* format and emit the "..." arguments according to fmt; writes to a file object */
void ajFmtPrintF (AjPFile file, const char *fmt, ...);
/* format and emit the "..." arguments according to fmt; writes to C FILE stream */
void ajFmtPrintFp (FILE *stream, const char *fmt, ...);
/* formats the "..." arguments into a buffer with a maximum size according to fmt */
ajint ajFmtPrintCL (char *buf, ajint size, const char *fmt, ...);
/* Block and print a string. String is split at given delimiters */
void ajFmtPrintSplit(AjPFile outf, const AjPStr str, const char *prefix, ajint len, const char *delim);
/* Formats the "..." arguments into an AjPStr according to fmt */
AjPStr ajFmtPrintS (AjPStr *pthis, const char *fmt, ...);
/* Formats the "..." arguments and appends to an AjPStr according to fmt */
AjPStr ajFmtPrintAppS (AjPStr *pthis, const char *fmt, ...);
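Formatting into a fixed-size buffer, as ajFmtPrintCL is described as doing, has a close analogue in the standard C vsnprintf function. The sketch below is standard C, not AJAX, and only illustrates the idea of a size-limited formatted write; format_bounded is a hypothetical helper, and it handles only the standard C conversion codes, not the EMBOSS extensions such as %S.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format into buf without writing more than size
 * bytes. Returns 1 if the whole result fitted, 0 if it was truncated
 * or a formatting error occurred. */
int format_bounded(char *buf, size_t size, const char *fmt, ...)
{
    va_list args;
    int needed;

    assert(buf && size > 0 && fmt);

    va_start(args, fmt);
    /* vsnprintf always terminates buf and returns the length the full
     * result would have had, excluding the terminating '\0'. */
    needed = vsnprintf(buf, size, fmt, args);
    va_end(args);

    return needed >= 0 && (size_t)needed < size;
}
```

A dynamic AjPStr avoids the truncation case altogether, which is one reason the string-based print functions are often more convenient than a fixed buffer.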