6.12. Handling Phylogenetic Data

6.12.1. Introduction

The EMBASSY phylipnew package includes various applications for phylogenetic analysis.

A set of phylogenetic datatypes is available to replicate the input data types for PHYLIP, with automatic detection of the data formats (for example, distance matrix files). These are shown in the table (Table 6.19, “Phylogenetic datatypes”).

Table 6.19. Phylogenetic datatypes

AJAX datatype                        ACD datatype (for reading)   ACD datatype (for writing)
AjPPhyloDist distance matrix data    distances                    outdistance
AjPPhyloFreq frequency data          frequencies                  outfreq
AjPPhyloProp properties data         properties                   outproperties
AjPPhyloState state data             discretestates               outdiscrete
AjPPhyloTree phylogenetic tree data  tree                         outtree
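
To make the distance matrix input concrete, the following is a minimal sketch in plain C of reading a PHYLIP-style square distance matrix from text: a first line giving the number of taxa, then one row per taxon holding a name followed by that many distances. It is not the AJAX implementation and does not use AjPPhyloDist; the DistMatrix type and the dist_parse and dist_free functions are hypothetical names introduced only for illustration. The sketch also assumes whitespace-delimited names of at most 31 characters, which is more relaxed than the fixed-width names of strict PHYLIP files.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TAXON_NAME_LEN 32

/* Hypothetical stand-in for a distance matrix object (not AjPPhyloDist). */
typedef struct {
    int size;                        /* number of taxa */
    char (*names)[TAXON_NAME_LEN];   /* one name per taxon */
    double *data;                    /* size*size distances, row-major */
} DistMatrix;

/* Free the matrix and reset the caller's pointer to NULL. */
void dist_free(DistMatrix **pm)
{
    if (!pm || !*pm)
        return;
    free((*pm)->names);
    free((*pm)->data);
    free(*pm);
    *pm = NULL;
}

/* Parse a square matrix from text. Returns NULL on malformed input. */
DistMatrix *dist_parse(const char *text)
{
    int n = 0, used = 0, i, j;
    const char *p = text;
    DistMatrix *m;

    assert(text != NULL);

    /* First token: the number of taxa. */
    if (sscanf(p, "%d%n", &n, &used) != 1 || n < 1)
        return NULL;
    p += used;

    m = calloc(1, sizeof *m);
    if (!m)
        return NULL;
    m->size = n;
    m->names = calloc((size_t)n, sizeof *m->names);
    m->data = calloc((size_t)n * (size_t)n, sizeof *m->data);
    if (!m->names || !m->data) {
        dist_free(&m);
        return NULL;
    }

    /* Each row: a taxon name followed by n distances. */
    for (i = 0; i < n; i++) {
        if (sscanf(p, "%31s%n", m->names[i], &used) != 1) {
            dist_free(&m);
            return NULL;
        }
        p += used;
        for (j = 0; j < n; j++) {
            if (sscanf(p, "%lf%n", &m->data[i * n + j], &used) != 1) {
                dist_free(&m);
                return NULL;
            }
            p += used;
        }
    }
    return m;
}
```

A caller would pass the whole file contents as one string, read entry (i, j) as m->data[i * m->size + j], and release the matrix with dist_free(&m). In an EMBOSS application this work is done for you during ACD processing.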

6.12.2. AJAX Library Files

AJAX library files for handling phylogenetic data are listed in the table (Table 6.20, “AJAX Library Files for Handling Phylogenetic Data”). Library file documentation, including a complete description of datatypes and functions, is available at:

http://emboss.open-bio.org/rel/dev/libs/

Table 6.20. AJAX Library Files for Handling Phylogenetic Data

Library File Documentation   Description
ajphylo                      Data structures and functions for handling the phylipnew applications.
ajnexus                      Data structures and functions for parsing the NEXUS file format.

ajphylo.h/c: Defines the objects and functions for handling phylogenetic data. These include:

  • Phylogeny distance matrix object (AjPPhyloDist)

  • Phylogeny frequencies object (AjPPhyloFreq)

  • Phylogeny properties object (AjPPhyloProp)

  • Phylogeny discrete state data object (AjPPhyloState)

  • Phylogeny tree object (AjPPhyloTree).

They also include static functions for handling phylogenetic data at a low level. You are unlikely to need these unless you plan to extend the phylogeny handling code.

ajnexus.h/c: Functions and objects (including static data structures and functions) for parsing the NEXUS file format. You are unlikely to need this library file. See the online library documentation for further information.

6.12.3. AJAX Datatypes

For handling phylogenetic data input files defined in the ACD file use:

AjPPhyloState*

Phylogeny discrete state data object (for discretestates ACD datatypes).

AjPPhyloDist

Phylogeny distance matrix object (for distances ACD datatype).

AjPPhyloFreq

Phylogeny frequencies object (for frequencies ACD datatype).

AjPPhyloProp

Phylogeny properties object (for properties ACD datatype).

AjPPhyloTree*

Phylogeny tree object (for tree ACD datatype).

For handling phylogenetic data output files defined in the ACD file use:

AjPOutfile

Output file (for all phylogenetic output ACD datatypes).

6.12.4. ACD Datatypes

The datatypes for handling phylogenetic data input are:

discretestates

Discrete states file.

distances

Distance matrix.

frequencies

Frequency value(s).

properties

Property value(s).

tree

Phylogenetic tree.

The datatypes for handling phylogenetic data output are:

outdiscrete

Output file for phylogenetics discrete characteristics data.

outdistance

Output file for phylogenetics distance matrix data.

outfreq

Output file for phylogenetics character frequency data.

outproperties

Output file for phylogenetics property data.

outtree

Output file for phylogenetic tree data.

6.12.5. ACD Data Definition

Typical ACD definitions for phylogenetic data input and output are shown below.

6.12.5.1. discretestates

Input of discrete states data:

discretestates: discretestatesfile
[
    parameter: "Y"
    characters: "01PB?"
    knowntype: "discrete states"
    information: "Phylip discrete states file"
]

6.12.5.2. distances

Input of distances data:

distances: distancesfile 
[
    parameter: "Y"
    knowntype: "distance matrix"
    information: "Phylip distance matrix file"
]

6.12.5.3. frequencies

Input of frequencies data:

frequencies: frequenciesfile 
[
    parameter: "Y"
]

6.12.5.4. properties

Input of properties data:

properties: propertiesfile 
[
    characters: "01"
    length: "$(infile.discretelength)"
    knowntype: "ancestral states"
    information: "Phylip ancestral states file"
]

6.12.5.5. tree

Input of tree data:

tree: treefile 
[
    parameter: "Y"
    knowntype: "newick"
    information: "Phylip tree file (optional)"
]

6.12.5.6. outdiscrete

Output of discrete states data:

outdiscrete: outdiscretefile 
[
    parameter: "Y"
]

6.12.5.7. outdistance

Output of distances data:

outdistance: outdistancefile
[
    parameter: "Y"
]

6.12.5.8. outfreq

Output of frequencies data:

outfreq: outfreqfile
[
    parameter: "Y"
]

6.12.5.9. outproperties

Output of properties data:

outproperties: outpropertiesfile
[
    parameter: "Y"
]

6.12.5.10. outtree

Output of tree data:

outtree: outtreefile
[
    parameter: "Y"
]

6.12.5.11. Parameter Name

All data definitions for phylogenetic data input and output should have intuitive names. There are some general guidelines but currently no specific naming rules are enforced. See Appendix A, ACD Syntax Reference.

6.12.5.12. Common Attributes

Attributes that are typically specified are summarised below. They are datatype-specific (Section A.5, “Datatype-specific Attributes”) unless they are indicated as being global attributes (Section A.4, “Global Attributes”).

parameter: If the phylogenetic data is the primary input or output of an EMBOSS application then it should be defined as a parameter by using the global attribute parameter: "Y".

characters: Specifies the allowed discrete state or property characters for a discretestates or properties object respectively.

knowntype: This global attribute is typically specified for all the phylogenetic input and output types.

information: A global attribute used for the user prompt and in the application documentation.

length: Specifies the number of property values per set (properties datatype) or the number of frequency loci / values per set (frequencies datatype).

size: Specifies the number of discrete state sets (discretestates datatype), the number of frequency sets (frequencies datatype) or the number of trees (tree datatype).

Note

Various calculated attributes (Section A.6, “Calculated Attributes”) of the datatypes are available at the level of the ACD file.
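
As an illustration of what the characters and length attributes constrain, the check below tests that a set of property values has the expected number of values and uses only the allowed characters. This is a sketch in plain C under stated assumptions, not a description of how AJAX validates its input: the function property_set_valid is a hypothetical name, and each value is assumed to be a single character.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of AJAX: returns 1 if values has exactly
 * length characters and each one appears in allowed, otherwise 0. */
int property_set_valid(const char *values, const char *allowed, size_t length)
{
    assert(values != NULL && allowed != NULL);

    if (strlen(values) != length)
        return 0;

    /* strspn gives the length of the leading run drawn from allowed;
     * the set is valid only if that run covers the whole string. */
    return strspn(values, allowed) == length;
}
```

For example, with characters: "01" and a length of 4, the set "0110" passes, while "0120" (disallowed character) and "011" (wrong length) are rejected.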

6.12.6. ACD File Handling

Datatypes and functions for handling phylogenetic data via the ACD file are shown below (Table 6.21, “Datatypes and Functions for Phylogenetic Data Input and Output”).

Table 6.21. Datatypes and Functions for Phylogenetic Data Input and Output
ACD datatype     AJAX datatype    To retrieve from ACD

Phylogenetic Data Input
discretestates   AjPPhyloState*   ajAcdGetDiscretestates
distances        AjPPhyloDist     ajAcdGetDistances
frequencies      AjPPhyloFreq     ajAcdGetFrequencies
properties       AjPPhyloProp     ajAcdGetProperties
tree             AjPPhyloTree*    ajAcdGetTree

Phylogenetic Data Output
outdiscrete      AjPOutfile       ajAcdGetOutdiscrete
outdistance      AjPOutfile       ajAcdGetOutdistance
outfreq          AjPOutfile       ajAcdGetOutfreq
outproperties    AjPOutfile       ajAcdGetOutproperties
outtree          AjPOutfile       ajAcdGetOuttree

Your application code will call embInit to process the ACD file and command line (see Section 6.3, “Handling ACD Files”). All values from the ACD file are read into memory and files are opened as necessary. You access the files and memory through the ajAcdGet* family of functions, which return pointers to the appropriate objects.

6.12.6.1. Phylogenetic Data Retrieval

6.12.6.1.1. Input Phylogenetic Data

To retrieve input phylogenetic data an object pointer is declared and then initialised using the appropriate ajAcdGet* function.

6.12.6.1.1.1. discretestates

    AjPPhyloState *data=NULL;

    data = ajAcdGetDiscretestates("discretestatesfile");

6.12.6.1.1.2. distances

    AjPPhyloDist data=NULL;

    data = ajAcdGetDistances("distancesfile");

6.12.6.1.1.3. frequencies

    AjPPhyloFreq data=NULL;

    data = ajAcdGetFrequencies("frequenciesfile");

6.12.6.1.1.4. properties

    AjPPhyloProp data=NULL;

    data = ajAcdGetProperties("propertiesfile");

6.12.6.1.1.5. tree

    AjPPhyloTree* data=NULL;

    data = ajAcdGetTree("treefile");

6.12.6.1.2. Output Phylogenetic Data

To retrieve an output phylogenetic data stream an object pointer is declared and initialised using the appropriate ajAcdGet* function.

6.12.6.1.2.1. outdiscrete

    AjPOutfile outfile=NULL;

    outfile = ajAcdGetOutdiscrete("outdiscretefile");

6.12.6.1.2.2. outdistance

    AjPOutfile outfile=NULL;

    outfile = ajAcdGetOutdistance("outdistancefile");

6.12.6.1.2.3. outfreq

    AjPOutfile outfile=NULL;

    outfile = ajAcdGetOutfreq("outfreqfile");

6.12.6.1.2.4. outproperties

    AjPOutfile outfile=NULL;

    outfile = ajAcdGetOutproperties("outpropertiesfile");

6.12.6.1.2.5. outtree

    AjPOutfile outfile=NULL;

    outfile = ajAcdGetOuttree("outtreefile");

6.12.6.1.3. Alternative ACD Retrieval Functions

There are functions to retrieve a single (the first) state or tree object from file:

AjPPhyloState  ajAcdGetDiscretestatesSingle (const char *token);
AjPPhyloTree   ajAcdGetTreeSingle (const char *token);

Where these are used, it is still necessary to call the appropriate destructor function (see below) to ensure that the array of state or tree objects allocated during ACD file processing is freed.

6.12.6.2. Processing Command line Options and ACD Attributes

Currently there are no functions for this.

6.12.6.3. Memory and File Management

It is your responsibility to close any files and free up memory at the end of the program.

6.12.6.3.1. Closing Output Phylogenetic Data Files

To close an output phylogenetic data stream call ajOutfileClose with the address of the output file:

ajOutfileClose(&outfile);

6.12.6.3.2. Freeing Memory

You must call the appropriate destructor function (see below) on any phylogenetic data objects returned by calls to ajAcdGet*.

Additionally, you must call ajPhyloExit to free any memory allocated internally for housekeeping:

void  ajPhyloExit(void);

6.12.7. Phylogenetic Object Memory Management

6.12.7.1. Default Object Construction

To use a phylogenetic data object that is not defined in the ACD file you must first instantiate the appropriate object pointer. The default constructor functions are:

AjPPhyloDist   ajPhyloDistNew (void);
AjPPhyloFreq   ajPhyloFreqNew (void);
AjPPhyloProp   ajPhyloPropNew (void);
AjPPhyloState  ajPhyloStateNew (void);
AjPPhyloTree   ajPhyloTreeNew (void);

6.12.7.2. Default Object Destruction

You must free the memory for an object once you are finished with it. The default destructor functions are:

void  ajPhyloDistDel (AjPPhyloDist* pthis);
void  ajPhyloFreqDel (AjPPhyloFreq* pthis);
void  ajPhyloPropDel (AjPPhyloProp* pthis);
void  ajPhyloStateDel (AjPPhyloState* pthis);
void  ajPhyloTreeDel (AjPPhyloTree* pthis);

The default constructor and destructor functions are used as follows:

    AjPPhyloDist   dist  = NULL; 
    AjPPhyloFreq   freq  = NULL; 
    AjPPhyloProp   prop  = NULL; 
    AjPPhyloState  state = NULL; 
    AjPPhyloTree   tree  = NULL; 

    /* Call constructor functions */
    dist  = ajPhyloDistNew(); 
    freq  = ajPhyloFreqNew(); 
    prop  = ajPhyloPropNew(); 
    state = ajPhyloStateNew(); 
    tree  = ajPhyloTreeNew();

    /* Do something with instantiated objects */
    ...

    /* Call destructor functions */
    ajPhyloDistDel (&dist);
    ajPhyloFreqDel (&freq);
    ajPhyloPropDel (&prop);
    ajPhyloStateDel (&state);
    ajPhyloTreeDel (&tree);

There are two alternative destructor functions used to free arrays of state and tree objects:

void  ajPhyloStateDelarray(AjPPhyloState** pthis);
void  ajPhyloTreeDelarray(AjPPhyloTree** pthis);

They are used for state and tree objects instead of the default destructor to free memory from ACD file processing:

    AjPPhyloState* states = NULL;
    AjPPhyloTree*  trees  = NULL;

    states = ajAcdGetDiscretestates("discretestatesfile");
    trees  = ajAcdGetTree("treefile");

    /* Do something with objects */

    ajPhyloStateDelarray(&states);
    ajPhyloTreeDelarray(&trees);
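
The difference between the single-object and array destructors can be sketched in plain C. The example below uses a hypothetical Tree type and hypothetical functions (tree_new, tree_del, tree_array_len, tree_delarray) that stand in for the AJAX objects; none of them are AJAX API. It also assumes the array is terminated by a NULL pointer, which is an assumption of this sketch and should be checked against the library documentation. What it does show is the convention visible in the prototypes above: each destructor takes the address of the caller's pointer so that it can reset that pointer to NULL after freeing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a tree object (not AjPPhyloTree). */
typedef struct {
    char *newick;   /* tree description, owned by the object */
} Tree;

/* Constructor: copies the description. Returns NULL if allocation fails. */
Tree *tree_new(const char *newick)
{
    Tree *t;

    assert(newick != NULL);

    t = malloc(sizeof *t);
    if (!t)
        return NULL;
    t->newick = malloc(strlen(newick) + 1);
    if (!t->newick) {
        free(t);
        return NULL;
    }
    strcpy(t->newick, newick);
    return t;
}

/* Single-object destructor: frees the object and NULLs the caller's pointer. */
void tree_del(Tree **pt)
{
    if (!pt || !*pt)
        return;
    free((*pt)->newick);
    free(*pt);
    *pt = NULL;
}

/* Count the objects in a NULL-terminated array. */
size_t tree_array_len(Tree *const *trees)
{
    size_t n = 0;

    if (!trees)
        return 0;
    while (trees[n])
        n++;
    return n;
}

/* Array destructor: frees every element, then the array itself. */
void tree_delarray(Tree ***ptrees)
{
    size_t i;

    if (!ptrees || !*ptrees)
        return;
    for (i = 0; (*ptrees)[i]; i++)
        tree_del(&(*ptrees)[i]);
    free(*ptrees);
    *ptrees = NULL;
}
```

Calling only the single-object destructor on an element would leave the array and its remaining elements allocated, which is why a separate array destructor is needed for objects obtained from ACD processing.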

6.12.7.3. Alternative Object Construction and Loading

6.12.7.3.1. Phylogenetic data input

Currently there are no functions for this.

6.12.7.3.2. Phylogenetic data output

Currently there are no functions for this.

6.12.8. Reading Phylogenetic Data from File

The functions for this are:

AjPPhyloDist*   ajPhyloDistRead (const AjPStr filename, ajint size, AjBool missing);
AjPPhyloFreq    ajPhyloFreqRead (const AjPStr filename, AjBool contchar, AjBool genedata, AjBool indiv);
AjPPhyloProp    ajPhyloPropRead (const AjPStr filename, const AjPStr propchars, ajint len, ajint size);
AjPPhyloState*  ajPhyloStateRead (const AjPStr filename, const AjPStr statechars);
AjPPhyloTree*   ajPhyloTreeRead (const AjPStr filename, ajint size);

They are provided in case phylogenetic data needs to be processed outside the context of ACD file processing. See the online library documentation for further information.

6.12.9. Getting Elements of Phylogenetic Objects

Currently there is a single function for this. It returns the size of a properties object:

ajint  ajPhyloPropGetSize (const AjPPhyloProp thys);

6.12.10. Debug Functions

The debug functions report the elements of each phylogenetic data object to the debug file. The functions are:

void  ajPhyloDistTrace (const AjPPhyloDist thys);
void  ajPhyloFreqTrace (const AjPPhyloFreq thys);
void  ajPhyloPropTrace (const AjPPhyloProp thys);
void  ajPhyloStateTrace (const AjPPhyloState thys);
void  ajPhyloTreeTrace (const AjPPhyloTree thys);