4.3. Data Definition

4.3.1. Data Definition Format

Every application option must be defined in the ACD file. The definitions of these options (the "data definitions") follow the application definition in the file. All data definitions must be contained within an appropriate ACD file section (see Section 4.1.5, “ACD File Sections”). An error will be generated during ACD processing otherwise.

The general format for data definitions is as follows. The first token is an ACD datatype, the second token the option name, followed by data attributes given between square brackets. Each attribute is a name: value pair:

Datatype: OptionName 
[ 
   DataAttribute1Name: "DataAttribute1Value"
   DataAttribute2Name: "DataAttribute2Value"
]

Datatype must be a valid ACD datatype (Section A.2, “Datatypes”).

These are predefined and include simple types such as integer and float, and biological types such as sequence. Other types are available to define such things as menus (e.g. list) and control whether values of other data definitions are prompted for (e.g. toggle).

OptionName is the name of the option and is synonymous with the qualifier name, parameter name or command line flag. It is used to refer to the data definition from the command line. The value of any option can be set on the command line if the flag is specified before it:

-OptionName OptionValue

The programmer must have a handle on each application option from within the C source code and the flag is used for this purpose too (see Section 6.3, “Handling ACD Files”). The option name can be any string you like within reason, but it should conform to certain conventions (Section 4.3.2, “Parameter Naming Conventions”).

The attributes (Section 4.3.4, “Types of Data Attributes”) give you control over the application parameters. They allow you to specify such things as the user prompt itself and 'help' documentation for each parameter or qualifier. Default values and the requirements for a correct value, such as a permissible range of values (maxima and minima), can be set. They also allow the application programmer to control how the user is prompted for values, if at all (Section 4.5, “Controlling the Prompt”).

The attribute values are the strings enclosed in double quotes and may be parsed during ACD file processing into some other type depending on the data definition in which they occur. For example, this defines a default value:

default: "5.0"

If the above appeared in the definition for a floating point number (float:) then it would be held internally as a floating point number.
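
As an illustrative sketch, the attribute might appear in a complete definition as shown below. The option name threshold is hypothetical and is not taken from any EMBOSS application:

float: threshold
[
    default: "5.0"
]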

Example. wossname has a string input parameter (search) and the string definition has five attributes (parameter:, default:, information:, help: and knowntype:):

string: search 
[
    parameter: "Y"
    default: ""
    information: "Text to search for, or blank to list all
                  programs"
    help: "Enter a word or words here and a case-independent search
           for it will be made in the one-line documentation, group names and
           keywords of all of the EMBOSS programs. If no keyword is
           specified, all programs will be listed."
    knowntype: "emboss keyword"
]

4.3.2. Parameter Naming Conventions

The qualifier or parameter name (the command line flag) of an ACD data definition must be a lower case alphanumerical string without whitespace characters. In theory it can be of any length. In practice, there are conventions (Section A.1, “Introduction to ACD Syntax”).

Some conventions are general and some are specific to individual datatypes; it is strongly recommended that you follow them. In some cases a warning or error message will be generated during ACD processing (i.e. when the ACD file is parsed, typically when the application is run) if you do not.

For example, the conventional name for the sequence datatype (input sequence) is sequence or *sequence and for seqout (output sequence) is outseq or *outseq, where * stands for a prefix. Where more than one instance of a datatype is specified in an ACD file, the characters a, b etc. can be prepended to the flag. For example, use asequence, bsequence etc. where more than one sequence is specified, or aoutseq, boutseq etc. where more than one sequence output stream is specified. You can see this in the matcher application, which takes two input sequences:

sequence: asequence 
[
    parameter: "Y"
    type: "protein"
]

sequence: bsequence 
[
    parameter: "Y"
    type: "stopprotein"
]

4.3.3. ACD Datatypes

4.3.3.1. Groupings of ACD Datatypes

For convenience the available ACD datatypes are organised into five groupings reflecting similar properties or modes of usage as follows:

  • Simple Datatypes

  • Input Datatypes

  • Selection Datatypes

  • Output Datatypes

  • Graphics Datatypes

The groupings are described below and in some cases are subdivided further for convenience. See also a complete description of the available datatypes including their data value, default value and key attributes (Section A.2, “Datatypes”).

4.3.3.2. Simple ACD Datatypes

The simple datatypes (Section A.2.1, “Description of Simple ACD Datatypes”) include primitive and some derived datatypes:

boolean

Simple boolean value

float

Simple floating point number

integer

Simple integer number

string

Simple string

toggle

Simple boolean switch for controlling other parameters

array

List of either integer or floating point numbers

range

Range of sequence positions

regexp

Regular expression pattern

pattern

Sequence pattern

4.3.3.3. Input ACD Datatypes

The input datatypes (Section A.2.2, “Description of Input ACD Datatypes”) cater for input of sequences, sequence features, files and directories, inputs specific to the phylipnew EMBASSY package, data files and other files of biological data.

4.3.3.3.1. Sequence Input

Input datatypes for handling biological sequences include:

sequence

A single sequence for reading

seqall

A set of single sequences that are addressed one after another

seqset

A set of single sequences that can be used all at the same time

seqsetall

One or more sets of single sequences that can be used all at the same time

The data value in all cases is the Uniform Sequence Address (USA) of an input sequence stream. The USA specification is described in the EMBOSS Users Guide.

4.3.3.3.2. Feature Input

There is a single datatype for handling biological sequence feature input:

features

Sequence feature annotation in any supported feature format

The data value is the Uniform Feature Object (UFO) of an input sequence feature stream. The UFO specification is described in the EMBOSS Users Guide.

4.3.3.3.3. Files and Directories

Input datatypes for handling general files and directories of files include:

directory

A directory that can be used for input or output

dirlist

A list of file names that are read from a directory

filelist

A list of input files

infile

Non-sequence-related input file

In all cases, the data value is any valid file or directory name.

4.3.3.3.4. Data Files

Input datatypes for handling data files include:

datafile

A formatted data file read from the standard EMBOSS data search path

matrix

Comparison matrix file (integer values)

matrixf

Comparison matrix file (floating point values)

In all cases, the data value is the name of a file in the EMBOSS data search path (see the EMBOSS Users Guide).

Typically where a comparison matrix is specified, gap penalties will also be required. These must be specified separately in one or more other data definitions.
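
A minimal sketch of how this might look is given below. The option names (datafile, gapopen and gapextend), the prompts and the default values are assumptions chosen for illustration only:

matrix: datafile
[
    additional: "Y"
]

integer: gapopen
[
    additional: "Y"
    default: "10"
    information: "Gap penalty"
]

integer: gapextend
[
    additional: "Y"
    default: "1"
    information: "Gap extension penalty"
]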

4.3.3.3.5. phylipnew EMBASSY Package

Input datatypes specific to the phylipnew EMBASSY package are given below. They allow EMBOSS to provide detailed type checking, and automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.

discretestates

Discrete states file

distances

Distance matrix

frequencies

Frequency value(s)

properties

Property value(s)

tree

Phylogenetic tree

In all cases, the data value is any valid file name.

4.3.3.3.6. Other Biological Inputs

Other biological input datatypes include:

codon

Codon usage table file

cpdb

Protein coordinate data in a simple file format (clean coordinate file)

scop

SCOP and CATH domain classification information in a simple file format (domain classification file)

In all cases, the data value is any valid file name.

4.3.3.4. Output ACD Datatypes

The output datatypes cater for output of sequences, sequence features, alignments, files and directories, outputs specific to the phylipnew EMBASSY package, data files, other files of biological data and formatted application output files (reports). See Section A.2.3, “Description of Output ACD Datatypes”.

4.3.3.4.1. Sequence Output

Output datatypes for handling biological sequences include:

seqout

Output file for single sequence

seqoutall

Output file for multiple sequences

seqoutset

A set of single sequences stored in memory together, to be written to file

The behaviour of these datatypes is identical but they are provided for consistency with the input sequence datatypes (see above).

The data value in all cases is the Uniform Sequence Address (USA) of an output sequence stream (see the EMBOSS Users Guide).

FASTA format is used by default for the output sequence(s). The format is normally set at the command line but a default may be hard-coded with osformat: (see Section 4.3.6.3.2, “Feature Output”).
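
For example, a definition along the following lines would be expected to write a different format by default. This is a sketch only: the format name "embl" is an assumption, so check the EMBOSS Users Guide for the supported format names:

seqout: outseq
[
    parameter: "Y"
    osformat: "embl"
]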

4.3.3.4.2. Features

There is a single output datatype for handling biological sequence features:

featout

Output file for sequence feature annotation

The data value is the Uniform Feature Object (UFO) of an output sequence feature stream (see the EMBOSS Users Guide).

4.3.3.4.3. Alignments

There is a single output datatype for handling alignments:

align

Output file for sequence alignments

The data value is any valid file name.

4.3.3.4.4. Output Files and Directories

Output datatypes for handling general files and directories of files include:

outdir

Output directory for writing of multiple output files

outfile

General output file

outfileall

Multiple general output files

outfile and outfileall are used for data not catered for by some other output ACD datatype. For example, the output file would not normally contain sequence data. They are suitable for general application output in plain text.

The data value is any valid file name.

4.3.3.4.5. Output Data Files

Output datatypes for handling data files include:

outdata

Output file for data formatted cleanly as a table or list

outmatrix

Output file for integer comparison matrix data

outmatrixf

Output file for floating point comparison matrix data

In all cases the data value is any valid file name.

4.3.3.4.6. phylipnew EMBASSY Package

Output datatypes specific to the phylipnew EMBASSY package are given below. They allow EMBOSS to provide detailed type checking, and automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.

outdiscrete

Output file for phylogenetic discrete characteristics data

outdistance

Output file for phylogenetic distance matrix data

outfreq

Output file for phylogenetic character frequency data

outproperties

Output file for phylogenetic property data

outtree

Output file for phylogenetic tree data

In all cases, the data value is any valid file name.

4.3.3.4.7. Other Biological Outputs

Other biological output datatypes include:

outcodon

Output file for codon usage data

outcpdb

Output file for protein coordinate data in CCD (clean coordinate file) format

outscop

Output file for SCOP and CATH domain classification information in DCF (domain classification file) format

The data value is any valid file name.

4.3.3.4.8. Report Output

There is an output datatype for handling formatted application output:

report

Output file for sequence annotation

The data value is any valid file name.

4.3.3.5. Selection ACD Datatypes

The selection datatypes cater for menus (Section A.2.4.2, “selection”).

The user is presented with a limited list of options they can choose from. An option consists of a label and descriptive text. One or more options may be selected. Selection datatypes include:

list

A list of options (typically terse text descriptions) with text labels

selection

A list of options (typically verbose text descriptions) with automatically-generated numerical labels

The data value is one (or more) of the valid options. An option is specified by label (whether text or numerical) or by a non-ambiguous part of the descriptive text itself given after the label. If multiple selections are allowed then the user must supply a comma-separated list of options.
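
The sketch below shows roughly what a list definition might look like. The option name (method), the option labels and descriptions, and the attributes values:, delimiter: and codedelimiter: are not described in this section and are used here as assumptions; see Section A.5, “Datatype-specific Attributes” for the attributes actually available:

list: method
[
    standard: "Y"
    default: "n"
    values: "n:Neighbour joining, u:UPGMA"
    delimiter: ","
    codedelimiter: ":"
    information: "Clustering method"
]

With a definition like this the user could answer the prompt with a label (n or u) or with an unambiguous part of the descriptive text.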

4.3.3.6. Graphics ACD Datatypes

The graphics datatypes cater for graphical output (Section A.2.5, “Description of Graphics ACD Datatypes”).

Graphics datatypes include:

graph

Graphical output of any general kind

xygraph

Graphical output as a simple two dimensional (2D) XY plot with the sequence along the x-axis

The data value is a supported graphics device (Section A.2.5, “Description of Graphics ACD Datatypes”).

4.3.4. Types of Data Attributes

There are three basic types of attributes that may be defined for a data definition:

  • Global attributes

  • Datatype-specific attributes

  • Calculated attributes

Additionally, there are various command line qualifiers that are inbuilt for certain ACD datatypes (see the EMBOSS Users Guide) which may also be defined as attributes in the appropriate data definition, i.e.:

  • Datatype-associated qualifiers

Global attributes are available for all datatypes and can be defined in any ACD data definition.

Datatype-specific attributes, in contrast, are available (can be defined) for certain datatypes only. Each ACD datatype has its own set of specific attributes. The values of global and datatype-specific attributes are set explicitly in the ACD file.

Calculated attributes are a type of datatype-specific attribute that are assigned values after the data definition has been processed (read and validated), for example, once a sequence has been read in from file. This allows the data definition to refer to attributes whose value is not known up front but depends on the input data, and is calculated automatically during ACD file processing.

Datatype-associated qualifiers (or simply associated qualifiers) are defined for single datatypes or groups of related ACD datatypes and are normally only specified on the command line. They may, however, also be "hard coded", i.e. defined as attributes in the appropriate data definition.

4.3.5. Global Attributes

Global attributes are available for all datatypes. They may be defined in any ACD data definition and their value is set explicitly in the ACD file (Section A.4, “Global Attributes”).

Global attributes are defined with a token for the attribute name and a second token for the value of the attribute. The attribute name must be followed by a colon ':' and the value should be enclosed in double quotes (a warning will be generated during ACD processing otherwise):

GlobalAttributeName: "Value"

Most global attributes have string or boolean values. The booleans have a hard-coded default value which can be overridden by "Y", "Yes", "N" or "No" (the strings are case-insensitive). All of the following attributes are therefore valid:

parameter: "YES"
parameter: "Yes"
parameter: "Y"
parameter: "NO"
parameter: "No"
parameter: "N"

The global attributes are summarised below. For convenience they are grouped by function as follows:

  • Parameters and qualifiers

  • User prompting

  • Datatype definition

  • Help information and documentation

  • Hints for GUIs

  • For use by SOAPLAB

4.3.5.1. Parameters and Qualifiers

Each application parameter, i.e. every data definition in the ACD file, can be defined via the appropriate global attribute to be one of a "parameter", "standard qualifier" or "additional qualifier" with the default of "advanced qualifier". These four types have different properties in terms of how they may be specified on the command line, how they are prompted for and the location of the help information. See Section 4.3, “Data Definition” for further information.

The attributes have boolean values and are defined as follows:

parameter:  "Boolean"    ("N")
standard:   "Boolean"    ("N")
additional: "Boolean"    ("N")

The default value is "N"; however, this is never specified explicitly in the ACD file (see Section 4.3, “Data Definition”).
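
As a sketch of how the four types might appear together, consider the fragment below. The option names, datatypes, prompts and default values are illustrative assumptions and are not taken from a real application:

# Parameter
sequence: sequence
[
    parameter: "Y"
    type: "protein"
]

# Standard qualifier
integer: window
[
    standard: "Y"
    default: "10"
    information: "Window size"
]

# Additional qualifier
float: threshold
[
    additional: "Y"
    default: "5.0"
    information: "Threshold value"
]

# Advanced qualifier (none of the three attributes is set)
boolean: summary
[
    default: "N"
    information: "Write a summary line"
]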

Reports. The report datatype, where used, is typically a primary output of an application and, as such, should be defined as a parameter using the parameter: attribute. The first report file must be defined as parameter: (an error will be generated during ACD processing otherwise). It is recommended that subsequent report definitions (second, third and so on) are also defined as parameters (a warning will be generated during ACD processing if they are not). The exception is if the default: or nullok: attributes are set, in which case no warning or error messages are generated as the application can run with a default or without any value for the definition respectively.

Sequences. Sequence inputs (sequence, seqall, seqsetall or seqset ACD datatypes) and outputs (seqout, seqoutall and seqoutset ACD datatypes) are typically the primary input or output of an EMBOSS application, and as such should be defined as parameters in the same way as for the report datatype above.

Features. Sequence features, where used, are typically the primary input (features ACD datatype) or output (featout ACD datatype) and, as such, should be defined as parameters in the same way as for the report datatype.

Alignments. The align datatype, where used, is typically the primary output and should be defined as a parameter. The first and subsequent alignment outputs should be defined as parameter: types; a warning will be generated during ACD processing if they are not. The exception is if the default: or nullok: attributes are set, in which case no warning or error messages are generated as the application can run with a default or without any value for the definition respectively.

Files. File datatypes, where they are used, are typically the primary input (infile, filelist, directory or dirlist ACD datatypes) or output (outdir ACD datatype) and should be defined as parameters in the same way as for the report datatype described above.

4.3.5.2. User Prompting

The attributes below are used to provide the text that will be used on the screen to prompt the user for values. Whether a prompt will appear at all depends on whether the option was defined as a parameter or one of the qualifier types (see above) and on the command used to invoke the application (i.e. what options were specified). In some cases, the prompt may or may not appear depending on the value of other inputs. For more information see Section 4.5, “Controlling the Prompt”.

The attributes are defined as follows (the strings are empty by default):

information: "String"      ("")
code: "String"             ("")
prompt: "String"           ("")

Only one of code:, prompt: or information: should ever be defined. The use of information: (with a standard name, see below) is preferred. In practice, prompt: is only ever required in the rare cases where the information: string might be misleading.

To provide standard prompts a default value for the information: string is defined for most common datatypes (see Section 4.3.5.2.1, “Standard Prompts File (codes.english)”). The standard practice is, where possible, to use the default prompt for all input and output ACD datatypes. A warning will be generated during ACD processing if either the information: attribute is missing or if the value used is not the standard value (where a standard value is available).

If a non-standard prompt is used, the text given after the information: attribute should conform to certain conventions (Section A.4.4.1, “information: "String" ("")”).

In the example below a non-standard prompt is defined for asequence. A warning will be generated if you try to run this:

sequence: asequence 
[
    standard: "Y"
    information: "Enter filename"
]

In the next example, no information: string is specified, therefore the standard prompt of "Read sequence from" (from codes.english) would be used instead:

sequence: asequence 
[
    standard: "Y"
]

4.3.5.2.1. Standard Prompts File (codes.english)

To provide standard prompts, a default value for the information: string is defined for most common datatypes. The defaults are in the EMBOSS system file:

codes.english

in the application ACD file directory, e.g.:

.../emboss/emboss/emboss/acd

A default has the name DEFDatatypeName where DatatypeName is the name of the ACD datatype. The file also contains some additional standard prompts for specific instances of individual datatypes, identified by a code (not beginning with DEF). These prompts can be used with the code: attribute, for example code: "GAP".

An excerpt of the file is shown below:

#
# This is the EMBOSS coded prompts file. Any messages found here will
# override those in the .acd file, and can be translated into other
# languages. The file extension (default "english") is set by the
# variable "emboss_language"
#
# Messages are referred to by .acd files as code: NAME
#
# DEFXXXX codes are automatically searched for ACD type XXXX if there
# is no information (or prompt) defined in the ACD file of as a
# standard prompt for the ACD type.
#
# HELPXXXX codes are automatically searched for ACD type XXXX if there
# is no help text in the ACD file.
#

# Default prompts: these are used where no prompt for a data type
# has been provided.

DEFALIGN     "Write output alignment to"
DEFREPORT    "Write output report to"
DEFINTEGER   "Enter a number"
DEFFLOAT     "Enter a number"
DEFBOOL      "Yes or No"
.
.
.
# Gap penalties

GAP          "What gap penalty"
GAPEXT       "What gap extension penalty"
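
For instance, a gap penalty option might pick up one of the coded prompts from the excerpt above as follows. This is a sketch; the option name gapopen and the default value are assumptions:

float: gapopen
[
    standard: "Y"
    code: "GAP"
    default: "10.0"
]

With this definition the prompt text would be expected to come from the GAP entry in codes.english ("What gap penalty").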

4.3.5.3. Datatype Definition

Five global attributes are associated with, or describe, the data to which the ACD definition refers. For example, a parameter can be given a default value, be assigned to a known type taken from a controlled vocabulary, or its relations to other parameters described. The attributes are defined as follows:

knowntype: "String"        ("")
default: "Value"           ("")
relations: "String"        ("")
outputmodifier: "Boolean"  ("N")
missing: "Boolean"         ("N")

knowntype: is a string describing the type of a data definition and should be defined where the type is not already clear from the datatype itself. It is typically defined for string:, infile:, outfile: and outfileall: datatypes but not, for example, for a sequence:.

Valid known type strings are listed in the file knowntypes.standard (Section 4.3.5.3.1, “Application Data Known Types File (knowntypes.standard)”). A few other values are accepted, for example "ApplicationName output" for an outfile datatype. These are documented with the datatypes (Section A.2, “Datatypes”).

If a value is given that is not a standard known type or other accepted value, a warning message will be generated during ACD processing. The acdvalid utility will check all knowntype values in an ACD file, and report any missing values for data definitions that require a known type.
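
As a sketch, an outfile definition for the wossname application that uses the "ApplicationName output" form mentioned above might read as follows. Defining it as a parameter is an assumption made for illustration:

outfile: outfile
[
    parameter: "Y"
    knowntype: "wossname output"
]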

4.3.5.3.1. Application Data Known Types File (knowntypes.standard)

The standard values (known types) are read from the EMBOSS system file:

knowntypes.standard

which can be found in the application ACD file directory:

.../emboss/emboss/emboss/acd

An excerpt of this file is shown below:

# Known Types
# Knowntype_string                Type           Comment
aaindex_data                      file           AAINDEX entry
aaindex_database                  file           AAINDEX database
abi_trace                         file           ABI sequencing trace
ajint_ajlong_data                 file           Standard format info on ajint and ajlong
alistat_input                     file           HMMER alistat program input
alistat_output                    file           HMMER alistat program output
amino_acid_classification         file           Amino acid chemical classes data
.
.
.

4.3.5.4. Help Information and Documentation

Every EMBOSS application accepts the global qualifiers -help and -verbose, the latter used in combination with the former (if at all). When an application is run with -help, helpful information is printed to the screen, including all the program parameters and qualifiers with explanatory text alongside.

The attributes below define the text that's used:

help: "String"             ("")
valid: "String"            ("")
expected: "String"         ("")

help: is usually only defined if a deeper explanation of an application parameter is needed. If help: is not defined, the value of the information: attribute (if available) or the default help string (see below) will be used instead. valid: and expected: are used to describe the allowed and expected values for the online documentation. They are not usually required as, in most cases, a reasonable value is generated automatically.

An example where help: is helpful:

integer: window
[
    standard: "N"
    default: "10"
    minimum: "5"
    maximum: "100"
    information: "Window size"
    help: "Number of residues used to calculate the value for each point"
]

The help: string must conform to certain conventions (Section A.4.6.1, “help: "String" ("")”).

4.3.5.4.1. Standard Help Strings File (codes.english file)

Default help strings are given for each datatype in the EMBOSS system file codes.english (see Section 4.3.5.2.1, “Standard Prompts File (codes.english)” above). They have the general form:

HELPDatatypeName

where DatatypeName is the name of the ACD datatype.

For example:

HELPSEQUENCE  "Sequence USA"
HELPDIRECTORY "Directory name"

4.3.5.5. Hints for GUIs

The needed: global attribute indicates whether a parameter is expected to be included in a GUI form. It is used to help GUI developers and more such attributes may be added in the future as support for external developers is improved. Certain datatype-specific attributes, for example button: which can be defined for selection lists, are also used to provide such hints. See Section A.5, “Datatype-specific Attributes”.

The attribute is defined as follows:

needed: "Boolean"            ("Y")

4.3.5.6. For use by SOAPLAB

The following attributes provide metadata for the SOAPLAB project. They are not used by EMBOSS directly and should not appear in standard EMBOSS application ACD files.

The attributes are defined as follows:

qualifier: "String"        ("")
template: "String"         ("")
comment: "String"          ("")

4.3.6. Datatype-specific Attributes

Datatype-specific attributes given within an ACD data definition define the characteristics of application options. They control such things as:

  • The name and location of input and output files

  • Detailed type-checking of inputs to ensure data is as required by the application

  • Exactly what data is generated on output

  • The appearance and behaviour of the user interface at the command line

  • Validation of user input, e.g. to ensure user input is within permissible ranges

  • Validation of application output, e.g. to ensure it is as expected

  • Suggestions of graphical aspects for other interfaces to EMBOSS applications

Datatype-specific attributes are defined for individual datatypes or for groups of related datatypes. The key attributes for each datatype are summarised below, organised by ACD datatype grouping ("simple", "input", "output" etc). For descriptions of all the available attributes see Section A.5, “Datatype-specific Attributes”.

4.3.6.1. Attributes for Simple ACD Datatypes

These are used to validate user input.

4.3.6.1.1. Validation of User Input

The validation possible depends on the available attributes and therefore the datatype in question.

minimum: and maximum: can be defined for several datatypes and restrict the value within a given range. Where minimum: and maximum: attributes have calculated values (Section 4.4.4, “Calculations and Tests”) it is theoretically possible for the maximum to be less than the minimum. In such cases either the maximum or minimum might be required, depending on the application in question. The following attributes are used:

  • trueminimum: Boolean value; if "Y" the minimum value is used if the minimum and maximum values overlap.

  • failrange: Boolean value; if "Y" the application fails if the calculated ranges overlap.

  • rangemessage: String value; failure message to use if calculated ranges overlap.

An ACD file with a calculated range must specify the failrange: attribute; otherwise a "failrange is required" warning message is generated. If you set failrange: "Y" you need to define a message (rangemessage:) explaining to the end user why the range might fail. If you set failrange: "N" the calculated range is accepted, but you also need to set trueminimum: to say whether you want the minimum value to apply (usually to avoid getting negative values) or the maximum (to avoid values becoming too large). In both cases, warning messages are generated if the required attributes are not given.
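
The two cases might look something like the sketch below. The option names, the message text and the calculated value $(sequence.length) are assumptions for illustration, and a sequence input named sequence is assumed to be defined earlier in the file. The syntax for calculated values is covered in Section 4.4.4, “Calculations and Tests”:

# Fail with a message if the calculated range is invalid
integer: window
[
    standard: "Y"
    default: "10"
    minimum: "5"
    maximum: "$(sequence.length)"
    failrange: "Y"
    rangemessage: "Sequence is shorter than the minimum window size"
    information: "Window size"
]

# Accept the calculated range; trueminimum: states which limit applies
integer: step
[
    additional: "Y"
    default: "5"
    minimum: "5"
    maximum: "$(sequence.length)"
    failrange: "N"
    trueminimum: "Y"
    information: "Step size"
]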

Attributes to control the length (minlength: and maxlength:) and case (upper: "Y" or lower: "Y") of a regular expression or sequence pattern are available for regexp and pattern datatypes.
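
For example, a regular expression input might be constrained as in the sketch below. The option name, the prompt and the particular limits are illustrative assumptions:

regexp: pattern
[
    standard: "Y"
    minlength: "2"
    maxlength: "20"
    upper: "Y"
    information: "Regular expression pattern"
]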

4.3.6.1.2. String Types

When a string is defined, a known type for it should be specified using knowntype: (see Section 4.3.5.3.1, “Application Data Known Types File (knowntypes.standard)”): a warning message will be generated during ACD processing otherwise. A regular expression can be used to validate a string if necessary.

4.3.6.2. Attributes for Input ACD Datatypes

These are used to specify the input data and to validate user input.

4.3.6.2.1. Sequence Input

sask: is available for all sequence inputs and sets the default for the -sask qualifier. If set to "Y" it specifies that a sequence begin and end position, and the reversing of a nucleotide sequence, will be prompted for.

nulldefault: overrides the default name generation and uses an empty string (no sequence input) as the default, for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with nullok: and missing:, this allows qualifiers to be null by default, and turned on from the command line.

The sequence input datatypes (sequence, seqall, seqset and seqsetall) share a common set of attributes used to define and validate the sequence input. The type of sequence can be restricted with type:, so that the program accepts, for example, only DNA sequences. The type must be a standard sequence type, for example dna, pureprotein, gaprna etc. See Section A.7, “Sequence Types”.

4.3.6.2.2. Feature Input

For all the sequence inputs (sequence, seqall, seqset and seqsetall datatypes), sequence features (Section 6.9, “Handling Features”) can be read if the features: attribute is set.

The type of features can be restricted by setting type:. For example, the program can be made to accept only nucleotide features. The feature type must be one of protein or nucleotide. There is a default based on the type of an input sequence (where used), but a value should be specified. If no type is specified for input features and there is no sequence input from which to take a default type, then an error will be generated during ACD processing.
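
A sketch of both cases is given below as two independent fragments. The option name featin for the features input is hypothetical, as is the choice to define both as parameters:

# Sequence input that also reads any feature annotation
sequence: sequence
[
    parameter: "Y"
    type: "dna"
    features: "Y"
]

# Separate feature input restricted to nucleotide features
features: featin
[
    parameter: "Y"
    type: "nucleotide"
]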

4.3.6.2.3. phylipnew EMBASSY Package

Attributes of the phylipnew package datatypes provide detailed type checking and can automatically detect and validate the various alternative formats that phylipnew supports, without the need for complex extra command line options. See Section A.5, “Datatype-specific Attributes”.

length: specifies the number of property values per set (properties datatype) or the number of frequency loci / values per set (frequencies datatype).

size: specifies the number of discrete state sets (discretestates datatype), the number of frequency sets (frequencies datatype) or the number of trees (tree datatype).

4.3.6.2.4. Naming of Input Files

The name, extension and directory or fullpath attributes are defined for several datatypes and define the name and location of the input data. In some cases a default naming scheme or a hard-coded default is available but can be overridden by using these attributes. A default value for the input file name can also be set by defining the default: global attribute. Data files, for example, often have a hard-coded filename and you are free to define this using name: or default:.

In cases where a default file naming scheme is available, but a default value is also specified in the ACD file, then the default value in the ACD file can be overridden and the naming scheme used if a null value ("") for the parameter is given on the command line. It's necessary to set the global attribute missing: for such data definitions.
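For illustration, a data file definition with a hard-coded file name might look like the sketch below. The option name aadata and the file name Eamino.dat are used only as an example; check the datatype-specific attributes for the datatype you are using.

```acd
# Hypothetical data file input with a hard-coded name
datafile: aadata
[
  additional: "Y"
  name: "Eamino"       # base name of the data file
  extension: "dat"     # file extension, giving Eamino.dat
]
```

The same file could alternatively be named with the global attribute, for example default: "Eamino.dat".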

4.3.6.2.5. Missing Input Files

nullok: is available for most input datatypes. It is a boolean attribute and specifies whether a missing input file is acceptable. If the application can accept a null value for this definition and can run without the corresponding input file, the nullok: attribute must be set to "Y". This allows a default value to be omitted or the application to be run with -noFlag on the command line (where Flag is the data definition flag) to specify the input file is not available.
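A minimal sketch of an optional input file is shown below. The option name weights is hypothetical.

```acd
# Hypothetical optional input file
infile: weights
[
  additional: "Y"
  information: "Weights file (optional)"
  nullok: "Y"          # the application can run without this file
]
```

With this definition the application could be run with -noweights on the command line to state that no weights file is available.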

4.3.6.2.6. Input File Datatype Identification

The infile datatype is used for general application input. The type of data can be identified using knowntype:. This allows inputs to be matched to outputs where knowntype: is also set for an outfile datatype definition. The known types are validated against a set of standard EMBOSS known types: a warning message is generated if the specified type does not match a standard name. See Section A.4, “Global Attributes”.

knowntype: should also be set on the filelist and dirlist datatypes.

4.3.6.3. Attributes for Output ACD Datatypes

These are used to specify the output file and in some cases exactly what data is generated and how it is validated.

4.3.6.3.1. Sequence and Sequence Feature Output

For sequence and feature output datatypes (featout, seqout, seqoutset and seqoutall), if the name: attribute is not defined in the ACD file, it will default to the name of the first sequence that is read in (if available). This is equivalent to the calculated attribute name: of the input sequence. See Section A.6, “Calculated Attributes”.

The type of sequence can be restricted by setting type:. The application will then validate that the output is of the specified type. The type must be a standard sequence type, see Section A.7, “Sequence Types”.

If no type is specified for an output sequence, and there is no sequence input from which to take a default type, then an error will be generated during ACD processing.

4.3.6.3.2. Feature Output

Sequence features can be written alongside an output sequence (seqout, seqoutall and seqoutset datatypes) if their features: attribute is set. For all these datatypes, the default data format can be specified with osformat: which the -osformat associated qualifier can override.

The type of features can be restricted by setting type:. There is a default based on the type of an input sequence (where used), but it should be specified for validation purposes. The feature type must be one of protein or nucleotide. If no type is specified for output features, and there is no sequence input from which to take a default type, then an error will be generated during ACD processing.

4.3.6.3.3. Alignment Output

Any seqset or seqsetall datatype must have the aligned: attribute set: an error will be generated during ACD processing otherwise. Handling of sequence alignments is covered in detail elsewhere (Section 6.11, “Handling Alignments”).

Alignment format can be set with the aformat: attribute.

For an align data definition, minseqs: and maxseqs: set the expected minimum and maximum number of sequences. The multiple: boolean attribute should be set to "Y" if the output can contain more than one alignment from the same input.
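As a sketch, an alignment output for a pairwise alignment program might be defined as follows. The option name outfile and the choice of the srspair format are assumptions for the example.

```acd
# Hypothetical pairwise alignment output
align: outfile
[
  parameter: "Y"
  aformat: "srspair"   # default alignment format, can be overridden with -aformat
  minseqs: "2"         # expect at least two sequences
  maxseqs: "2"         # and at most two (pairwise)
  multiple: "Y"        # more than one alignment may be written per input
]
```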

4.3.6.3.4. Report Output

rformat: specifies the report format to use, which must be one of the supported report formats (see the EMBOSS Users Guide).

multiple: is a boolean attribute which should be set to "Y" if the output can contain more than one report from the same input.

type: is defined as one of "protein" or "nucleotide" where the report format is one of the standard feature table formats (see the EMBOSS Users Guide).

taglist: defines the tag / value pairs from the internal feature table to be reported in the output.
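The report attributes above might be combined as in the sketch below. The option name is hypothetical, and the taglist: value follows the type:tag=label pattern seen in ACD files distributed with EMBOSS; treat the exact tag syntax as an assumption and check it against the attribute reference.

```acd
# Hypothetical report output
report: outfile
[
  parameter: "Y"
  rformat: "table"             # default report format, can be overridden with -rformat
  multiple: "Y"                # more than one report may be written per input
  taglist: "int:score=Score"   # assumed tag syntax, illustrative tag name
]
```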

4.3.6.3.5. Naming of Output Files

name: and extension: are available for many datatypes. The output filename is constructed from the name: and extension: values and has the format:

name.extension

In some cases the output file name and extension have default values. For example, there is a choice of output formats for the outcodon datatype. The name: attribute defaults to outfile and the extension: attribute defaults to the format name, with cut defined as the default format to match the usual codon usage file-naming convention.

In cases where a default file naming scheme is available, but a default value is also specified in the ACD file, the default value can be overridden and the naming scheme used if a null value ("") for the parameter is given on the command line. It's necessary to set the global attribute missing: for such data definitions.

For some datatypes, datatype-specific command line qualifiers (see the EMBOSS Users Guide) are available and can be used to name the output file, either by specifying the qualifier on the command line or by hard-coding it as an attribute in the ACD file. For example, an alignment filename with the format aname.aextension is constructed if the qualifiers -aname and -aextension are specified. Values may be hard-coded with the corresponding aname: and aextension: attributes.

A default value for the output file name can also be set by defining the default: global attribute.
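A minimal sketch of explicit output file naming is given below, assuming a plain outfile definition. The option name and the values are chosen for illustration.

```acd
# Hypothetical output file named from name: and extension:
outfile: outfile
[
  parameter: "Y"
  name: "results"      # base name
  extension: "txt"     # extension, giving results.txt
]
```

For an alignment output the corresponding values would be given with aname: and aextension:, or on the command line with -aname and -aextension.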

4.3.6.3.6. Missing Output Files

nullok: and nulldefault: are available for all output datatypes. nullok: is a boolean attribute and specifies whether a missing output file is acceptable to the application. If the application can accept a null value and can run without generating the corresponding output file, the nullok: attribute must be set to "Y". This allows a default value to be omitted or the application to be run with -noFlag (where Flag is the data definition flag) to specify the output file is not to be generated.

nulldefault: overrides the default name generation and uses an empty string (no output file) as the default, for programs where an output file is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead.
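The sketch below shows an output file that is not written unless the user asks for it. The option name logfile is hypothetical.

```acd
# Hypothetical output file that is off by default
outfile: logfile
[
  additional: "Y"
  nullok: "Y"          # the application can run without writing this file
  nulldefault: "Y"     # default is an empty string (no output file)
]
```

Under these assumptions, giving an empty string for -logfile on the command line would turn the output on using the standard default name, while -nologfile would state explicitly that no file is to be generated.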

4.3.6.3.7. Output Datatype Identification

The outfile datatype is used for general application output. The type of data is identified using knowntype:. This allows the known type of an outfile to be matched to that of an infile data definition to identify cases where the output of one application can be used as the input to another. The known types are validated against a set of standard EMBOSS known types: a warning message is generated if the specified type does not match a standard name.
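The matching of known types can be sketched with two definitions taken from the ACD files of two hypothetical applications. The value "example data" is a placeholder and is not a standard EMBOSS known type, so it would trigger the warning described above; a real ACD file should use a name from the standard set.

```acd
# In the ACD file of the first (hypothetical) application
outfile: outfile
[
  parameter: "Y"
  knowntype: "example data"   # placeholder, not a standard known type
]

# In the ACD file of the second (hypothetical) application
infile: infile
[
  parameter: "Y"
  knowntype: "example data"   # same value, so the output can be matched to this input
]
```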

4.3.6.3.8. Specification of Output Data

Attributes for some datatypes are used to control exactly what data is generated and to validate that data. For example, an alignment output file is defined in the same way as a plain output file (outfile datatype) but has extra attributes to allow a choice of alignment formats and to specify, for validation purposes, the expected number of aligned sequences. Similarly, a report output file is defined in the same way as a plain output file (outfile) but has extra attributes to allow a choice of report formats.

For all of the phylip datatypes the outmatrix and outmatrixf datatypes are available. The default data format can be specified with oformat: which the -oformat associated qualifier can override.

The default data format of codon, cpdb and scop datatypes can be specified with oformat: which the -oformat associated qualifier can override.

4.3.6.4. Attributes for Selection ACD Datatypes

These control aspects of the menus (lists of options) that the user is presented with: for example, the text appearing above the menu, how user input is validated, the minimum and maximum number of selections, and whether the options are case-sensitive.

Note

The information: attribute, definable for all datatypes, defines text to be used as a prompt after the list.

For example, consider this ACD list definition:

list: matrix 
[
     default: "blosum"               # default value
     minimum: 1 maximum: 1           # must select exactly 1
     header: "Comparison matrices"   # printed before list
     values: "B:blosum, P:pam, I:id" # 3 valid values
     delim: ","                      # delimiter default ";"
     codedelim: ":"                  # label delimiter default ":"
     prompt: "Select one"            # prompt after list
     button: Y                       # use radio buttons rather than checkboxes in HTML, ignored by ACD.
]

What you get on screen is:

Comparison matrices

      B : blosum
      P : pam
      I : id

Select one [blosum] : PAM

With this ACD select definition:

select: matrix 
[
     default: "blosum"             # default value
     minimum: "1" maximum: "1"     # must select exactly 1
     header: "Comparison matrices" # printed before list
     values: "blosum, pam, id"     # valid values
     delimiter: ","                # delimiter default ";"
     information: "Select one"     # prompt after list
     button: "Y"                   # use radio buttons rather than checkboxes in HTML, ignored by ACD
]

You get:

Comparison matrices

      1 : blosum
      2 : pam
      3 : id

Select one [blosum] : PAM

4.3.6.5. Attributes for Graphics ACD Datatypes

These specify the graph data that are output.

gtitle: specifies the graph title for a graph or graphxy datatype. Many other graphical elements can be set.

multiple: specifies the number of multiple graphs in a single graph or graphxy datatype output.

The nullok: and nulldefault: attributes are available for both graphics datatypes. They have exactly the same meaning as for the output datatypes (see above).
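A minimal sketch of a graphics definition with a hard-coded title follows. The option name and title text are assumptions for illustration.

```acd
# Hypothetical graph output with a default title
graph: graph
[
  standard: "Y"
  gtitle: "Hydropathy plot"   # graph title
  nullok: "Y"                 # the application can run without graphical output
]
```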

4.3.7. Attributes for Datatype-associated Qualifiers

Various command line qualifiers are inbuilt for certain ACD datatypes. These datatype-associated qualifiers are normally specified only on the command line. They may, however, also be "hard coded" as attributes in the appropriate data definition and it is sometimes desirable to do this. Some examples for setting file format are shown below:

Datatype                  Qualifier    Attribute
Input sequence            -sformat     sformat:
Output sequence           -osformat    osformat:
Output alignment          -aformat     aformat:
Report output file        -rformat     rformat:
Sequence feature input    -fformat     fformat:
Sequence feature output   -offormat    offormat:

The qualifiers are described in more detail in the EMBOSS Users Guide.
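As an illustration of hard-coding such a qualifier, the sketch below fixes the default format of a sequence output. The option name outseq and the choice of fasta are assumptions for the example.

```acd
# Hypothetical sequence output with a hard-coded default format
seqout: outseq
[
  parameter: "Y"
  osformat: "fasta"    # default output format for this definition
]
```

The user could still select a different format for a particular run by giving -osformat on the command line.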

4.3.8. Introduction to Calculated Attributes

Calculated attributes are datatype-specific attributes that are assigned values after an input file has been read during ACD file processing. The values are calculated or extracted from the actual data that an ACD data definition refers to.

Most calculated attributes are for datatypes for sequence input (sequence, seqall, seqset and seqsetall) and sequence feature input (features). For instance, for a sequence datatype, the length and type of sequence are available once the sequence file or an entry from it has been read.

For a complete description of the available calculated attributes see Section A.6, “Calculated Attributes”.

4.3.8.1. Retrieving Values of Calculated Attributes

Values of calculated attributes are retrieved by an operation (Section 4.4, “Operations”) from within the ACD file. The operation uses a ParameterName.AttributeName term surrounded by parentheses with a dollar sign ($) at the front:

$(ParameterName.CalculatedAttributeName)

The $ syntax means "get the value of", in this case the term enclosed by parentheses.

This is typically done from within a data definition that relies on the value of a calculated attribute from some other data definition. For example, the following ACD file excerpt defines a sequence parameter and an integer window:

sequence: sequence 
[
  parameter: "Y"
  type: pureprotein
]

integer: window 
[
  standard: "y"
  default: "10"
  maximum: "$(sequence.length)"
]

The maximum window size (the maximum: attribute of the window data definition) is set to the length of the sequence by using maximum: "$(sequence.length)".

4.3.8.2. Sequence Calculated Attributes

All of the sequence input datatypes:

sequence
seqall
seqset
seqsetall

have six calculated attributes:

begin

Start residue (-sbegin value)

end

End residue (-send value)

length

Length

protein

Boolean, True if sequence is protein

nucleic

Boolean, True if sequence is nucleic

name

Name

Let's assume we've defined a sequence input called asequence and have an ACD file looking something like:

sequence: asequence 
[
  parameter: "Y"
  type: protein
]

The attributes would be referred to as follows:

  • asequence.begin

  • asequence.end

  • asequence.length

  • asequence.protein

  • asequence.nucleic

  • asequence.name

These are properties of an input sequence that can be queried within ACD.

Note

The values of the begin and end attributes can be set on the command line by specifying -sbegin or -send, which are inbuilt qualifiers for the sequence datatypes.

The seqset and seqsetall datatypes have the following calculated attributes:

totweight

Total sequence weight for the set(s) of sequences

count

Number of sequences in the set(s)

seqsetall also has:

multicount

Number of sets of sequences

To retrieve calculated attribute values from the definition of asequence shown above you can use:

$(asequence.begin)
$(asequence.end)
$(asequence.length)
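The set-level attributes are retrieved in the same way. As a sketch, assuming a sequence set input named sequences and an integer option named nseqs (both hypothetical), the number of sequences in the set could be used as an upper limit:

```acd
# Hypothetical aligned sequence set input
seqset: sequences
[
  parameter: "Y"
  type: "protein"
  aligned: "Y"
]

# Hypothetical integer option limited by the number of sequences read
integer: nseqs
[
  standard: "Y"
  minimum: "1"
  maximum: "$(sequences.count)"
]
```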

Example. When writing a program to insert one sequence into another, one way to make sure that the insertion position isn't greater than the length of the first sequence is to use code like the following:

    if(position > ajSeqGetLen(seq))
       ajFatal("Insertion position out of bounds");

The problem is that the program can prematurely terminate after the user has gone to all the effort of configuring the application (entering all the inputs). What would be better is if the interface forced the correct input, and there is a way to achieve that by using calculated attributes in the ACD file itself.

Now, instead of hard-coding the check for the sequence insertion position you just need to add:

maximum: $(asequence.end)

to the integer definition of the insert position. An ACD file with two input sequences and an integer insert position might look something like:

sequence: targetsequence 
[
  parameter: "Y"
  type: "protein"
]

sequence: insertsequence 
[
  parameter: "Y"
  type: "protein"
]

integer: position
[
  parameter: "Y"
  maximum: $(targetsequence.end)
]

You may of course retrieve the value of calculated attributes from both sequences. In the following example, a sequence window is defined which cannot be any longer than the sum of the lengths of the two sequences. A calculation is used, which therefore has to be enclosed by parentheses and started with the 'at' symbol ('@'):

integer: window
[
  parameter: "Y"
  maximum: "@($(targetsequence.length) + $(insertsequence.length))"
]

These calculated attributes are also useful for conditional statements. In the example below the sequence input is of the any type. A ternary conditional (see Section 4.4, “Operations”) is being used to set a default substitution matrix based on the automatically determined type of sequence:

sequence: sequence 
[
  parameter: "Y"
  type: "any"
]

matrix: datafile
[
  parameter: "Y"
  default: "@($(sequence.protein) ? EBLOSUM62 : EDNAFULL)"
  # ... other attributes ...
]

4.3.8.3. Feature Calculated Attributes

The sequence feature input datatype:

features

has seven calculated attributes:

fbegin

Start of the features to be used (-fbegin value)

fend

End of the features to be used (-fend value)

flength

Total length of sequence

fprotein

Boolean, True if feature table is protein

fnucleic

Boolean, True if feature table is nucleotide

fname

Name of the feature table

fsize

Number of features

These are properties of an input feature table that can be queried within ACD.

Note

The value of the fbegin and fend attributes can be set on the command line by specifying -fbegin or -fend which are inbuilt qualifiers for the features datatype.

Assuming you've defined a feature input called afeatures and have an ACD file looking something like:

features: afeatures
[
  parameter: "Y"
]

The attributes would be referred to as follows:

  • afeatures.fbegin

  • afeatures.fend

and so on.

To retrieve calculated attribute values from your definition of afeatures above you can use:

$(afeatures.fbegin)
$(afeatures.fend)
$(afeatures.flength)
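By analogy with the sequence examples, a feature calculated attribute can limit another option. In the sketch below the integer option maxfeatures is hypothetical, and its maximum is set to the number of features read, using the fsize attribute listed above:

```acd
# Hypothetical integer option limited by the number of input features
integer: maxfeatures
[
  standard: "Y"
  minimum: "1"
  maximum: "$(afeatures.fsize)"
]
```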

Example. When writing a program to compare the features of two sequences, which might be protein or nucleotide sequences, you need to ensure that both sequences are of the same type. One way to do this from the application code is to call an AJAX function to return the type of both, and exit if they are not the same:

if((ajFeattableIsProt(features1) && ajFeattableIsNuc(features2)) ||
   (ajFeattableIsNuc(features1) && ajFeattableIsProt(features2)))
    ajFatal("Input feature tables are not of the same type");

The problem with this code is that the application will terminate once it is running if the feature tables are not of the same type. The user, having gone to the effort of configuring the application, will not be impressed! Far better if the interface forced the correct input before the application proper started, which can be achieved by using calculated attributes in the ACD file itself.

Assume the application takes two feature table parameters as input and the ACD file therefore looks something like:

features: afeatures
[
   parameter: "Y"
]

features: bfeatures
[
   parameter: "Y"
]

Instead of hard-coding the check for feature type, it can be enforced on the second of the features that is read: the first feature table can be of the any type but the second must match the first. This is achieved by declaring the type: attribute for the feature tables. The type: of the second feature table is set to that of the first by retrieving the fprotein calculated attribute of the first feature table in a ternary conditional (see Section 4.4, “Operations”):

features: afeatures
[
   parameter: "Y"
   type: "any"
]

features: bfeatures
[
   parameter: "Y"
   type: "@($(afeatures.fprotein) ? protein : nucleotide)"
]