Every application option must be defined in the ACD file. The definitions of these options (or "data definitions") follow the application definition in the file. All data definitions must be contained within an appropriate ACD file section (see Section 4.1.5, “ACD File Sections”); otherwise an error will be generated during ACD processing.
The general format for data definitions is as follows. The first token is an ACD datatype, the second token the option name, followed by data attributes given between square brackets. Each attribute is a name: value pair:
Datatype: OptionName [
  DataAttribute1Name: "DataAttribute1Value"
  DataAttribute2Name: "DataAttribute2Value"
]
Datatype must be a valid ACD datatype (Section A.2, “Datatypes”). These are predefined and include simple types such as integer and float, and biological types such as sequence. Other types are available to define such things as menus (e.g. list) and to control whether values of other data definitions are prompted for (e.g. toggle).
OptionName is the name of the option and is synonymous with the qualifier name, parameter name or command line flag. It is used to refer to the data definition from the command line. The value of any option can be set on the command line if the flag is specified before it:
-OptionName Value
The programmer must have a handle on each application option from within the C source code and the flag is used for this purpose too (see Section 6.3, “Handling ACD Files”). The option name can be any string you like within reason, but should conform to certain conventions (Section 4.3.2, “Parameter Naming Conventions”).
The attributes (Section 4.3.4, “Types of Data Attributes”) give you control over the application parameters. They allow you to specify such things as the user prompt itself and 'help' documentation for each parameter or qualifier. Default values and the requirements for a correct value, such as permissible ranges of values (maxima and minima), can be set. They also allow the application programmer to control how the user is prompted for values, if at all (Section 4.5, “Controlling the Prompt”).
The attribute values are the strings enclosed in double quotes and may be parsed during ACD file processing into some other type depending on the data definition in which they occur. For example, this defines a default value:
default: "5.0"
If the above appeared in the definition for a floating point number (float:) then it would be held internally as a floating point number.
Example. wossname has a string input parameter (search) and the string definition has five attributes (parameter:, default:, information:, help: and knowntype:):
string: search [
  parameter: "Y"
  default: ""
  information: "Text to search for, or blank to list all programs"
  help: "Enter a word or words here and a case-independent search for it will be made in the one-line documentation, group names and keywords of all of the EMBOSS programs. If no keyword is specified, all programs will be listed."
  knowntype: "emboss keyword"
]
The qualifier or parameter name (the command line flag) of an ACD data definition must be a lower case alphanumerical string without whitespace characters. In theory it can be of any length. In practice, there are conventions (Section A.1, “Introduction to ACD Syntax”).
Some conventions are general and some specific to individual datatypes; it is strongly recommended that you follow them. In some cases a warning or error message will be generated during ACD processing (i.e. when the ACD file is parsed, typically after the application is run) if you do not.
For example, the conventional name for the sequence datatype (input sequence) is sequence or *sequence and for seqout (output sequence) is outseq or *outseq. In other words, where more than one instance of a datatype is specified in an ACD file, then the characters a, b etc can be prepended to the flag. For example, use asequence, bsequence etc where more than one sequence is specified, or aoutseq, boutseq etc where more than one sequence output stream is specified. You can see this in the matcher application, which takes two input sequences:
sequence: asequence [
  parameter: "Y"
  type: "protein"
]
sequence: bsequence [
  parameter: "Y"
  type: "stopprotein"
]
For convenience the available ACD datatypes are organised into five groupings reflecting similar properties or modes of usage as follows:
Simple Datatypes
Input Datatypes
Selection Datatypes
Output Datatypes
Graphics Datatypes
The groupings are described below and in some cases are subdivided further for convenience. See also a complete description of the available datatypes including their data value, default value and key attributes (Section A.2, “Datatypes”).
The simple datatypes (Section A.2.1, “Description of Simple ACD Datatypes”) include primitive and some derived datatypes:
boolean
Simple boolean value
float
Simple floating point number
integer
Simple integer number
string
Simple string
toggle
Simple boolean switch for controlling other parameters
array
List of either integer or floating point numbers
range
Range of sequence positions
regexp
Regular expression pattern
pattern
Sequence pattern
The input datatypes (Section A.2.2, “Description of Input ACD Datatypes”) cater for input of sequences, sequence features, files and directories, inputs specific to the phylipnew EMBASSY package, data files and other files of biological data.
Input datatypes for handling biological sequences include:
sequence
A single sequence for reading
seqall
A set of single sequences that are addressed one after another
seqset
A set of single sequences that can be used all at the same time
seqsetall
One or more sets of single sequences that can be used all at the same time
The data value in all cases is the Uniform Sequence Address (USA) of an input sequence stream. The USA specification is described in the EMBOSS Users Guide.
There is a single datatype for handling biological sequence features input:
features
Sequence feature annotation in any supported feature format
The data value in all cases is the Uniform Feature Object (UFO) of an input sequence feature stream. The UFO specification is described in the EMBOSS Users Guide.
Input datatypes for handling general files and directories of files include:
directory
A directory that can be used for input or output
dirlist
A list of file names that are read from a directory
filelist
A list of input files
infile
Non-sequence-related input file
In all cases, the data value is any valid file or directory name.
Input datatypes for handling data files include:
datafile
A formatted data file read from the standard EMBOSS data search path
matrix
Comparison matrix file (integer values)
matrixf
Comparison matrix file (floating point values)
In all cases, the data value is the name of a file in the EMBOSS data search path (see the EMBOSS Users Guide).
Typically where a comparison matrix is specified, gap penalties will also be required. These must be specified separately in one or more other data definitions.
Input datatypes specific to the phylipnew EMBASSY package are given below. They allow EMBOSS to provide detailed type checking, and to automatically detect and validate the various alternative formats that phylip supports, without the need for complex extra command line options.
discretestates
Discrete states file
distances
Distance matrix
frequencies
Frequency value(s)
properties
Property value(s)
tree
Phylogenetic tree
In all cases, the data value is any valid file name.
Other biological input datatypes include:
codon
Codon usage table file
cpdb
Protein coordinate data in a simple file format (clean coordinate file)
scop
SCOP and CATH domain classification information in a simple file format (domain classification file)
In all cases, the data value is any valid file name.
The output datatypes cater for output of sequences, sequence features, alignments, files and directories, outputs specific to the phylipnew EMBASSY package, data files, other files of biological data and formatted application output files (reports). See Section A.2.3, “Description of Output ACD Datatypes”.
Output datatypes for handling biological sequences include:
seqout
Output file for single sequence
seqoutall
Output file for multiple sequences
seqoutset
A set of single sequences stored in memory together, to be written to file
The behaviour of these datatypes is identical but they are provided for consistency with the input sequence datatypes (see above).
The data value in all cases is the Uniform Sequence Address (USA) of an output sequence stream (see the EMBOSS Users Guide).
FASTA format is used by default for the output sequence(s). The format is normally set at the command line but a default may be hard-coded with osformat: (see Section 4.3.6.3.2, “Feature Output”).
There is a single output datatype for handling biological sequence features:
featout
Output file for sequence feature annotation
The data value is the Uniform Feature Object (UFO) of an output sequence feature stream (see the EMBOSS Users Guide).
There is a single output datatype for handling alignments:
align
Output file for sequence alignments
The data value is any valid file name.
Output datatypes for handling general files and directories of files include:
outdir
Output directory for writing of multiple output files
outfile
General output file
outfileall
Multiple general output files
outfile and outfileall are used for data not catered for by some other output ACD datatype. For example, the output file would not normally contain sequence data. They are suitable for general application output in plain text.
The data value is any valid file name.
Output datatypes for handling data files include:
outdata
Output file for data formatted cleanly as a table or list
outmatrix
Output file for integer comparison matrix data
outmatrixf
Output file for floating point comparison matrix data
In all cases the data value is any valid file name.
Output datatypes specific to the phylipnew EMBASSY package are given below. They allow EMBOSS to provide detailed type checking, and automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.
outdiscrete
Output file for phylogenetic discrete characteristics data
outdistance
Output file for phylogenetic distance matrix data
outfreq
Output file for phylogenetic character frequency data
outproperties
Output file for phylogenetic property data
outtree
Output file for phylogenetic tree data
In all cases, the data value is any valid file name.
Other biological output datatypes include:
outcodon
Output file for codon usage data
outcpdb
Output file for protein coordinate data in CCD (clean coordinate file) format
outscop
Output file for SCOP and CATH domain classification information in DCF (domain classification file) format
The data value is any valid file name.
The selection datatypes cater for menus (Section A.2.4.2, “selection”).
The user is presented with a limited list of options they can choose from. An option consists of a label and descriptive text. One or more options may be selected. Selection datatypes include:
list
A list of options (typically terse text descriptions) with text labels
selection
A list of options (typically verbose text descriptions) with automatically-generated numerical labels
The data value is one (or more) of the valid options. An option is specified by label (whether text or numerical) or by a non-ambiguous part of the descriptive text itself given after the label. If multiple selections are allowed then the user must supply a comma-separated list of options.
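As a minimal sketch of a menu defined with the list datatype, the fragment below offers two labelled options and allows exactly one to be chosen. The option name (method), the option labels and the descriptive text are hypothetical. The attributes header:, values:, delimiter: and codedelimiter: are not described in this section and are assumed here; check them against Section A.5, “Datatype-specific Attributes” before relying on them. The enclosing ACD file section is omitted for brevity.

```acd
# Hypothetical menu: each option is "label:descriptive text"
list: method [
  standard: "Y"
  default: "n"
  minimum: "1"
  maximum: "1"
  header: "Clustering methods"
  values: "n:Neighbor-joining, u:UPGMA"
  delimiter: ","
  codedelimiter: ":"
  information: "Clustering method"
]
```

With a definition like this, the user could select the second option either by its label (-method u) or by a non-ambiguous part of its descriptive text.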
The graphics datatypes cater for graphical output (Section A.2.5, “Description of Graphics ACD Datatypes”).
Graphics datatypes include:
graph
Graphical output of any general kind
xygraph
Graphical output as a simple two dimensional (2D) XY plot with the sequence along the x-axis
The data value is a supported graphics device (Section A.2.5, “Description of Graphics ACD Datatypes”).
There are three basic types of attributes that may be defined for a data definition:
Global attributes (Section A.4, “Global Attributes”)
Datatype-specific attributes (Section A.5, “Datatype-specific Attributes”)
Calculated attributes (Section A.6, “Calculated Attributes”)
Additionally, there are various command line qualifiers that are inbuilt for certain ACD datatypes (see the EMBOSS Users Guide) which may also be defined as attributes in the appropriate data definition i.e.:
Datatype-associated qualifiers defined as attributes (Section A.2, “Datatypes”)
Global attributes are available for all datatypes and can be defined in any ACD data definition.
Datatype-specific attributes, in contrast, are available (can be defined) for certain datatypes only. Each ACD datatype has its own set of specific attributes. The values of global and datatype-specific attributes are set explicitly in the ACD file.
Calculated attributes are a type of datatype-specific attribute that are assigned values after the data definition has been processed (read and validated), for example, once a sequence has been read in from file. This allows the data definition to refer to attributes whose value is not known up front but depends on the input data, and is calculated automatically during ACD file processing.
Datatype-associated qualifiers (or simply associated qualifiers) are defined for single datatypes or groups of related ACD datatypes and are normally only specified on the command line. They may, however, also be "hard coded", i.e. defined as attributes in the appropriate data definition.
Global attributes are available for all datatypes. They may be defined in any ACD data definition and their value is set explicitly in the ACD file (Section A.4, “Global Attributes”).
Global attributes are defined with a token for the attribute name and a second token for the value of the attribute. The attribute name must be followed by a colon ':' and the value should be enclosed in double quotes (a warning will be generated during ACD processing otherwise):
GlobalAttributeName: "Value"
Most global attributes have string or boolean values. The booleans have a hard-coded default value which can be overridden by "Y", "Yes", "N" or "No" (the strings are case-insensitive). All of the following attributes are therefore valid:
parameter: "YES"
parameter: "Yes"
parameter: "Y"
parameter: "NO"
parameter: "No"
parameter: "N"
The global attributes are summarised below. For convenience they are grouped by function as follows:
Parameters and qualifiers
User prompting
Datatype definition
Help information and documentation
Hints for GUIs
For use by SOAPLAB
Each application parameter, i.e. every data definition in the ACD file, can be defined via the appropriate global attribute to be one of a "parameter", "standard qualifier" or "additional qualifier" with the default of "advanced qualifier". These four types have different properties in terms of how they may be specified on the command line, how they are prompted for and the location of the help information. See Section 4.3, “Data Definition” for further information.
The attributes have boolean values and are defined as follows:
parameter: "Boolean" ("N")
standard: "Boolean" ("N")
additional: "Boolean" ("N")
The default value is "N"; however, this is never specified explicitly in the ACD file (see Section 4.3, “Data Definition”).
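The four option types can be sketched as follows. This is an illustrative fragment only: the option names (wordsize, threshold, showall) and their prompt strings are hypothetical, and the enclosing ACD file sections are omitted for brevity.

```acd
# Parameter
sequence: asequence [
  parameter: "Y"
  type: "protein"
]

# Standard qualifier
integer: wordsize [
  standard: "Y"
  default: "4"
  information: "Word size"
]

# Additional qualifier
float: threshold [
  additional: "Y"
  default: "5.0"
  information: "Threshold value"
]

# Advanced qualifier: none of parameter:, standard: or additional: is set
boolean: showall [
  default: "N"
  information: "Show all results"
]
```

Note that "N" is never written out for the three attributes; leaving them undefined is what makes the last definition an advanced qualifier.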
Reports. The report datatype, where used, is typically a primary output of an application and, as such, should be defined as a parameter using the parameter: attribute. The first report file must be defined as parameter: (an error will be generated during ACD processing otherwise). It is recommended that subsequent report definitions (second, third and so on) are also defined as parameters (a warning will be generated during ACD processing if they are not). The exception is if the default: or nullok: attributes are set, in which case no warning or error messages are generated as the application can run with a default or without any value for the definition respectively.
Sequences. Sequence inputs (sequence, seqall, seqsetall or seqset ACD datatypes) and outputs (seqout, seqoutall and seqoutset ACD datatypes) are typically the primary input or output of an EMBOSS application, and as such should be defined as parameters in the same way as for the report datatype above.
Features. Sequence features, where used, are typically the primary input (features ACD datatype) or output (featout ACD datatype) and, as such, should be defined as parameters in the same way as for the report datatype.
Alignments. The align datatype, where used, is typically the primary output and should be defined as a parameter. The first and subsequent alignment outputs should be defined as parameter: types (a warning will be generated during ACD processing if they are not). The exception is if the default: or nullok: attributes are set, in which case no warning or error messages are generated as the application can run with a default or without any value for the definition respectively.
Files. File datatypes, where they are used, are typically the primary input (infile, filelist, directory or dirlist ACD datatypes) or output (outdir ACD datatype) and should be defined as parameters in the same way as for the report datatype described above.
The attributes below are used to provide the text that will be used on the screen to prompt the user for values. Whether a prompt will appear at all depends on whether the option was defined as a parameter or one of the qualifier types (see above) and on the command used to invoke the application (i.e. what options were specified). In some cases, the prompt may or may not appear depending on the value of other inputs. For more information see Section 4.5, “Controlling the Prompt”.
The attributes are defined as follows (the strings are empty by default):
information: "String" ("")
code: "String" ("")
prompt: "String" ("")
Only one of code:, prompt: or information: should ever be defined. The use of information: (with a standard name, see below) is preferred. In practice, prompt: is only ever required in the rare cases where the information: string might be misleading.
To provide standard prompts, a default value for the information: string is defined for most common datatypes (see Section 4.3.5.2.1, “Standard Prompts File (codes.english)”). The standard practice is, where possible, to use the default prompt for all input and output ACD datatypes. A warning will be generated during ACD processing if either the information: attribute is missing or if the value used is not the standard value (where a standard value is available).
If a non-standard prompt is used, the text given after the information: attribute should conform to certain conventions (Section A.4.4.1, “information: "String" ("")”).
In the example below a non-standard prompt is defined for asequence. A warning will be generated if you try to run this:
sequence: asequence [
  standard: "Y"
  information: "Enter filename"
]
In the next example, no information: string is specified, therefore the standard prompt of "Read sequence from" (from codes.english) would be used instead:
sequence: asequence [
  standard: "Y"
]
To provide standard prompts, a default value for the information: string is defined for most common datatypes. The defaults are in the EMBOSS system file:
codes.english
in the application ACD file directory, e.g.:
.../emboss/emboss/emboss/acd
A default has the name DEFDatatypeName where DatatypeName is the name of the ACD datatype. The file also contains some additional standard prompts for specific instances of individual datatypes, identified by a code (not beginning with DEF). These prompts can be used with the code: attribute, for example code: "GAP".
An excerpt of the file is shown below:
#
# This is the EMBOSS coded prompts file. Any messages found here will
# override those in the .acd file, and can be translated into other
# languages. The file extension (default "english") is set by the
# variable "emboss_language"
#
# Messages are referred to by .acd files as code: NAME
#
# DEFXXXX codes are automatically searched for ACD type XXXX if there
# is no information (or prompt) defined in the ACD file of as a
# standard prompt for the ACD type.
#
# HELPXXXX codes are automatically searched for ACD type XXXX if there
# is no help text in the ACD file.
#
# Default prompts: these are used where no prompt for a data type
# has been provided.
DEFALIGN "Write output alignment to"
DEFREPORT "Write output report to"
DEFINTEGER "Enter a number"
DEFFLOAT "Enter a number"
DEFBOOL "Yes or No"
.
.
.
# Gap penalties
GAP "What gap penalty"
GAPEXT "What gap extension penalty"
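As an illustrative sketch, a data definition can pick up one of the coded prompts in the excerpt by naming it with code: instead of supplying its own information: string. The option name (gapopen) and default value here are hypothetical, and the enclosing ACD file section is omitted.

```acd
# Uses the coded prompt GAP ("What gap penalty") from codes.english
float: gapopen [
  additional: "Y"
  default: "10.0"
  code: "GAP"
]
```

Neither information: nor prompt: is defined alongside code:, in line with the rule that only one of the three should ever be given.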
Five global attributes are associated with, or describe, the data to which the ACD definition refers. For example, a parameter can be given a default value, be assigned to a known type taken from a controlled vocabulary, or its relations to other parameters described. The attributes are defined as follows:
knowntype: "String" ("")
default: "Value" ("")
relations: "String" ("")
outputmodifier: "Boolean" ("N")
missing: "Boolean" ("N")
knowntype: is a string describing the type of a data definition and should be defined where the type is not already clear from the datatype itself. It is typically defined for string:, infile:, outfile: and outfileall: datatypes but not, for example, for a sequence:.
Valid known type strings are listed in the file knowntypes.standard (Section 4.3.5.3.1, “Application Data Known Types File (knowntypes.standard)”). A few other values are accepted, for example "ApplicationName output" for an outfile datatype. These are documented with the datatypes (Section A.2, “Datatypes”).
If a value is given that is not a standard known type or other accepted value, a warning message will be generated during ACD processing. The acdvalid utility will check all knowntype: values in an ACD file, and report any missing values for data definitions that require a known type.
The standard values (known types) are read from the EMBOSS system file:
knowntypes.standard
which can be found in the application ACD file directory:
.../emboss/emboss/emboss/acd
An excerpt of this file is shown below:
# Known Types
# Knowntype_string          Type  Comment
aaindex_data                file  AAINDEX entry
aaindex_database            file  AAINDEX database
abi_trace                   file  ABI sequencing trace
ajint_ajlong_data           file  Standard format info on ajint and ajlong
alistat_input               file  HMMER alistat program input
alistat_output              file  HMMER alistat program output
amino_acid_classification   file  Amino acid chemical classes data
.
.
.
Every EMBOSS application accepts the global qualifiers -help and -verbose, the latter used in combination with the former (if at all). When an application is run with -help, helpful information is printed to the screen, including all the program parameters and qualifiers with explanatory text alongside.
The attributes below define the text that's used:
help: "String" ("")
valid: "String" ("")
expected: "String" ("")
help: is usually only defined if a deeper explanation of an application parameter is needed. If help: is not defined, the value of the information: attribute (if available) or the default help string (see below) will be used instead. valid: and expected: are used to describe the allowed and expected values for the online documentation. They are not usually required as, in most cases, a reasonable value is generated automatically.
An example where help: is helpful:
integer: window [
  standard: "N"
  default: "10"
  minimum: "5"
  maximum: "100"
  information: "Window size"
  help: "Number of residues used to calculate the value for each point"
]
The help: string must conform to certain conventions (Section A.4.6.1, “help: "String" ("")”).
Default help strings are given for each datatype in the EMBOSS system file codes.english (see Section 4.3.5.2.1, “Standard Prompts File (codes.english)” above). They have the general form:
HELPDatatypeName
where DatatypeName is the name of the ACD datatype.
For example:
HELPSEQUENCE "Sequence USA"
HELPDIRECTORY "Directory name"
The needed: global attribute indicates whether a parameter is expected to be included in a GUI form. It is used to help GUI developers and more such attributes may be added in the future as support for external developers is improved. Certain datatype-specific attributes, for example button: which can be defined for selection lists, are also used to provide such hints. See Section A.5, “Datatype-specific Attributes”.
The attribute is defined as follows:
needed: "Boolean" ("Y")
The following attributes provide metadata for the SOAPLAB project. They are not used by EMBOSS directly and should not appear in standard EMBOSS application ACD files.
The attributes are defined as follows:
qualifier: "String" ("")
template: "String" ("")
comment: "String" ("")
Datatype-specific attributes given within an ACD data definition define the characteristics of application options. They control such things as:
The name and location of input and output files
Detailed type-checking of inputs to ensure data is as required by the application
Exactly what data is generated on output
The appearance and behaviour of the user interface at the command line
Validation of user input, e.g. to ensure user input is within permissible ranges
Validation of application output, e.g. to ensure it is as expected
Suggestions of graphical aspects for other interfaces to EMBOSS applications
Datatype-specific attributes are defined for individual datatypes or for groups of related datatypes. The key attributes for each datatype are summarised below, organised by ACD datatype grouping ("simple", "input", "output" etc). For descriptions of all the available attributes see Section A.5, “Datatype-specific Attributes”.
These are used to validate user input.
The validation possible depends on the available attributes and therefore the datatype in question.
minimum: and maximum: can be defined for several datatypes and restrict the value within a given range. Where minimum: and maximum: attributes have calculated values (Section 4.4.4, “Calculations and Tests”) it is theoretically possible for the maximum to be less than the minimum. In such cases either the maximum or minimum might be required, depending on the application in question. The following attributes are used:
trueminimum: Boolean value; if "Y" the minimum value is used if the minimum and maximum values overlap.
failrange: Boolean value; if "Y" the application fails if the calculated ranges overlap.
rangemessage: String value; failure message to use if calculated ranges overlap.
An ACD file with a calculated range requires the failrange: attribute to be specified or will yield a "failrange is required" warning message. If you set failrange: "Y" you need to define a message explaining to the end user why the range might fail. If you set failrange: "N" the calculated range is accepted, but you also need to set trueminimum: to say whether you want the minimum value to apply (usually to avoid getting negative values) or the maximum (to avoid values going too large). In both cases, warning messages are generated if the required attributes are not given.
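A calculated range with a failure message might look like the sketch below. The option names and the message text are hypothetical, and the $(asequence.length) reference is an assumption about how a calculated value is written (see Section 4.4.4, “Calculations and Tests” for the actual syntax). The enclosing ACD file sections are omitted.

```acd
sequence: asequence [
  parameter: "Y"
]

# maximum: is calculated from the input sequence, so it could fall below
# minimum: for a very short sequence; failrange: says what happens then
integer: window [
  standard: "Y"
  default: "10"
  minimum: "5"
  maximum: "$(asequence.length)"
  failrange: "Y"
  rangemessage: "Sequence is too short for the minimum window size"
  information: "Window size"
]
```

If the application should carry on rather than fail, failrange: "N" would be set instead, together with trueminimum: in place of rangemessage:.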
Attributes to control the length (minlength: and maxlength:) and case (upper: "Y" or lower: "Y") of a regular expression or sequence pattern are available for regexp and pattern datatypes.
When a string is defined, a known type for it should be specified using knowntype: (see Section 4.3.5.3.1, “Application Data Known Types File (knowntypes.standard)”): a warning message will be generated during ACD processing otherwise. A regular expression can be used to validate a string if necessary.
These are used to specify the input data and to validate user input.
sask: is available for all sequence inputs and sets the default for the -sask qualifier. If set to "Y" it specifies that a sequence begin and end position, and the reversing of a nucleotide sequence, will be prompted for.
nulldefault: overrides the default name generation and uses an empty string (no sequence input) as the default, for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with nullok: and missing:, this allows qualifiers to be null by default, and turned on from the command line.
The sequence input datatypes (sequence, seqall, seqset and seqsetall) share a common set of attributes used to define and validate the sequence input. The type of sequence can be restricted with type:, so that the program accepts, for example, only DNA sequences. The type must be a standard sequence type, for example dna, pureprotein, gaprna etc. See Section A.7, “Sequence Types”.
For all the sequence inputs (sequence, seqall, seqset and seqsetall datatypes), sequence features (Section 6.9, “Handling Features”) can be read if the features: attribute is set.
The type of features can be restricted by setting type:. For example, the program can be made to accept only DNA features. The feature type must be one of protein or nucleotide. There is a default based on the type of an input sequence (where used), but a value should be specified. If no type is specified for input features and there is no sequence input from which to take a default type, then an error will be generated during ACD processing.
Attributes of the phylipnew package datatypes provide detailed type checking and can automatically detect and validate the various alternative formats that phylipnew supports, without the need for complex extra command line options. See Section A.5, “Datatype-specific Attributes”.
length: specifies the number of property values per set (properties datatype) or the number of frequency loci / values per set (frequencies datatype).
size: specifies the number of discrete state sets (discretestates datatype), the number of frequency sets (frequencies datatype) or the number of trees (tree datatype).
The name, extension and directory or fullpath attributes are defined for several datatypes and define the name and location of the input data. In some cases a default naming scheme or a hard-coded default is available but can be overridden by using these attributes. A default value for the input file name can also be set by defining the default: global attribute. Data files, for example, often have a hard-coded filename and you are free to define this using name: or default:.
In cases where a default file naming scheme is available, but a default value is also specified in the ACD file, then the default value in the ACD file can be overridden and the naming scheme used if a null value ("") for the parameter is given on the command line. It is necessary to set the global attribute missing: for such data definitions.
nullok: is available for most input datatypes. It is a boolean attribute and specifies whether a missing input file is acceptable. If the application can accept a null value for this definition and can run without the corresponding input file, the nullok: attribute must be set to "Y". This allows a default value to be omitted or the application to be run with -noFlag on the command line (where Flag is the data definition flag) to specify that the input file is not available.
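A minimal sketch of an optional input file is shown below. The option name (tracefile) is hypothetical, and the knowntype: value is written with a space on the assumption that entries such as abi_trace in knowntypes.standard are given that way in ACD files, as "emboss keyword" is in the wossname example. The enclosing ACD file section is omitted.

```acd
# Optional input: the application can run without this file
infile: tracefile [
  additional: "Y"
  nullok: "Y"
  knowntype: "abi trace"
]
```

With this definition the user could run the application with -notracefile on the command line to state that no such file is available.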
The infile: datatype is used for general application input. The type of data can be identified using knowntype:. This allows inputs to be matched to outputs where knowntype: is also set for an outfile datatype definition. The known types are validated against a set of standard EMBOSS known types: a warning message is generated if the specified type does not match a standard name. See Section A.4, “Global Attributes”.
knowntype: should also be set on the filelist and dirlist datatypes.
These are used to specify the output file and in some cases exactly what data is generated and how it is validated.
For sequence and feature output datatypes (featout, seqout, seqoutset and seqoutall), if the name: attribute is not defined in the ACD file, it will default to the name of the first sequence that is read in (if available). This is equivalent to the calculated attribute name: of the input sequence. See Section A.6, “Calculated Attributes”.
The type of sequence can be restricted by setting type:. The application will then validate that the output is of the specified type. The type must be a standard sequence type (see Section A.7, “Sequence Types”).
If no type is specified for an output sequence, and there is no sequence input from which to take a default type, then an error will be generated during ACD processing.
Sequence features can be written alongside an output sequence (seqout, seqoutall and seqoutset datatypes) if their features: attribute is set. For all these datatypes, the default data format can be specified with osformat:, which the associated qualifier -osformat can override.
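As a sketch of how these attributes might be combined, the following output sequence definition restricts the sequence type, requests that features are written and sets a default format. The option name (outseq) and the chosen values are assumptions made for illustration.

```acd
# Hypothetical sequence output with features written alongside.
seqout: outseq [
    parameter: "Y"
    type: "nucleotide"
    features: "Y"
    # Default output format; -osformat on the command line overrides it.
    osformat: "embl"
]
```

A user could still request a different format at run time, for example with -osformat fasta.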
The type of features can be restricted by setting type:. There is a default based on the type of an input sequence (where used), but it should be specified for validation purposes. The feature type must be one of protein or nucleotide. If no type is specified for output features, and there is no sequence input from which to take a default type, then an error will be generated during ACD processing.
Any seqset or seqsetall datatype must have the aligned: attribute set: an error will be generated during ACD processing otherwise. Handling of sequence alignments is covered in detail elsewhere (Section 6.11, “Handling Alignments”).
Alignment format can be set with the aformat: attribute.
For an align data definition, minseqs: and maxseqs: set the expected minimum and maximum number of sequences. The multiple: boolean attribute should be set to "Y" if the output can contain more than one alignment from the same input.
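The alignment attributes described above could be combined as in the following sketch for a pairwise alignment output. The option name (outfile) and the particular format are assumptions; any supported alignment format could be used instead.

```acd
# Hypothetical pairwise alignment output: exactly two sequences,
# one alignment per input.
align: outfile [
    parameter: "Y"
    aformat: "pair"
    minseqs: "2"
    maxseqs: "2"
    multiple: "N"
]
```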
rformat: specifies the report format to use, which must be one of the supported report formats (see the EMBOSS Users Guide).
multiple: is a boolean attribute which should be set to "Y" if the output can contain more than one report from the same input.
type: is defined as one of "protein" or "nucleotide" where the report format is one of the standard feature table formats (see the EMBOSS Users Guide).
taglist: defines the tag / value pairs from the internal feature table to be reported in the output.
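A report definition using these attributes might look like the sketch below. The option name and the tag names (mismatch, pat) are hypothetical, and the taglist: string follows the "type:tag=Label" pattern seen in ACD files distributed with EMBOSS; treat the exact syntax as an assumption and check it against an existing ACD file.

```acd
# Hypothetical report output in table format, one report per input sequence.
report: outfile [
    parameter: "Y"
    rformat: "table"
    multiple: "Y"
    # Assumed syntax: datatype, tag name and column label for each tag.
    taglist: "int:mismatch=Mismatch str:pat=Pattern"
]
```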
name: and extension: are available for many datatypes. The output filename is constructed from the name: and extension: values and has the format:
name.extension
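To illustrate the name.extension construction, the following sketch hard-codes both parts of the filename for a plain output file. The option name, the file name and the known type are hypothetical.

```acd
# Hypothetical output file whose default name would be "results.txt".
outfile: outfile [
    parameter: "Y"
    name: "results"
    extension: "txt"
    # Illustrative known type only, not a standard EMBOSS name.
    knowntype: "example output"
]
```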
In some cases the output file name and extension have default values. For example, there is a choice of output formats for the outcodon datatype. The name: attribute defaults to outfile and the extension: attribute defaults to the format name, with cut defined as the default format to match the usual codon usage file-naming convention.
In cases where a default file naming scheme is available, but a default value is also specified in the ACD file, the default value can be overridden and the naming scheme used if a null value ("") for the parameter is given on the command line. It is necessary to set the global attribute missing: for such data definitions.
For some datatypes, datatype-specific command line qualifiers (see the EMBOSS Users Guide) are available and can be used to name the output file, either by specifying the qualifier on the command line or by hard-coding it as an attribute in the ACD file. For example, an alignment filename with the format aname.aextension is constructed if the qualifiers -aname and -aextension are specified. Values may be hard-coded with the corresponding aname: and aextension: attributes.
A default value for the output file name can also be set by defining the default: global attribute.
nullok: and nulldefault: are available for all output datatypes. nullok: is a boolean attribute and specifies whether a missing output file is acceptable to the application. If the application can accept a null value and can run without generating the corresponding output file, the nullok: attribute must be set to "Y". This allows a default value to be omitted or the application to be run with -noFlag (where Flag is the data definition flag) to specify that the output file is not to be generated.
nulldefault: overrides the default name generation, and uses an empty string (no output file) as the default for programs where an alignment file is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead.
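A sketch of an output that is not written unless the user asks for it is given here. The option name (logfile) and the known type are assumptions made for illustration.

```acd
# Hypothetical optional output: no file is produced by default.
outfile: logfile [
    additional: "Y"
    nullok: "Y"
    nulldefault: "Y"
    # Illustrative known type; check it against the standard list.
    knowntype: "log"
]
```

Under the behaviour described above, giving an explicit file name for the option would produce the file, while an empty string on the command line would fall back to the standard generated name.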
The outfile datatype is used for general application output. The type of data is identified using knowntype:. This allows the known type of an outfile to be matched to that of an infile data definition to identify cases where the output of one application can be used as the input to another. The known types are validated against a set of standard EMBOSS known types: a warning message is generated if the specified type does not match a standard name.
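The matching of an outfile to an infile could be sketched as follows, with one definition taken from each of two hypothetical applications. The known type string is illustrative only; real definitions would use a name from the standard EMBOSS list.

```acd
# In the ACD file of the hypothetical application that produces the data.
outfile: outfile [
    parameter: "Y"
    knowntype: "example data"
]

# In the ACD file of the hypothetical application that consumes the data.
infile: infile [
    parameter: "Y"
    knowntype: "example data"
]
```

Because both definitions carry the same known type, the output of the first application can be identified as suitable input for the second.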
Attributes for some datatypes are used to control exactly what data is generated and to validate that data. For example, an alignment output file is defined in the same way as a plain output file (outfile datatype) but has extra attributes to allow a choice of alignment formats and to specify, for validation purposes, the expected number of aligned sequences. Similarly, a report output file is defined in the same way as a plain output file (outfile) but has extra qualifiers to allow a choice of report formats.
For all of the phylip datatypes the outmatrix and outmatrixf datatypes are available. The default data format can be specified with oformat:, which the associated qualifier -oformat can override.
The default data format of codon, cpdb and scop datatypes can be specified with oformat:, which the associated qualifier -oformat can override.
These control aspects of the menus (lists of options) that the user is presented with: for example, the text appearing above the menu, how user input is validated, the minimum and maximum number of selections, and whether the options are case-sensitive.
The information: attribute, definable for all datatypes, defines text to be used as a prompt after the list.
For example, consider this ACD list definition:
list: matrix [
    default: "blosum"                # default value
    minimum: 1
    maximum: 1                       # must select exactly 1
    header: "Comparison matrices"    # printed before list
    values: "B:blosum, P:pam, I:id"  # 3 valid values
    delim: ","                       # delimiter default ";"
    codedelim: ":"                   # label delimiter default ":"
    prompt: "Select one"             # prompt after list
    button: Y                        # use radio buttons rather than checkboxes in HTML, ignored by ACD.
]
What you get on screen is:
Comparison matrices
         B : blosum
         P : pam
         I : id
Select one [blosum] : PAM
With this ACD list definition:
select: matrix [
    default: "blosum"               # default value
    minimum: "1"
    maximum: "1"                    # must select exactly 1
    header: "Comparison matrices"   # printed before list
    values: "blosum, pam, id"       # valid values
    delimiter: ","                  # delimiter default ";"
    information: "Select one"       # prompt after list
    button: "Y"                     # use radio buttons rather than checkboxes in HTML, ignored by ACD
]
You get:
Comparison matrices
         1 : blosum
         2 : pam
         3 : id
Select one [blosum] : PAM
These specify the graph data that are output.
gtitle: specifies the graph title for a graph or graphxy datatype. Many other graphical elements can be set.
multiple: specifies the number of multiple graphs in a single graph or graphxy datatype output.
The nullok: and nulldefault: attributes are available for both graphics datatypes. They have exactly the same meaning as for the output datatypes (see above).
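A graph definition with a hard-coded title might be sketched as follows. The option name and the title text are hypothetical, and only attributes named in this section are used.

```acd
# Hypothetical graph output with a preset title.
graph: graph [
    standard: "Y"
    gtitle: "Example plot"
    # The application can run without producing a graph.
    nullok: "Y"
]
```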
Various command line qualifiers are inbuilt for certain ACD datatypes. These datatype-associated qualifiers are normally specified only on the command line. They may, however, also be "hard coded" as attributes in the appropriate data definition and it is sometimes desirable to do this. Some examples for setting file format are shown below:
Qualifier     Attribute
-sformat      sformat:
-osformat     osformat:
-aformat      aformat:
-rformat      rformat:
-fformat      fformat:
-offormat     offormat:
The qualifiers are described in more detail in the EMBOSS Users Guide.
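As a sketch of hard-coding such qualifiers, the following definitions fix the feature input and feature output formats in the ACD file. The option names (ifeat, ofeat) and the choice of GFF are assumptions made for illustration.

```acd
# Hypothetical feature input read in GFF format unless -fformat is given.
features: ifeat [
    parameter: "Y"
    fformat: "gff"
]

# Hypothetical feature output written in GFF format unless -offormat is given.
featout: ofeat [
    parameter: "Y"
    offormat: "gff"
]
```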
Calculated attributes are datatype-specific attributes that are assigned values after an input file has been read during ACD file processing. The values are calculated or extracted from the actual data that an ACD data definition refers to.
Most calculated attributes are for datatypes for sequence input (sequence, seqall, seqset and seqsetall) and sequence feature input (features). For instance, for a sequence datatype, the length and type of sequence are available once the sequence file or an entry from it has been read.
For a complete description of the available calculated attributes see Section A.6, “Calculated Attributes”.
Values of calculated attributes are retrieved by an operation (Section 4.4, “Operations”) from within the ACD file. The operation uses a ParameterName.AttributeName term surrounded by parentheses with a dollar sign ($) at the front:
$(ParameterName.CalculatedAttributeName)
The $ syntax means "get the value of", in this case the term enclosed by parentheses.
This is typically done from within a data definition that relies on the value of a calculated attribute from some other data definition. For example, the following ACD file excerpt defines a sequence parameter and an integer window:
sequence: sequence [
    parameter: "Y"
    type: pureprotein
]
integer: window [
    standard: "y"
    default: "10"
    maximum: "$(sequence.length)"
]
The maximum window size (maximum attribute of the window datatype) is being set to the length of the sequence by using maximum: "$(sequence.length)".
All of the sequence input datatypes (sequence, seqall, seqset and seqsetall) have six calculated attributes:
begin      Start residue (-sbegin value)
end        End residue (-send value)
length     Length
protein    Boolean, True if sequence is protein
nucleic    Boolean, True if sequence is nucleic
name       Name
Let's assume we've defined a sequence input called asequence and have an ACD file looking something like:
sequence: asequence [
    parameter: "Y"
    type: protein
]
The attributes would be referred to as follows:
asequence.begin
asequence.end
asequence.length
asequence.protein
asequence.nucleic
asequence.name
These are properties of an input sequence that can be queried within ACD.
The values of the begin and end attributes can be set on the command line by specifying -sbegin or -send, which are inbuilt qualifiers for the sequence types.
The seqset and seqsetall datatypes have the following calculated attributes:
totweight     Total sequence weight for the set(s) of sequences
count         Number of sequences in the set(s)
seqsetall also has:
multicount    Number of sets of sequences
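As a sketch of how the count attribute might be used, the following definitions limit an integer option to the number of sequences in an input set. The option names (sequences, refseq), the sequence type and the prompt text are hypothetical.

```acd
# Hypothetical aligned sequence set input.
seqset: sequences [
    parameter: "Y"
    type: "gapany"
    aligned: "Y"
]

# The reference sequence number cannot exceed the number of sequences read.
integer: refseq [
    additional: "Y"
    minimum: "1"
    maximum: "$(sequences.count)"
    default: "1"
    information: "Number of the reference sequence"
]
```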
To retrieve calculated attribute values from the definition of asequence shown above you can use:
$(asequence.begin)
$(asequence.end)
$(asequence.length)
Example. When writing a program to insert one sequence into another, one way to make sure that the insertion position isn't greater than the length of the first sequence is to use code like the following:
if(position > ajSeqGetLen(seq))
    ajFatal("Insertion position out of bounds");
The problem is that the program can prematurely terminate after the user has gone to all the effort of configuring the application (entering all the inputs). What would be better is if the interface forced the correct input, and there is a way to achieve that by using calculated attributes in the ACD file itself.
Now, instead of hard-coding the check for the sequence insertion position you just need to add:
maximum: "$(asequence.end)"
to the integer definition of the insert position. An ACD file with two input sequences and an integer insert position might look something like:
sequence: targetsequence [
    parameter: "Y"
    type: "protein"
]
sequence: insertsequence [
    parameter: "Y"
    type: "protein"
]
integer: position [
    parameter: "Y"
    maximum: "$(targetsequence.end)"
]
You may of course retrieve the value of calculated attributes from both sequences. In the following example, a sequence window is defined which cannot be any longer than the sum of the lengths of the two sequences. A calculation is used, which therefore has to be enclosed by parentheses and started with the 'at' symbol ('@'):
integer: window [
    parameter: "Y"
    maximum: "@($(targetsequence.length) + $(insertsequence.length))"
]
These calculated attributes are also useful for conditional statements. In the example below the sequence input is of the any type. A ternary conditional (see Section 4.4, “Operations”) is being used to set a default substitution matrix based on the automatically determined type of sequence:
sequence: sequence [
    parameter: "Y"
    type: "any"
]
matrix: matrix [
    parameter: "Y"
    default: "@($(sequence.protein) ? EBLOSUM62 : EDNAFULL)"
    # etc
]
The sequence feature input datatype (features) has seven calculated attributes:
fbegin      Start of the features to be used (-fbegin value)
fend        End of the features to be used (-fend value)
flength     Total length of sequence
fprotein    Boolean, True if feature table is protein
fnucleic    Boolean, True if feature table is nucleotide
fname       Name of the feature table
fsize       Number of features
These are properties of an input feature that can be queried within ACD.
The values of the fbegin and fend attributes can be set on the command line by specifying -fbegin or -fend, which are inbuilt qualifiers for the features datatype.
Assuming you've defined a feature input called afeatures and have an ACD file looking something like:
features: afeatures [
    parameter: "Y"
]
The attributes would be referred to as follows:
afeatures.fbegin
afeatures.fend
and so on.
To retrieve calculated attribute values from your definition of afeatures above you can use:
$(afeatures.fbegin)
$(afeatures.fend)
$(afeatures.flength)
Example. When writing a program to compare the features of two sequences, which might be protein or nucleotide sequences, you need to ensure that both sequences are of the same type. One way to do this from the application code is to call an AJAX function to return the type of both, and exit if they are not the same:
if((ajFeattableIsProt(features1) && ajFeattableIsNuc(features2)) ||
   (ajFeattableIsNuc(features1) && ajFeattableIsProt(features2)))
    ajFatal("Input feature tables are not of the same type");
The problem with this code is that the application will terminate once it is running if the feature tables are not of the same type. The user, having gone to the effort of configuring the application, will not be impressed! Far better if the interface forced the correct input before the application proper started, which can be achieved by using calculated attributes in the ACD file itself.
Assume the application takes two feature table parameters as input and the ACD file therefore looks something like:
features: afeatures [ parameter: "Y" ] features: bfeatures [ parameter: "Y" ]
Instead of hard-coding the check for feature type, it can be enforced on the second of the features that is read: the first feature table can be of the any type but the second must match the first. This is achieved by declaring the type: attribute for the feature tables. The type: of the second feature table is set to that of the first by retrieving the fprotein calculated attribute of the first feature table in a ternary conditional (see Section 4.4, “Operations”):
features: afeatures [
    parameter: "Y"
    type: "any"
]
features: bfeatures [
    parameter: "Y"
    type: "@($(afeatures.fprotein) ? protein : nucleotide)"
]