When specifying values on the command line, the following rules apply:
Flags (qualifier or parameter names) can be shortened as long as they remain unambiguous.
Flags can appear in any order, although care must be taken with options of the same datatype (see Section 6.1.4.1, “Multiple Qualifiers”).
Datatype-specific qualifiers (specific for a certain datatype instance) should immediately follow an option with that datatype. In this position, these flags apply only to that option and not to all options with that datatype.
Flags must start with either the hyphen - (UNIX style) or the forward slash / (OpenVMS style), unless there is an = sign between the qualifier/parameter name and the value (SeqPup command style).
The values are separated from the qualifier/parameter name by either a space (UNIX style) or an = sign (OpenVMS or SeqPup style).
If the equal sign (=) is used to assign a value to a qualifier, the prefix hyphen (-) or forward slash (/) can be omitted (SeqPup style).
Boolean (Yes/No, True/False) options have no attached value. They are set to True by giving the qualifier/parameter name, and set to False by adding the prefix no to the name.
Values given after flags are not usually case sensitive. An obvious exception is filenames, which must match in normal UNIX style (on normal UNIX systems).
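The rules above can be sketched in Python. This is only an illustration, not the parser EMBOSS itself uses: the functions resolve_flag and parse_command_line and the qualifier names in the usage note are hypothetical, and flag names are assumed to be matched case-sensitively.

```python
def resolve_flag(name, known):
    """Expand a possibly shortened flag to the single known name it matches."""
    if name in known:
        return name
    matches = [full for full in known if full.startswith(name)]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous flag: {name!r}")
    return matches[0]


def parse_command_line(tokens, known, booleans=()):
    """Return {qualifier: value} for a list of command-line tokens."""
    values = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        i += 1
        # A flag starts with - or /, unless it is written as name=value.
        prefixed = token[:1] in ("-", "/")
        body = token[1:] if prefixed else token
        name, equals, value = body.partition("=")
        if not prefixed and not equals:
            raise ValueError(f"expected a flag, got {token!r}")
        negated = False
        try:
            full = resolve_flag(name, known)
        except ValueError:
            # A "no" prefix is only accepted on boolean qualifiers.
            if not name.startswith("no"):
                raise
            full = resolve_flag(name[2:], booleans)
            negated = True
        if full in booleans and not equals:
            values[full] = not negated   # booleans carry no attached value
        elif negated:
            raise ValueError(f"negated flag cannot take a value: {token!r}")
        elif equals:
            values[full] = value         # OpenVMS or SeqPup style
        elif i < len(tokens):
            values[full] = tokens[i]     # UNIX style: value is the next token
            i += 1
        else:
            raise ValueError(f"missing value for {full!r}")
    return values
```

With the hypothetical qualifiers sequence, frame, outseq and the boolean plot, the tokens -seq in.fa /frame=2 outseq=out.fa -noplot all resolve to their full names, and plot is set to False.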
The value that must be given depends upon the ACD datatype of the option in question (see below). For convenience, the available ACD datatypes (and hence options) are organised into five groupings, reflecting similar properties or modes of usage:
Simple Datatypes
Input Datatypes
Selection Datatypes
Output Datatypes
Graphics Datatypes
The simple datatypes include primitive types such as string and integer, and more complex datatypes such as ranges.
Primitive ACD datatypes include:
boolean
Simple boolean value
float
Simple floating point number
integer
Simple integer number
string
Simple string
toggle
Simple boolean switch for controlling other parameters
The data value is "true" or "false" and is specified as follows:
"Y" "yes" "true" "N" "no" "false"
The value will be "Y" when the parameter name is entered on the command line as a flag, for example:
-ToggleOption
If the qualifier is absent from the command line the default value is used. The flag can also be prefixed by no, for example:
-noToggleOption
to force the value to be "N". This is needed if the default value is "Y".
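As a minimal sketch of the behaviour just described, the Python below maps the accepted words to a truth value and works out the state of a flag from the command line. The names parse_boolean and flag_state are hypothetical, and accepting the words in any letter case is an assumption based on the general rule that values are not usually case sensitive.

```python
TRUE_WORDS = {"y", "yes", "true"}
FALSE_WORDS = {"n", "no", "false"}


def parse_boolean(text):
    """Map one of the accepted words to True or False."""
    word = text.strip().lower()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(f"not a boolean value: {text!r}")


def flag_state(tokens, name, default):
    """Return the value of a boolean qualifier given the command-line tokens."""
    state = default                  # used when the qualifier is absent
    for token in tokens:
        if token == "-" + name:
            state = True             # plain flag sets the value to "Y"
        elif token == "-no" + name:
            state = False            # "no" prefix forces the value to "N"
    return state
```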
The data value is any valid ASCII text string which should be enclosed in quotes. For example:
"This is a valid text string"
The data value is "true" or "false" and is specified as follows:
"Y" "yes" "true" "N" "no" "false"
Toggle parameters work in exactly the same way as boolean parameters (see above) but are used to control prompting for other parameters (turn prompting on or off). See the EMBOSS Developers Guide for further information.
Other simple datatypes include:
array
List of either integer or floating point numbers
range
Range of sequence positions
regexp
Regular expression pattern
pattern
A sequence pattern
The data value is a list of numbers separated by spaces or commas. For example:
"1 2 3 4 5" "1.5, 2.0, 2.5, 3.0"
One or more ranges may be defined on the command line or in a range file.
On the command line, a range is defined by a pair of integer numbers and multiple ranges may be given. The numbers may be delimited by any non-digit, non-alphabetic character. For example:
"24-45, 56-78" "1:45, 67=99;765..888" "1,5,8,10,23,45,57,99"
A range file contains a list of pairs of numbers with optional text comments. One pair of numbers must be given per line and the file can contain comment lines which are preceded with a '#' character. For example:
# A set of ranges in a range file.
12 23
4 5 This is an optional comment.
67 10348 Another comment.
Range files are specified on the command line by preceding the filename with @. For example, for the range file RangeFileName:
@RangeFileName
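Both ways of giving ranges can be sketched in Python as follows. The function names are hypothetical and this is not the EMBOSS implementation. In particular, the sketch treats every number as unsigned, so a hyphen is always read as a delimiter and never as a minus sign.

```python
import re


def parse_ranges(text):
    """Parse command-line ranges such as '24-45, 56-78' into (start, end) pairs."""
    if re.search(r"[A-Za-z]", text):
        raise ValueError("letters cannot be used to delimit range numbers")
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if len(numbers) % 2:
        raise ValueError("each range needs a start and an end")
    return list(zip(numbers[0::2], numbers[1::2]))


def parse_range_lines(lines):
    """Parse range-file lines into (start, end, comment) tuples."""
    ranges = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue                      # skip blank and comment lines
        fields = stripped.split(None, 2)  # start, end, optional comment
        if len(fields) < 2:
            raise ValueError(f"expected two numbers: {line!r}")
        comment = fields[2] if len(fields) > 2 else ""
        ranges.append((int(fields[0]), int(fields[1]), comment))
    return ranges


def read_ranges(value):
    """Read ranges from a range file if the value starts with @, else parse it."""
    if value.startswith("@"):
        with open(value[1:]) as handle:
            return [(start, end) for start, end, _ in parse_range_lines(handle)]
    return parse_ranges(value)
```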
In cases where the numbers are sequence positions, the upper and lower bounds will in practice depend on the length of the sequence to which they are applied. You should bear in mind that sequence positions can be negative, in which case they count back from the end of the sequence.
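Counting back from the end of the sequence can be illustrated as below. The helper resolve_position is hypothetical, and the sketch assumes that positions are 1-based and that -1 refers to the last position of the sequence. Check the behaviour of your EMBOSS version before relying on that convention.

```python
def resolve_position(position, length):
    """Convert a possibly negative 1-based position to a positive one."""
    if position == 0 or abs(position) > length:
        raise ValueError(f"position {position} is outside a sequence of {length}")
    if position > 0:
        return position
    # Assumed convention: -1 is the last position, -2 the one before it.
    return length + position + 1
```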
EMBOSS uses the "Perl-Compatible Regular Expression Library" (PCRE) release 4.3 to process regular expressions, so any regular expression that is valid in Perl 5.0 (http://search.cpan.org/~nwclark/perl-5.8.7/pod/perlre.pod) should be valid here.
The input datatypes cater for input of sequences, sequence features, files and directories, inputs specific to EMBASSY packages (e.g. phylipnew), data files and other files of biological data.
Input datatypes for handling biological sequences include:
sequence
A single sequence for reading
seqall
A set of single sequences that are addressed one after another
seqset
A set of single sequences that can be used all at the same time
seqsetall
One or more sets of single sequences that can be used all at the same time
The data value in all cases is the Uniform Sequence Address or USA (Section 6.6, “The Uniform Sequence Address (USA)”) of one or more sequences. The USA might specify a literal sequence, database reference, file or some other sequence reference.
The data value is the USA of a set of sequences to be read one at a time. For example, the USA might specify a sequence database for sequential reading of entries.
The data value is the USA of a set of single sequences. For example, a set of sequences from a multiple alignment file, or sequences from a database.
The data value is the USA of one or more sets of single sequences. For example, sets of sequences from two databases or two alignment files. The data value would typically be a listfile: a file containing a list of USAs (see Section 6.6, “The Uniform Sequence Address (USA)”).
There is a single datatype for handling biological sequence features input:
features
Sequence feature annotation in any known feature format
The data value is the name of a features file. A features file contains sequence feature information. Several feature formats are supported (Section A.2, “Supported Feature Formats”).
Input datatypes for handling general files and directories include:
directory
A directory that can be used for input or output
dirlist
A list of file names that are read from a directory
filelist
A list of input files
infile
Non-sequence-related input file
The data value is the name of any valid directory. For example:
"." "/data" "/data/sequences"
The data value is the name of any valid directory. For example:
"." "/data" "/data/sequences"
The data value is a list of file names separated by commas. For example:
"../data/file1.dat, file2.dat"
Input datatypes for handling data files include:
datafile
A formatted data file read from the standard EMBOSS data search path
matrix
Comparison matrix file (integer values)
matrixf
Comparison matrix file (floating point values)
In all cases, the data value is the name of a file in the EMBOSS data search path (Section 2.8, “Maintenance”).
Typically where a comparison matrix is specified, gap penalties will also be required. These must be specified separately in one or more other data definitions (see the EMBOSS Developers Guide). The matrix files distributed with BLAST are also distributed with EMBOSS in the EMBOSS data directory.
The data value is the name of a data file. Many data files already have their own ACD datatype, for example, matrix, matrixf and codon. Other data files do not have or need their own ACD definition and datafile is used for these.
The data value is the name of an integer comparison matrix file. Applications using integer matrices are usually faster than those using floating point matrices.
The data value is the name of a floating point comparison matrix file in the EMBOSS data search path (Section 2.8, “Maintenance”).
The matrixf datatype defines floating point matrices, which usually involve slower calculation times than integer matrices. An integer matrix file can of course also be read as floating point.
Input datatypes specific to the phylipnew EMBASSY package are given below. These provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.
discretestates
Discrete states file
distances
Distance matrix
frequencies
Frequency value(s)
properties
Property value(s)
tree
Phylogenetic tree
The data value is the name of a discrete states file and is used by the phylip "discrete character" applications.
The data value is the name of a distances file as used by the phylip "distance matrix" applications.
The data value is the name of a frequencies file as used by the phylip "gene frequency and continuous character" applications.
The data value is the name of a properties file as used by the phylip applications to define weights, ancestral states and factors (multi-state characters).
Other biological input datatypes include:
codon
Codon usage table file
cpdb
Protein coordinate data in a simple file format (clean coordinate file)
scop
SCOP and CATH domain classification information in a simple file format (domain classification file)
The data value is the name of a codon usage table file in the EMBOSS data search path (Section 2.8, “Maintenance”).
Codon usage files are distributed in the EMBOSS data directory. They are ASCII text files and can be read in several formats.
The data value is the name of a CCF file.
CCF (clean coordinate file) format is a simple "clean" file format for protein and domain coordinate data. See the documentation for pdbparse, part of the EMBASSY domainatrix package, which generates CCF files from PDB file input.
The data value is the name of a DCF file.
DCF (domain classification file) format is a simple "clean" file format for domain classification data. See the documentation for domainer, part of the EMBASSY domainatrix package, which generates DCF files from SCOP and CATH file input.
The standard IUPAC one-letter codes for the amino acids and nucleotides are used. The symbol x is used for a position where any amino acid is accepted. The symbol n is used for a position where any nucleotide is accepted.
Ambiguities are indicated by listing the acceptable amino acids or bases for a given position between square brackets [ ]. For example:
[ALT]
stands, in the case of proteins, for Ala or Leu or Thr.
Ambiguities are also indicated by listing between a pair of curly brackets { } the amino acids or bases that are not accepted at a given position. For example:
{AM}
stands, in the case of proteins, for any amino acid except Ala and Met.
Each element in a pattern is separated from its neighbor by a '-' (dash). Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range in parentheses. For example:
x(3) corresponds to x-x-x
x(2,4) corresponds to x-x or x-x-x or x-x-x-x
When a pattern is restricted to either the N- or C-terminal (5' or 3') of a sequence, that pattern starts with a '<' (reverse chevron) symbol or ends with a '>' (forward chevron) symbol, respectively. A period ends the pattern (in most cases optionally). For example:
[DE](2)HS{P}X(2)PX(2,4)C.
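One way to see how the pattern syntax works is to translate it into an ordinary regular expression. The Python sketch below does this with a hypothetical function, pattern_to_regex. It is an illustration of the syntax rules above, not the matcher EMBOSS uses, and it assumes that the dashes between elements may be omitted, as they are in the example above.

```python
import re

# One element: [ABC], {ABC} or a single letter, with an optional (n) or (n,m).
_ELEMENT = re.compile(
    r"(\[[A-Za-z]+\]|\{[A-Za-z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?"
)


def pattern_to_regex(pattern, any_symbol="x"):
    """Translate a sequence pattern into a Python regular expression string."""
    text = pattern.strip()
    if text.endswith("."):
        text = text[:-1]                  # the closing period is optional
    prefix = suffix = ""
    if text.startswith("<"):
        prefix, text = "^", text[1:]      # anchored at the N-terminal (5') end
    if text.endswith(">"):
        suffix, text = "$", text[:-1]     # anchored at the C-terminal (3') end
    parts = []
    pos = 0
    while pos < len(text):
        if text[pos] == "-":              # element separator
            pos += 1
            continue
        match = _ELEMENT.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse pattern at: {text[pos:]!r}")
        element, low, high = match.groups()
        if element.startswith("{"):
            element = "[^" + element[1:-1] + "]"   # excluded residues
        elif element.lower() == any_symbol:
            element = "."                          # any residue
        if low is not None:
            element += "{" + low + ("," + high if high else "") + "}"
        parts.append(element)
        pos = match.end()
    return prefix + "".join(parts) + suffix
```

For nucleotide patterns, pass any_symbol="n". The resulting string can be given to re.search together with an upper-case sequence.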
The output datatypes cater for output of sequences, sequence features, alignments, files and directories, outputs specific to EMBASSY packages (e.g. phylipnew), data files, other files of biological data and formatted application output files (reports).
Output datatypes for handling biological sequences include:
seqout
Output file for single sequence
seqoutall
Output file for multiple sequences
seqoutset
A set of single sequences stored in memory together, to be written to a file
The behaviour of these datatypes is identical, but they are provided for consistency with the input sequence datatypes (see above).
The data value in all cases is the USA (Section 6.6, “The Uniform Sequence Address (USA)”) of an output sequence stream. FASTA format is used by default for the output sequence(s). The format is normally set at the command line but a default may be hard-coded with osformat: in an ACD file.
The data value is a USA for a single output sequence, for example, the name of a file.
The data value is a USA for multiple output sequences, for example, the name of a file.
There is a single output datatype for handling biological sequence features:
featout
Output file for sequence feature annotation
The data value is any valid file name. The data is stored as a feature table. Most common sequence feature formats are supported (Section A.2, “Supported Feature Formats”).
GFF format is used by default for the output feature(s). The format is normally set at the command line but a default may be hard-coded in the ACD file using the offormat: attribute.
There is a single output datatype for handling alignments:
align
Output file for sequence alignments
An alignment output file is defined in the same way as a plain output file (outfile datatype) but has extra qualifiers (Section 6.4, “Datatype-specific Command Line Qualifiers”) to allow a choice of alignment formats and attributes. These can specify whether the alignment will have 2 or more sequences (which limits the possible formats).
The data value is any valid file name. The data is stored as sequences and all of the common alignment formats are supported (Section A.3, “Supported Alignment Formats”).
Output datatypes for handling general files and directories of files include:
outdir
Output directory for the writing of multiple output files
outfile
General output file
outfileall
Multiple general output files
outfile and outfileall are used for data not catered for by some other output ACD datatype. For example, the output file would not normally contain sequence data. They are suitable for general application output in plain text.
The data value is the name of any valid directory. For example:
"." "/data" "/data/sequences"
Output datatypes for handling data files include:
outdata
Output file for data formatted cleanly as a table or list
outmatrix
Output file for integer comparison matrix data
outmatrixf
Output file for floating point comparison matrix data
In all cases the data value is any valid file name.
The output corresponding to multiple outdata definitions in an ACD file is appended to a single file. The individual ACD definitions allow the format of each file section to be defined.
The data value is the name of an integer substitution matrix in the EMBOSS data search path (Section 2.8, “Maintenance”).
The data value is the name of a floating point substitution matrix in the EMBOSS data search path (Section 2.8, “Maintenance”).
Output datatypes specific to the phylipnew EMBASSY package are given below. By defining specific ACD datatypes for phylipnew, EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.
outdiscrete
Output file for phylogenetics discrete characteristics data
outdistance
Output file for phylogenetics distance matrix data
outfreq
Output file for phylogenetics character frequency data
outproperties
Output file for phylogenetics property data
outtree
Output file for phylogenetic tree data
In all cases, the data value is any valid file name.
Other biological output datatypes include:
outcodon
Output file for codon usage data
outcpdb
Output file for protein coordinate data in CCF (clean coordinate file) format
outscop
Output file for SCOP and CATH domain classification information in DCF (domain classification file) format
The data value is any valid file name.
The data value is a name for the codon usage output file.
The data is stored as a codon usage table. Codon usage table files are ASCII text files and can be written in several formats.
The data value is a name for the CCF output file.
CCF (clean coordinate file) format is a simple "clean" file format for protein and domain coordinate data. See the documentation for pdbparse, part of the EMBASSY domainatrix package, which generates CCF files from PDB file input.
The data value is a name for the DCF output file.
DCF (domain classification file) format is a simple "clean" file format for domain classification data. See the documentation for domainer, part of the EMBASSY domainatrix package, which generates DCF files from SCOP and CATH file input.
The datatype for handling formatted application output is:
report
Output file for sequence annotation
The data value is any valid file name.
Report data is stored internally as a feature table, so the available formats (Section A.4, “Supported Report Formats”) include the most common feature formats.
A report file is defined in the same way as a plain output file (outfile datatype) but has extra qualifiers (Section 6.4, “Datatype-specific Command Line Qualifiers”) to allow a choice of report formats.
Two datatypes cater for menus. In either case, you'll be presented with a limited list of options, each with a label and descriptive text, to choose from.
list
A list of options (typically terse text descriptions) with text labels
selection
A list of options (typically verbose text descriptions) with automatically-generated numerical labels
The data value is one (or more) of the valid options. An option is specified by label (whether text or numerical) or by a non-ambiguous part of the descriptive text itself given after the label. If multiple selections are allowed, you must supply a comma-separated list of options.
Here is the prompt for a list datatype:
Translation frames
  1    1
  2    2
  3    3
  F    Forward three frames
  -1   -1
  -2   -2
  -3   -3
  R    Reverse three frames
  6    All six frames
Frame(s) to translate [1]:
Assuming a single selection only is allowed, these are all valid selections:
"1" "F" "Forward" "For" "R" "Reverse" "Rev"
The graphics datatypes cater for graphical output:
graph
Graphical output of any general kind
xygraph
Graphical output as a simple two dimensional (2D) XY plot with the sequence along the x-axis
The data value is the graphics device, as limited by the PLPLOT graphics library currently used by EMBOSS. The currently supported devices include ps for PostScript, png for PNG files, and X11 for X-Windows. A value of ? in answer to the prompt will list the available graphics devices on your installation:
"ps" "png" "X11" "gif" "ps" "cps" "?"
The data value is the graphics device for a general graph. Dotplots may be generated with the graph datatype.