6.2. Specifying Values for Application Options


6.2.1. General Rules

When specifying values on the command line, the following rules apply:

Flags (qualifier or parameter names) can be shortened as long as they remain unambiguous.
Flags can appear in any order, although care must be taken with options of the same datatype (see Section 6.1.4.1, “Multiple Qualifiers”).
Datatype-specific qualifiers (specific for a certain datatype instance) should immediately follow an option with that datatype. In this position, these flags apply only to that option and not to all options with that datatype.
Flags must start with either the hyphen - (UNIX style) or the forward slash / (OpenVMS style), unless there is an = sign between the qualifier/parameter name and the value (SeqPup command style).
The values are spaced from the qualifier/parameter name by either a space (UNIX style) or an = sign (OpenVMS or SeqPup style).
If the equal sign (=) is used to assign a value to a qualifier, the prefix hyphen (-) or forward slash (/) can be omitted (SeqPup style).
Boolean (Yes/No, True/False) options have no attached value and are set True by giving the qualifier/parameter name, and set to False by adding the prefix no to the name.
Values given after flags are not usually case sensitive. An obvious exception is filenames, which are case sensitive on UNIX systems.
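
The prefix, assignment and abbreviation rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not the EMBOSS parser itself; the function names and the qualifier names gapopen, gapextend and auto are used purely as hypothetical examples.

```python
def split_qualifier(token):
    """Split one qualifier token into (name, value); value is None if absent.

    Accepts -name, /name, -name=value, /name=value and name=value.
    A bare word with neither prefix nor '=' is rejected.
    """
    name, sep, value = token.partition("=")
    if name.startswith(("-", "/")):
        name = name[1:]          # drop the UNIX or OpenVMS style prefix
    elif not sep:
        raise ValueError(f"not a qualifier: {token!r}")
    return name, (value if sep else None)


def expand_name(abbrev, known_names):
    """Expand a shortened qualifier name if it matches exactly one known name."""
    if abbrev in known_names:
        return abbrev
    matches = [name for name in known_names if name.startswith(abbrev)]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous qualifier: {abbrev!r}")
    return matches[0]


# Hypothetical qualifier names, for illustration only.
KNOWN = ["gapopen", "gapextend", "auto"]
print(split_qualifier("-gapopen=10"))
print(split_qualifier("gapopen=10"))
print(expand_name("gapo", KNOWN))
```

Here "gapo" expands to a single name, whereas "gap" would be rejected because it matches two of the known names.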

The value that must be given depends upon the ACD datatype of the option in question (see below). For convenience, the available ACD datatypes (and hence options) are organised into five groupings, reflecting similar properties or modes of usage:

Simple Datatypes
Input Datatypes
Selection Datatypes
Output Datatypes
Graphics Datatypes

6.2.2. Simple ACD Datatypes

The simple datatypes include primitive types such as string and integer, and more complex datatypes such as ranges.

6.2.2.1. Primitive Datatypes

Primitive ACD datatypes include:

boolean: Simple boolean value
float: Simple floating point number
integer: Simple integer number
string: Simple string
toggle: Simple boolean switch for controlling other parameters

6.2.2.1.1. `boolean`

The data value is "true" or "false" and is specified as follows:

"Y"
"yes"
"true"
"N"
"no"
"false"

The value will be "Y" when the parameter name is entered on the command line as a flag, for example:

-ToggleOption

If the qualifier is absent from the command line the default value is used. The flag can also be prefixed by no, for example:

-noToggleOption

to force the value to be "N". This is needed if the default value is "Y".
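
As a rough sketch of the behaviour described above (not the EMBOSS implementation), the accepted value words and the effect of the no prefix could be modelled as follows. The function names are hypothetical, and the flags are assumed to have had their leading - or / already removed.

```python
TRUE_WORDS = {"y", "yes", "true"}
FALSE_WORDS = {"n", "no", "false"}


def parse_boolean(text):
    """Interpret one of the accepted value words, ignoring case."""
    word = text.strip().lower()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(f"not a boolean value: {text!r}")


def boolean_from_flags(name, flags, default):
    """Return the value of a boolean option given the bare flag names seen.

    The name alone sets True, 'no' + name sets False, and the default
    is kept when neither form is present.
    """
    value = default
    for flag in flags:
        if flag == name:
            value = True
        elif flag == "no" + name:
            value = False
    return value


print(parse_boolean("Y"))
print(boolean_from_flags("ToggleOption", ["noToggleOption"], default=True))
```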

6.2.2.1.2. `float`

The data value is any valid floating point number. For example:

"100.24"

6.2.2.1.3. `integer`

The data value is any integer value. For example:

"100"

6.2.2.1.4. `string`

The data value is any valid ASCII text string which should be enclosed in quotes. For example:

"This is a valid text string"

6.2.2.1.5. `toggle`

The data value is "true" or "false" and is specified as follows:

"Y"
"yes"
"true"
"N"
"no"
"false"

Toggle parameters work in exactly the same way as boolean parameters (see above) but are used to control prompting for other parameters (turn prompting on or off). See the EMBOSS Developers Guide for further information.

6.2.2.2. Other Simple Datatypes

Other simple datatypes include:

array: List of either integer or floating point numbers
range: Range of sequence positions
regexp: Regular expression pattern
pattern: A sequence pattern

6.2.2.2.1. `array`

The data value is a list of numbers separated by spaces or commas. For example:

"1 2 3 4 5"
"1.5, 2.0, 2.5, 3.0"

6.2.2.2.2. `range`

One or more ranges may be defined on the command line or in a range file.

On the command line, a range is defined by a pair of integer numbers and multiple ranges may be given. The numbers may be delimited by any non-digit, non-alphabetic character. For example:

"24-45, 56-78"
"1:45, 67=99;765..888"
"1,5,8,10,23,45,57,99"

A range file contains a list of pairs of numbers with optional text comments. One pair of numbers must be given per line and the file can contain comment lines which are preceded with a '#' character. For example:

# A set of ranges in a range file.
 12      23      
  4      5       This is an optional comment.
 67      10348   Another comment.
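
A minimal sketch of how such a file could be read is given below. It assumes whitespace-separated columns, as in the example above, and treats anything after the second number as the comment; the function name is hypothetical and this is not the EMBOSS reader.

```python
def parse_range_file(lines):
    """Return a list of (start, end, comment) tuples from range-file lines."""
    ranges = []
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and comment lines starting with '#'.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split(None, 2)
        if len(fields) < 2:
            raise ValueError(f"expected a pair of numbers: {line!r}")
        start, end = int(fields[0]), int(fields[1])
        comment = fields[2] if len(fields) > 2 else ""
        ranges.append((start, end, comment))
    return ranges


example = """# A set of ranges in a range file.
 12      23
  4      5       This is an optional comment.
 67      10348   Another comment.
"""
for entry in parse_range_file(example.splitlines()):
    print(entry)
```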

Range files are specified on the command line by preceding the filename with @. For example, for the range file RangeFileName:

@RangeFileName

In cases where the numbers are sequence positions, the upper and lower bounds will in practice depend on the length of the sequence to which they are applied. You should bear in mind that sequence positions can be negative, in which case they count back from the end of the sequence.
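
The command-line range strings shown earlier in this section can be split into pairs along the following lines. This is a simplified sketch with a hypothetical function name: it handles only non-negative positions, because a leading minus sign cannot be told apart from a "-" delimiter without further rules that are not described here.

```python
import re


def parse_ranges(text):
    """Split a range string into (start, end) pairs of non-negative integers.

    Any run of non-digit characters is treated as a delimiter; alphabetic
    characters are rejected because they are not valid delimiters.
    """
    if re.search(r"[A-Za-z]", text):
        raise ValueError(f"alphabetic characters are not allowed: {text!r}")
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if len(numbers) % 2 != 0:
        raise ValueError(f"odd number of positions: {text!r}")
    # Pair consecutive numbers: (1st, 2nd), (3rd, 4th), ...
    return list(zip(numbers[0::2], numbers[1::2]))


print(parse_ranges("24-45, 56-78"))
print(parse_ranges("1:45, 67=99;765..888"))
```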

6.2.2.2.3. `regexp`

EMBOSS uses the "Perl-Compatible Regular Expression Library" (PCRE) release 4.3 to process regular expressions, so any regular expression that is valid in Perl 5.0 (http://search.cpan.org/~nwclark/perl-5.8.7/pod/perlre.pod) should be valid here.

6.2.3. Input ACD Datatypes

The input datatypes cater for input of sequences, sequence features, files and directories, inputs specific to EMBASSY packages (e.g. phylipnew), data files and other files of biological data.

6.2.3.1. Sequence Input

Input datatypes for handling biological sequences include:

sequence: A single sequence for reading
seqall: A set of single sequences that are addressed one after another
seqset: A set of single sequences that can be used all at the same time
seqsetall: One or more sets of single sequences that can be used all at the same time

The data value in all cases is the Uniform Sequence Address or USA (Section 6.6, “The Uniform Sequence Address (USA)”) of one or more sequences. The USA might specify a literal sequence, database reference, file or some other sequence reference.

6.2.3.1.1. `sequence`

The data value is the USA of a single sequence.

6.2.3.1.2. `seqall`

The data value is the USA of a set of sequences to be read one at a time. For example, the USA might specify a sequence database for sequential reading of entries.

6.2.3.1.3. `seqset`

The data value is the USA of a set of single sequences. For example, a set of sequences from a multiple alignment file, or sequences from a database.

6.2.3.1.4. `seqsetall`

The data value is the USA of one or more sets of single sequences. For example, sets of sequences from two databases or two alignment files. The data value would typically be a listfile: a file containing a list of USAs (see Section 6.6, “The Uniform Sequence Address (USA)”).

6.2.3.2. Feature Input

There is a single datatype for handling biological sequence features input:

features: Sequence feature annotation in any known feature format

6.2.3.2.1. `features`

The data value is the name of a features file. A features file contains sequence feature information. Several feature formats are supported (Section A.2, “Supported Feature Formats”).

6.2.3.3. Files and Directories

Input datatypes for handling general files and directories include:

directory: A directory that can be used for input or output
dirlist: A list of file names that are read from a directory
filelist: A list of input files
infile: Non-sequence-related input file

6.2.3.3.1. `directory`

The data value is the name of any valid directory. For example:

"."
"/data"
"/data/sequences"

6.2.3.3.2. `dirlist`

The data value is the name of any valid directory. For example:

"."
"/data"
"/data/sequences"

6.2.3.3.3. `filelist`

The data value is a list of file names separated by commas. For example:

"../data/file1.dat, file2.dat"

6.2.3.3.4. `infile`

The data value is the name of an input file. For example:

"data.in"
"/data/infile.1"

infile is used for files of data not catered for by some other ACD datatype. For example, an infile would not normally contain sequence data.

6.2.3.4. Data Files

Input datatypes for handling data files include:

datafile: A formatted data file read from the standard EMBOSS data search path
matrix: Comparison matrix file (integer values)
matrixf: Comparison matrix file (floating point values)

In all cases, the data value is the name of a file in the EMBOSS data search path (Section 2.8, “Maintenance”).

Typically where a comparison matrix is specified, gap penalties will also be required. These must be specified separately in one or more other data definitions (see the EMBOSS Developers Guide). The matrix files distributed with BLAST are also distributed with EMBOSS in the EMBOSS data directory.

6.2.3.4.1. `datafile`

The data value is the name of a data file. Many data files already have their own ACD datatype, for example, matrix, matrixf and codon. Other data files do not have or need their own ACD definition and datafile is used for these.

6.2.3.4.2. `matrix`

The data value is the name of an integer comparison matrix file. Applications using integer matrices are usually faster than those using floating point matrices.

6.2.3.4.3. `matrixf`

The data value is the name of a floating point comparison matrix file in the EMBOSS data search path (Section 2.8, “Maintenance”).

The matrixf datatype defines floating point matrices, which usually involve slower calculation times than integer matrices. An integer matrix file can of course also be read as floating point.

6.2.3.5. Datatypes for phylipnew EMBASSY Package

Input datatypes specific to the phylipnew EMBASSY package are given below. These provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.

discretestates: Discrete states file
distances: Distance matrix
frequencies: Frequency value(s)
properties: Property value(s)
tree: Phylogenetic tree

6.2.3.5.1. `discretestates`

The data value is the name of a discrete states file and is used by the phylip "discrete character" applications.

6.2.3.5.2. `distances`

The data value is the name of a distances file as used by the phylip "distance matrix" applications.

6.2.3.5.3. `frequencies`

The data value is the name of a frequencies file as used by the phylip "gene frequency and continuous character" applications.

6.2.3.5.4. `properties`

The data value is the name of a properties file as used by the phylip applications to define weights, ancestral states and factors (multi-state characters).

6.2.3.5.5. `tree`

The data value is the name of a tree file and is used as input to the phylip applications to define one or more phylogenetic trees.

6.2.3.6. Other Biological Inputs

Other biological input datatypes include:

codon: Codon usage table file
cpdb: Protein coordinate data in a simple file format (clean coordinate file)
scop: SCOP and CATH domain classification information in a simple file format (domain classification file)

6.2.3.6.1. `codon`

The data value is the name of a codon usage table file in the EMBOSS data search path (Section 2.8, “Maintenance”).

Codon usage files are distributed in the EMBOSS data directory. They are ASCII text files and can be read in several formats.

6.2.3.6.2. `cpdb`

The data value is the name of a CCF file.

CCF (clean coordinate file) format is a simple "clean" file format for protein and domain coordinate data. See the documentation for pdbparse, part of the EMBASSY domainatrix package, which generates CCF files from PDB file input.

6.2.3.6.3. `scop`

The data value is the name of a DCF file.

DCF (domain classification file) format is a simple "clean" file format for domain classification data. See the documentation for domainer, part of the EMBASSY domainatrix package, which generates DCF files from SCOP and CATH file input.

6.2.3.6.4. `pattern`

The standard IUPAC one-letter codes for the amino acids and nucleotides are used. The symbol x is used for a position where any amino acid is accepted. The symbol n is used for a position where any nucleotide is accepted.

Ambiguities are indicated by listing the acceptable amino acids or bases for a given position, between square brackets [ ]. For example:

[ALT]

stands, in the case of proteins, for Ala or Leu or Thr.

Ambiguities are also indicated by listing between a pair of curly brackets { } the amino acids or bases that are not accepted at a given position. For example:

{AM}

stands, in the case of proteins, for any amino acid except Ala and Met.

Each element in a pattern is separated from its neighbor by a '-' (dash). Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parentheses. For example:

x(3) corresponds to x-x-x

x(2,4) corresponds to x-x or x-x-x or x-x-x-x

When a pattern is restricted to either the N- or C-terminal (5' or 3') of a sequence, that pattern either starts with a '<' (reverse chevron) symbol or respectively ends with a '>' (forward chevron) symbol. A period ends the pattern (in most cases optionally). For example:

[DE](2)HS{P}X(2)PX(2,4)C.
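
To make the syntax concrete, the sketch below translates a protein pattern written in this notation into a Python regular expression. It is an illustration of the rules above rather than the matcher EMBOSS uses: the function name is hypothetical, only the protein wildcard x is handled (n for nucleotides is not), and Python's re module stands in for PCRE.

```python
import re


def pattern_to_regex(pattern):
    """Translate a protein sequence pattern into a Python regex string."""
    body = pattern.strip()
    if body.endswith("."):
        body = body[:-1]                 # the closing period is optional
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "-":
            pass                         # element separator, nothing to emit
        elif ch == "<":
            out.append("^")              # anchored at the N-terminal
        elif ch == ">":
            out.append("$")              # anchored at the C-terminal
        elif ch in "xX":
            out.append(".")              # any amino acid
        elif ch == "[":
            end = body.index("]", i)
            out.append("[" + body[i + 1:end].upper() + "]")
            i = end
        elif ch == "{":
            end = body.index("}", i)
            out.append("[^" + body[i + 1:end].upper() + "]")
            i = end
        elif ch == "(":
            end = body.index(")", i)
            out.append("{" + body[i + 1:end] + "}")   # repeat count or range
            i = end
        else:
            out.append(ch.upper())       # a literal residue code
        i += 1
    return "".join(out)


regex = pattern_to_regex("[DE](2)HS{P}X(2)PX(2,4)C.")
print(regex)
print(bool(re.search(regex, "AADDHSAQQPWWC")))
```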

6.2.4. Output ACD Datatypes

The output datatypes cater for output of sequences, sequence features, alignments, files and directories, outputs specific to EMBASSY packages (e.g. phylipnew), data files, other files of biological data and formatted application output files (reports).

6.2.4.1. Sequence Output

Output datatypes for handling biological sequences include:

seqout: Output file for single sequence
seqoutall: Output file for multiple sequences
seqoutset: A set of single sequences stored in memory together, to be written to a file

The behaviour of these datatypes is identical but they are provided for consistency with the input sequence datatypes (see above).

The data value in all cases is the USA (Section 6.6, “The Uniform Sequence Address (USA)”) of an output sequence stream. FASTA format is used by default for the output sequence(s). The format is normally set at the command line but a default may be hard-coded with osformat: in an ACD file.

6.2.4.1.1. `seqout`

The data value is a USA for a single output sequence, for example, the name of a file.

6.2.4.1.2. `seqoutall`

The data value is a USA for multiple output sequences, for example, the name of a file.

6.2.4.1.3. `seqoutset`

The data value is a USA for multiple output sequences stored as a set in memory together, to be written to file.

6.2.4.2. Features

There is a single output datatype for handling biological sequence features:

featout: Output file for sequence feature annotation

6.2.4.2.1. `featout`

The data value is any valid file name. The data is stored as a feature table. Most common sequence feature formats are supported (Section A.2, “Supported Feature Formats”).

GFF format is used by default for the output feature(s). The format is normally set at the command line but a default may be hard-coded in the ACD file using the offormat: attribute.

6.2.4.3. Alignments

There is a single output datatype for handling alignments:

align: Output file for sequence alignments

6.2.4.3.1. `align`

An alignment output file is defined in the same way as a plain output file (outfile datatype) but has extra qualifiers (Section 6.4, “Datatype-specific Command Line Qualifiers”) to allow a choice of alignment formats and attributes. These can specify whether the alignment will have 2 or more sequences (which limits the possible formats).

The data value is any valid file name. The data is stored as sequences and all of the common alignment formats are supported (Section A.3, “Supported Alignment Formats”).

6.2.4.4. Output Files and Directories

Output datatypes for handling general files and directories of files include:

outdir: Output directory for the writing of multiple output files
outfile: General output file
outfileall: Multiple general output files

outfile and outfileall are used for data not catered for by some other output ACD datatype. For example, the output file would not normally contain sequence data. They are suitable for general application output in plain text.

6.2.4.4.1. `outdir`

The data value is the name of any valid directory. For example:

"."
"/data"
"/data/sequences"

6.2.4.4.2. `outfile`

The data value is the name of an output file.

6.2.4.4.3. `outfileall`

The data value is the base file name for multiple output files.

6.2.4.5. Output Data Files

Output datatypes for handling data files include:

outdata: Output file for data formatted cleanly as a table or list
outmatrix: Output file for integer comparison matrix data
outmatrixf: Output file for floating point comparison matrix data

In all cases the data value is any valid file name.

6.2.4.5.1. `outdata`

The output corresponding to multiple outdata definitions in an ACD file is appended to a single file. The individual ACD definitions allow the format of each file section to be defined.

6.2.4.5.2. `outmatrix`

The data value is the name of an integer substitution matrix in the EMBOSS data search path (Section 2.8, “Maintenance”).

6.2.4.5.3. `outmatrixf`

The data value is the name of a floating point substitution matrix in the EMBOSS data search path (Section 2.8, “Maintenance”).

6.2.4.6. Datatypes for phylipnew EMBASSY Package

Output datatypes specific to the phylipnew EMBASSY package are given below. By defining specific ACD datatypes for phylipnew, EMBOSS can provide detailed type checking and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.

outdiscrete: Output file for phylogenetics discrete characteristics data
outdistance: Output file for phylogenetics distance matrix data
outfreq: Output file for phylogenetics character frequency data
outproperties: Output file for phylogenetics property data
outtree: Output file for phylogenetic tree data

In all cases, the data value is any valid file name.

6.2.4.6.1. `outdiscrete`

The data value is a name for the discrete states output file.

6.2.4.6.2. `outdistance`

The data value is a name for the distances output file.

6.2.4.6.3. `outfreq`

The data value is a name for the frequencies output file.

6.2.4.6.4. `outproperties`

The data value is a name for the properties output file.

6.2.4.6.5. `outtree`

The data value is a name for the tree output file.

6.2.4.7. Other Biological Outputs

Other biological output datatypes include:

outcodon: Output file for codon usage data
outcpdb: Output file for protein coordinate data in CCF (clean coordinate file) format
outscop: Output file for SCOP and CATH domain classification information in DCF (domain classification file) format

The data value is any valid file name.

6.2.4.7.1. `outcodon`

The data value is a name for the codon usage output file.

The data is stored as a codon usage table. Codon usage table files are ASCII text files and can be written in several formats.

6.2.4.7.2. `outcpdb`

The data value is a name for the CCF output file.

6.2.4.7.3. `outscop`

The data value is a name for the DCF output file.

6.2.4.8. Report Output

The datatype for handling formatted application output is:

report: Output file for sequence annotation

6.2.4.8.1. `report`

The data value is any valid file name.

Report data is stored internally as a feature table, so the available formats (Section A.4, “Supported Report Formats”) include the most common feature formats.

A report file is defined in the same way as a plain output file (outfile datatype) but has extra qualifiers (Section 6.4, “Datatype-specific Command Line Qualifiers”) to allow a choice of report formats.

6.2.5. Selection ACD Datatypes

Two datatypes cater for menus. In either case, you'll be presented with a limited list of options, each with a label and descriptive text, to choose from.

list: A list of options (typically terse text descriptions) with text labels
selection: A list of options (typically verbose text descriptions) with automatically-generated numerical labels

The data value is one (or more) of the valid options. An option is specified by label (whether text or numerical) or by a non-ambiguous part of the descriptive text itself given after the label. If multiple selections are allowed, you must supply a comma-separated list of options.
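
The matching rule can be sketched as follows, using a subset of the Translation frames menu shown in the list section below. The function names are hypothetical, and "a non-ambiguous part of the descriptive text" is interpreted here, as an assumption, to mean a leading part (prefix) matched without regard to case.

```python
def choose(options, answer):
    """Resolve one answer to an option label.

    options is a list of (label, description) pairs. An exact label match
    wins; otherwise the answer must be a prefix of exactly one description.
    """
    wanted = answer.strip().lower()
    for label, _description in options:
        if label.lower() == wanted:
            return label
    hits = [label for label, description in options
            if description.lower().startswith(wanted)]
    if len(hits) != 1:
        raise ValueError(f"unknown or ambiguous option: {answer!r}")
    return hits[0]


def choose_many(options, answer):
    """Resolve a comma-separated list of answers to option labels."""
    return [choose(options, part) for part in answer.split(",")]


# A subset of the menu entries, for illustration only.
FRAMES = [
    ("1", "1"),
    ("2", "2"),
    ("3", "3"),
    ("F", "Forward three frames"),
    ("R", "Reverse three frames"),
]
print(choose(FRAMES, "For"))
print(choose_many(FRAMES, "1, Rev"))
```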

6.2.5.1. `list`

Here is the prompt for a list datatype:

Translation frames

   1     1
   2     2
   3     3
   F     Forward three frames
  -1    -1
  -2    -2
  -3    -3
   R     Reverse three frames
   6     All six frames

Frame(s) to translate[1]:

Assuming a single selection only is allowed, these are all valid selections:

"1"
"F"
"Forward"
"For"
"R"
"Reverse"
"Rev"

6.2.5.2. `selection`

Here is the prompt for a selection datatype:

Directories to ignore
1        None
2        AAINDEX
3        CVS
4        CODONS
5        PRINTS
6        PROSITE
7        REBASE

Select directories[3, 5, 6]:

Assuming multiple selections are allowed then here are some valid selections:

"3,5,6"
"3"
"CVS"
"5"
"PRINTS"
"PRI"

6.2.6. Graphics ACD Datatypes

The graphics datatypes cater for graphical output:

graph: Graphical output of any general kind
xygraph: Graphical output as a simple two dimensional (2D) XY plot with the sequence along the x-axis

The data value is the graphics device, as limited by the PLPLOT graphics library currently used by EMBOSS. The currently supported devices include ps for Postscript, png for PNG files, and X11 for X-Windows. A value of ? in answer to the prompt will list the available graphics devices on your installation:

"ps"
"png"
"X11"
"gif"
"cps"
"?"

6.2.6.1. `graph`

The data value is the graphics device for a general graph. Dotplots may be generated with the graph datatype.

6.2.6.2. `xygraph`

The data value is the graphics device for a 2D graph.
