cons

## Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

## Function

Create a consensus sequence from a multiple alignment

## Description

cons calculates a consensus sequence from a multiple sequence alignment. To obtain the consensus, the sequence weights and a scoring matrix are used to calculate a score for each amino acid residue or nucleotide at each position in the alignment. The highest scoring residue goes into the consensus sequence if the score is higher than a user-specified "plurality" value, otherwise, there is no consensus at that position.

## Algorithm

To obtain the consensus, the sequence weights and a scoring matrix are used to calculate a score at each position in the alignment as follows. The residue (or nucleotide) i in an alignment column, is compared to all other residues (j) in the same column. The score for i is the sum over all residues j (not i=j) of the score(ij)*weight(j), where score(ij) is taken from a nucleotide or protein scoring matrix (see -datafile qualifier) and the "weight(j)" is the weighting given to the sequence j, which is given in the alignment file.

q

The highest scoring type of residue is then found in the column. If the number of "positive matches" (see below) for this residue is greater than the "plurality value" (see below), then this residue is the consensus residue. Otherwise there is no consensus for that position and an 'n' (nucleotide sequence alignment) or an 'x' (protein sequence alignment) character is written to the consensus sequence.

The positive matches for a residue i are calculated as being the sum of the corresponding sequence weights for all the residues that increase the score of residue i (i.e. that have a positive score). The "plurality" qualifier sets the cut-off for the number of positive matches (weighted) below which there is no consensus.

## Usage

Here is a sample session with cons

 ``` % cons Create a consensus sequence from a multiple alignment Input (aligned) sequence set: dna.msf output sequence [dna.fasta]: aligned.cons ```

## Command line arguments

 ```Create a consensus sequence from a multiple alignment Version: EMBOSS:6.4.0.0 Standard (Mandatory) qualifiers: [-sequence] seqset File containing a sequence alignment. [-outseq] seqout [.] Sequence filename and optional format (output USA) Additional (Optional) qualifiers: -datafile matrix [EBLOSUM62 for protein, EDNAFULL for DNA] This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic sequences). These files are found in the 'data' directory of the EMBOSS installation. -plurality float [Half the total sequence weighting] Set a cut-off for the number of positive matches below which there is no consensus. The default plurality is taken as half the total weight of all the sequences in the alignment. (Any numeric value) -identity integer [0] Provides the facility of setting the required number of identities at a site for it to give a consensus at that position. Therefore, if this is set to the number of sequences in the alignment only columns of identities contribute to the consensus. (Integer 0 or more) -setcase float [@( \$(sequence.totweight) / 2)] Sets the threshold for the positive matches above which the consensus is is upper-case and below which the consensus is in lower-case. (Any numeric value) -name string Name of the consensus sequence (Any string) Advanced (Unprompted) qualifiers: (none) Associated qualifiers: "-sequence" associated qualifiers -sbegin1 integer Start of each sequence to be used -send1 integer End of each sequence to be used -sreverse1 boolean Reverse (if DNA) -sask1 boolean Ask for begin/end/reverse -snucleotide1 boolean Sequence is nucleotide -sprotein1 boolean Sequence is protein -slower1 boolean Make lower case -supper1 boolean Make upper case -sformat1 string Input sequence format -sdbname1 string Database name -sid1 string Entryname -ufo1 string UFO features -fformat1 string Features format -fopenfile1 string Features file name "-outseq" associated qualifiers -osformat2 string Output seq format -osextension2 string File name extension -osname2 string Base file name -osdirectory2 string Output directory -osdbname2 string Database name to add -ossingle2 boolean Separate file for each entry -oufo2 string UFO features -offormat2 string Features format -ofname2 string Features file name -ofdirectory2 string Output directory General qualifiers: -auto boolean Turn off prompts -stdout boolean Write first file to standard output -filter boolean Read first file from standard input, write first file to standard output -options boolean Prompt for standard and additional values -debug boolean Write debug output to program.dbg -verbose boolean Report some/full command line options -help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose -warning boolean Report warnings -error boolean Report errors -fatal boolean Report fatal errors -die boolean Report dying program messages -version boolean Report version number and exit ```

Qualifier Type Description Allowed values Default
Standard (Mandatory) qualifiers
[-sequence]
(Parameter 1)
seqset File containing a sequence alignment. Readable set of sequences Required
[-outseq]
(Parameter 2)
seqout Sequence filename and optional format (output USA) Writeable sequence <*>.format
Additional (Optional) qualifiers
-datafile matrix This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic sequences). These files are found in the 'data' directory of the EMBOSS installation. Comparison matrix file in EMBOSS data path EBLOSUM62 for protein
EDNAFULL for DNA
-plurality float Set a cut-off for the number of positive matches below which there is no consensus. The default plurality is taken as half the total weight of all the sequences in the alignment. Any numeric value Half the total sequence weighting
-identity integer Provides the facility of setting the required number of identities at a site for it to give a consensus at that position. Therefore, if this is set to the number of sequences in the alignment only columns of identities contribute to the consensus. Integer 0 or more 0
-setcase float Sets the threshold for the positive matches above which the consensus is is upper-case and below which the consensus is in lower-case. Any numeric value @( \$(sequence.totweight) / 2)
-name string Name of the consensus sequence Any string
Advanced (Unprompted) qualifiers
(none)
Associated qualifiers
"-sequence" associated seqset qualifiers
-sbegin1
-sbegin_sequence
integer Start of each sequence to be used Any integer value 0
-send1
-send_sequence
integer End of each sequence to be used Any integer value 0
-sreverse1
-sreverse_sequence
boolean Reverse (if DNA) Boolean value Yes/No N
-sask1
-sask_sequence
boolean Ask for begin/end/reverse Boolean value Yes/No N
-snucleotide1
-snucleotide_sequence
boolean Sequence is nucleotide Boolean value Yes/No N
-sprotein1
-sprotein_sequence
boolean Sequence is protein Boolean value Yes/No N
-slower1
-slower_sequence
boolean Make lower case Boolean value Yes/No N
-supper1
-supper_sequence
boolean Make upper case Boolean value Yes/No N
-sformat1
-sformat_sequence
string Input sequence format Any string
-sdbname1
-sdbname_sequence
string Database name Any string
-sid1
-sid_sequence
string Entryname Any string
-ufo1
-ufo_sequence
string UFO features Any string
-fformat1
-fformat_sequence
string Features format Any string
-fopenfile1
-fopenfile_sequence
string Features file name Any string
"-outseq" associated seqout qualifiers
-osformat2
-osformat_outseq
string Output seq format Any string
-osextension2
-osextension_outseq
string File name extension Any string
-osname2
-osname_outseq
string Base file name Any string
-osdirectory2
-osdirectory_outseq
string Output directory Any string
-osdbname2
-osdbname_outseq
string Database name to add Any string
-ossingle2
-ossingle_outseq
boolean Separate file for each entry Boolean value Yes/No N
-oufo2
-oufo_outseq
string UFO features Any string
-offormat2
-offormat_outseq
string Features format Any string
-ofname2
-ofname_outseq
string Features file name Any string
-ofdirectory2
-ofdirectory_outseq
string Output directory Any string
General qualifiers
-auto boolean Turn off prompts Boolean value Yes/No N
-stdout boolean Write first file to standard output Boolean value Yes/No N
-filter boolean Read first file from standard input, write first file to standard output Boolean value Yes/No N
-options boolean Prompt for standard and additional values Boolean value Yes/No N
-debug boolean Write debug output to program.dbg Boolean value Yes/No N
-verbose boolean Report some/full command line options Boolean value Yes/No Y
-help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose Boolean value Yes/No N
-warning boolean Report warnings Boolean value Yes/No Y
-error boolean Report errors Boolean value Yes/No Y
-fatal boolean Report fatal errors Boolean value Yes/No Y
-die boolean Report dying program messages Boolean value Yes/No Y
-version boolean Report version number and exit Boolean value Yes/No N

## Input file format

The USA of a set of aligned sequences.

### File: dna.msf

 ```!!NA_MULTIPLE_ALIGNMENT dna.msf MSF: 120 Type: N January 01, 1776 12:00 Check: 3196 .. Name: MSFM1 Len: 120 Check: 8587 Weight: 1.00 Name: MSFM2 Len: 120 Check: 6178 Weight: 1.00 Name: MSFM3 Len: 120 Check: 8431 Weight: 1.00 // MSFM1 ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC MSFM2 ACGTACGTAC GTACGTACGT ....ACGTAC GTACGTACGT ACGTACGTAC MSFM3 ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT CGTACGTACG MSFM1 GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT MSFM2 GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT MSFM3 TACGTACGTA CGTACGTACG TACGTACGTA ACGTACGTAC GTACGTACGT MSFM1 ACGTACGTAC GTACGTACGT MSFM2 ACGTACGTTG CAACGTACGT MSFM3 ACGTACGTAC GTACGTACGT ```

## Output file format

The output consists of a sequence file holding the consensus sequence.

### File: aligned.cons

 ```>EMBOSS_001 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT ```

## Data files

cons uses the standard set of scoring matrix data files in the EMBOSS data directory.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

```% embossdata -showall
```

To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:

```
% embossdata -fetch -file Exxx.dat

```

Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".

The directories are searched in the following order:

• . (your current directory)
• .embossdata (under your current directory)
• ~/ (your home directory)
• ~/.embossdata

## Notes

The "identity" qualifier provides an additional constrain to "plurality" when determining a consensus residue at an alignment site. "identity" sets the required number of identities at a site for it to be included in the consensus. If for example this is set to the number of sequences in the alignment, then only a site with the same residue in all sequences would be included in the consensus.

The "setcase" qualifier sets the threshold for the positive matches above which the consensus residue is given is upper-case and below which it is in lower-case.

None.

None.

None.

## Exit status

It always exits with status 0.

None.

## See also

Program name Description
consambig Create an ambiguous consensus sequence from a multiple alignment
megamerger Merge two large overlapping DNA sequences
merger Merge two overlapping sequences

## Author(s)

Tim Carver formerly at:
MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.

## Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

None