A sequence format defines the permitted layout and content of text in a file. This includes text tokens that define fields used in a databank. These fields include the sequence itself, the sequence identifier name and accession number, amongst others. Non-printable control characters are not generally used, allowing most formats to be viewed on screen or printed out.
The FASTA format is a very widely used (and abused) format. It consists of a header line starting with a > character followed by a code identifying the sequence and, very often, some text describing the sequence. The header line is followed by one or more lines containing the sequence itself. FASTA files may contain one or more sequences:
>crab_anapl ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDITIHNPLIRRPLFSWLAPSRIFDQIFGEHLQESELLPASPSLSPFLMR
SPIFRMPSWLETGLSEMRLEKDKFSVNLDVKHFSPEELKVKVLGDMVEIH
GKHEERQDEHGFIAREFNRKYRIPADVDPLTITSSLSLDGVLTVSAPRKQ
SDVPERSIPITREEKPAIAGAQRK
>crab_bovin ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDIAIHHPWIRRPFFPFHSPSRLFDQFFGEHLLESDLFPASTSLSPFYLR
PPSFLRAPSWIDTGLSEMRLEKDRFSVNLDVKHFSPEELKVKVLGDVIEV
HGKHEERQDEHGFISREFHRKYRIPADVDPLAITSSLSSDGVLTVNGPRK
QASGPERTIPITREEKPAVTAAPKK
>crab_chick ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDITIHNPLVRRPLFSWLTPSRIFDQIFGEHLQESELLPTSPSLSPFLMR
SPFFRMPSWLETGLSEMRLEKDKFSVNLDVKHFSPEELKVKVLGDMIEIH
GKHEERQDEHGFIAREFSRKYRIPADVDPLTITSSLSLDGVLTVSAPRKQ
SDVPERSIPITREEKPAIAGSQRK
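As an illustration of the layout just described, the following is a minimal Python sketch of a FASTA reader. It is not how EMBOSS parses FASTA; the function name read_fasta is hypothetical, and the sketch assumes that every record starts with a header line beginning with > and that all following lines up to the next header are sequence. The example data is a truncated copy of two of the entries above.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence line belonging to the current header
    if header is not None:
        yield header, "".join(chunks)


# Truncated copy of the example entries (only the first sequence lines are kept).
example = """\
>crab_anapl ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDITIHNPLIRRPLFSWLAPSRIFDQIFGEHLQESELLPASPSLSPFLMR
SPIFRMPSWLETGLSEMRLEKDKFSVNLDVKHFSPEELKVKVLGDMVEIH
>crab_bovin ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDIAIHHPWIRRPFFPFHSPSRLFDQFFGEHLLESDLFPASTSLSPFYLR
"""

records = list(read_fasta(example.splitlines()))
for header, sequence in records:
    # The identifying code is the first word of the header line.
    print(header.split()[0], len(sequence))
```

Because the sequence is collected line by line, the same function handles files with one sequence or many.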
Beyond FASTA, the most widespread sequence formats are those used by the major sequence databases:
Sadly, sequences are occasionally stored in non-standard formats. These include proprietary word processor formats (e.g. MS Word and MS WordPad) and text formatting languages (e.g. PostScript, PDF, RTF, TeX and HTML). EMBOSS will not read a sequence in any of these formats.
If you have a sequence in a non-standard format you should:
Save the sequence to a file as plain ASCII text, without any formatting whatsoever. The file should contain the sequence only. EMBOSS will recognise this "plain" format. The program you are using to view the file should have an option to "Save as..." plain text.
If there is not an option to save your sequence in plain text format directly, there may well be a utility program to convert the file to plain text format. The EMBOSS user community will be able to help you with this (see Section 3.5, “How to Get Help”).
In future, use a text editor that can write files in plain text format. These include pico, nedit, emacs and MS WordPad. When using a text editor to create a sequence file, the best (simplest) format to use is FASTA as described above. Be sure to save your sequence as plain text.
If you intend to manipulate or edit the sequences substantially, investigate using a full-blown sequence editor such as mse. Such editors should have an option to save the sequence to a file in one or more of the standard formats.
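Before passing a converted file to an application, it can be useful to confirm that it really is plain text. The following Python sketch is one possible check and is not part of EMBOSS; the function name non_plain_characters is hypothetical, and "plain" is assumed here to mean printable ASCII plus ordinary whitespace.

```python
import string

# Printable ASCII characters plus common whitespace (space, tab, newline, ...).
ALLOWED = set(string.printable)


def non_plain_characters(text: str):
    """Return (line_number, character) pairs for anything outside printable ASCII."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        for char in line:
            if char not in ALLOWED:
                problems.append((number, char))
    return problems


clean = "acgtacgt\nacgtacgt\n"
suspect = "acgtacgt\nacgt\u00a0acgt\n"  # contains a non-breaking space on line 2

print(non_plain_characters(clean))
print(non_plain_characters(suspect))
```

An empty result suggests the text is safe to treat as plain; any reported characters are usually leftovers from a word processor.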
Some sequence formats can hold multiple sequences in one file. Typically there will be multiple entries (one per sequence) that are catenated in the file. Other formats, such as Staden, can only hold one sequence per file. An attempt to catenate several such sequences in one file would result in a mess in which it would be difficult to tell the sequences from the annotation. Most systems, including EMBOSS, will not parse such files; you should therefore never use a single-sequence format to hold multiple sequences. Sequences are also held in alignment files. These contain the results of aligning (lining up similar or equivalent characters in) two or more sequences. EMBOSS supports most common sequence alignment formats (Section A.3, “Supported Alignment Formats”).
All of the common sequence formats are supported in EMBOSS for both application input (reading) and output (writing). These are summarised below. Some support single sequences only, some multiple sequences. The names of the sequence formats are taken from common EMBOSS database configurations. Some of these are obviously synonyms e.g. "embl" and "em". In practice, the names available will depend on what's defined in your EMBOSS configuration files (see Section 2.8, “Maintenance”). For descriptions and examples of the supported formats see Section A.1, “Supported Sequence Formats”.
The supported sequence formats are summarised in the tables below. The columns are as follows: Input Format or Output Format (format name), Sngl (indicates whether each sequence is written to a new file by default; this behaviour can also be set with the -ossingle command line qualifier), Save (indicates that sequence data is stored internally and written when the output is closed; this is needed for 'interleaved' formats such as Phylip and MSF), Try (indicates whether the format can be detected automatically on input), Nuc ("Yes" indicates nucleotide sequence data may be represented), Pro ("Yes" indicates protein sequence data may be represented), Feat (whether the format includes feature annotation data; EMBOSS can also read feature data from a separate feature file), Gap (whether the format supports sequence data with gap characters, for example the results of an alignment), Mset ("Yes" indicates that more than one set of sequences can be stored in a single file; this is used by, for example, phylogenetic analysis applications to store many versions of a multiple alignment for statistical analysis) and Description (short description of the format).
Input Format | Try | Nuc | Pro | Feat | Gap | Mset | Description |
---|---|---|---|---|---|---|---|
abi | Yes | Yes | Yes | No | Yes | No | ABI trace file |
acedb | Yes | Yes | Yes | No | Yes | No | ACEDB sequence format |
clustal | Yes | Yes | Yes | No | Yes | No | Clustalw output format |
codata | Yes | Yes | Yes | Yes | Yes | No | Codata entry format |
dbid | No | Yes | Yes | No | Yes | No | Fasta format variant with database name before ID |
embl | Yes | Yes | No | Yes | Yes | No | EMBL entry format |
experiment | Yes | Yes | Yes | No | Yes | No | Staden experiment file |
fasta | Yes | Yes | Yes | No | Yes | No | FASTA format including NCBI-style IDs |
fastq | Yes | Yes | No | No | No | No | FASTQ short read format ignoring quality scores |
fastq-illumina | No | Yes | No | No | No | No | FASTQ Illumina 1.3 short read format |
fastq-sanger | No | Yes | No | No | No | No | FASTQ short read format with phred quality |
fastq-solexa | No | Yes | No | No | No | No | FASTQ Solexa/Illumina 1.0 short read format |
fitch | Yes | Yes | Yes | No | Yes | No | Fitch program format |
gcg | Yes | Yes | Yes | No | Yes | No | GCG sequence format |
genbank | Yes | Yes | No | Yes | Yes | No | Genbank entry format |
genpept | No | No | Yes | Yes | Yes | No | Refseq protein entry format (alias) |
gff2 | Yes | Yes | Yes | Yes | Yes | No | GFF feature file with sequence in the header |
gff3 | Yes | Yes | Yes | Yes | Yes | No | GFF3 feature file with sequence |
gifasta | No | Yes | Yes | No | Yes | No | FASTA format including NCBI-style GIs (alias) |
hennig86 | Yes | Yes | Yes | No | Yes | No | Hennig86 output format |
ig | No | Yes | Yes | No | Yes | No | Intelligenetics sequence format |
igstrict | Yes | Yes | Yes | No | Yes | No | Intelligenetics sequence format strict parser |
jackknifer | Yes | Yes | Yes | No | Yes | No | Jackknifer interleaved and non-interleaved formats |
mase | No | Yes | Yes | No | Yes | No | Mase program format |
mega | Yes | Yes | Yes | No | Yes | No | Mega interleaved and non-interleaved formats |
msf | Yes | Yes | Yes | No | Yes | No | GCG MSF (multiple sequence file) file format |
nbrf | Yes | Yes | Yes | Yes | Yes | No | NBRF/PIR entry format |
nexus | Yes | Yes | Yes | No | Yes | No | Nexus/paup interleaved format |
pdb | Yes | No | Yes | No | No | No | PDB protein databank format ATOM lines |
pdbnuc | No | Yes | No | No | No | No | PDB protein databank format nucleotide ATOM lines |
pdbnucseq | No | Yes | No | No | No | No | PDB protein databank format nucleotide SEQRES lines |
pdbseq | Yes | No | Yes | No | No | No | PDB protein databank format SEQRES lines |
pearson | Yes | Yes | Yes | No | Yes | No | Plain old fasta format with IDs not parsed further |
phylip | Yes | Yes | Yes | No | Yes | Yes | Phylip interleaved and non-interleaved formats |
phylipnon | No | Yes | Yes | No | Yes | Yes | Phylip non-interleaved format |
raw | Yes | Yes | Yes | No | No | No | Raw sequence with no non-sequence characters |
refseqp | No | No | Yes | Yes | Yes | No | Refseq protein entry format |
selex | No | Yes | Yes | No | Yes | No | Selex format |
staden | No | Yes | Yes | No | Yes | No | Old staden package sequence format |
stockholm | Yes | Yes | Yes | No | Yes | No | Stockholm (pfam) format |
strider | Yes | Yes | Yes | No | Yes | No | DNA strider output format |
swiss | Yes | No | Yes | Yes | Yes | No | Swissprot entry format |
text | No | Yes | Yes | No | Yes | No | Plain text |
treecon | Yes | Yes | Yes | No | Yes | No | Treecon output format |
Output Format | Sngl | Save | Nuc | Pro | Feat | Gap | Mset | Description |
---|---|---|---|---|---|---|---|---|
acedb | No | No | Yes | Yes | No | Yes | No | ACEDB sequence format |
asn1 | No | No | Yes | Yes | No | Yes | No | NCBI ASN.1 format |
clustal | No | Yes | Yes | Yes | No | Yes | No | Clustalw multiple alignment format |
codata | No | No | Yes | Yes | No | Yes | No | Codata entry format |
das | No | No | Yes | Yes | No | Yes | No | DASSEQUENCE DAS any sequence |
dasdna | No | No | Yes | No | No | Yes | No | DASDNA DAS nucleotide-only sequence |
debug | No | No | Yes | Yes | No | Yes | No | Debugging trace of full internal data content |
embl | No | No | Yes | No | Yes | Yes | No | EMBL entry format |
experiment | No | No | Yes | Yes | No | Yes | No | Staden experiment file |
fasta | No | No | Yes | Yes | No | Yes | No | FASTA format |
fastq-illumina | No | No | Yes | No | No | No | No | FASTQ Illumina 1.3 short read format |
fastq-sanger | No | No | Yes | No | No | No | No | FASTQ short read format with phred quality |
fastq-solexa | No | No | Yes | No | No | No | No | FASTQ Solexa/Illumina 1.0 short read format |
fitch | No | No | Yes | Yes | No | Yes | No | Fitch program format |
gcg | No | No | Yes | Yes | No | Yes | No | GCG sequence format |
genbank | No | No | Yes | No | No | Yes | No | Genbank entry format |
gff2 | No | No | Yes | Yes | Yes | Yes | No | GFF2 feature file with sequence in the header |
gff3 | No | No | Yes | Yes | Yes | Yes | No | GFF3 feature file with sequence in FASTA format after |
gifasta | No | No | Yes | Yes | No | Yes | No | NCBI fasta format with NCBI-style IDs using GI number |
hennig86 | No | Yes | Yes | Yes | No | Yes | No | Hennig86 output format |
ig | No | No | Yes | Yes | No | Yes | No | Intelligenetics sequence format |
jackknifer | No | Yes | Yes | Yes | No | Yes | No | Jackknifer output interleaved format |
jackknifernon | No | Yes | Yes | Yes | No | Yes | No | Jackknifer output non-interleaved format |
mase | No | No | Yes | Yes | No | Yes | No | Mase program format |
mega | No | Yes | Yes | Yes | No | Yes | No | Mega interleaved output format |
meganon | No | Yes | Yes | Yes | No | Yes | No | Mega non-interleaved output format |
msf | No | Yes | Yes | Yes | No | Yes | No | GCG MSF (multiple sequence file) file format |
nbrf | No | No | Yes | Yes | Yes | Yes | No | NBRF/PIR entry format |
ncbi | No | No | Yes | Yes | No | Yes | No | NCBI fasta format with NCBI-style IDs |
nexus | No | Yes | Yes | Yes | No | Yes | No | Nexus/paup interleaved format |
nexusnon | No | Yes | Yes | Yes | No | Yes | No | Nexus/paup non-interleaved format |
phylip | No | Yes | Yes | Yes | No | Yes | Yes | Phylip interleaved format |
phylipnon | No | Yes | Yes | Yes | No | Yes | No | Phylip non-interleaved format |
selex | No | Yes | Yes | Yes | No | Yes | No | Selex format |
staden | No | No | Yes | Yes | No | Yes | No | Old staden package sequence format |
strider | No | No | Yes | Yes | No | Yes | No | DNA strider output format |
swiss | No | No | No | Yes | Yes | Yes | No | Swissprot entry format |
text | No | No | Yes | Yes | No | Yes | No | Plain text |
treecon | No | Yes | Yes | Yes | No | Yes | No | Treecon output format |
An entry in a sequence databank will typically include a code and other information to identify the sequence, some bibliographic information, sequence annotation including a description of any features and, of course, the sequence itself.
An excerpt of the EMBL entry for a beta-glucosidase mRNA sequence is shown below:
ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   25-NOV-2005 (Rel. 85, Last updated, Version 11)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   Hughes M.A., University of Newcastle Upon Tyne, Medical School, Newcastle
RL   Upon Tyne, NE2 4HH, UK
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /organism="Trifolium repens"
FT                   /mol_type="mRNA"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT                   /tissue_type="leaves"
FT                   /db_xref="taxon:3899"
FT   CDS             14..1495
FT                   /product="beta-glucosidase"
FT                   /EC_number="3.2.1.21"
FT                   /note="non-cyanogenic"
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="HSSP:P26205"
FT                   /db_xref="InterPro:IPR001360"
FT                   /db_xref="UniProtKB/Swiss-Prot:P26204"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
     .
     .
     sequence omitted for brevity
     .
     aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc      1740
     agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac      1800
     tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa       1859
//
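Each line of an EMBL entry begins with a two-letter line code (ID, AC, DT and so on) that says what kind of record it carries. As a rough sketch of how such an entry can be taken apart, the Python function below groups line contents by line code. The name group_by_line_code is hypothetical, and the sketch assumes the usual flat-file layout in which the code occupies the first two columns, XX lines are spacers, // ends the entry, and sequence data lines start with blanks.

```python
from collections import defaultdict


def group_by_line_code(entry_text: str):
    """Map each two-letter line code to the list of its line contents."""
    records = defaultdict(list)
    for line in entry_text.splitlines():
        code, content = line[:2], line[2:].strip()
        if code in ("XX", "//") or not code.strip():
            continue  # spacer lines, the terminator, and indented sequence lines
        records[code].append(content)
    return dict(records)


# A shortened copy of the start of the entry, for illustration only.
entry = """\
ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   25-NOV-2005 (Rel. 85, Last updated, Version 11)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
//
"""

grouped = group_by_line_code(entry)
for code, contents in grouped.items():
    print(code, contents)
```

Repeated codes, such as the two DT lines, simply accumulate in the order they appear.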
An entry in a database must have some way of being uniquely identified. Most sequence databases have two such identifiers for each sequence - an ID name and an accession number.
Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the life of the database. If two sequences are merged, then the new sequence will get a new accession number and the accession numbers of the merged sequences will be retained as 'secondary' accession numbers. EMBL, GenBank and Swissprot share an accession numbering scheme - an accession number uniquely identifies a sequence within these three databases. In contrast, ID names are not guaranteed to remain the same between different versions of a database, although in practice they usually do.
Why are there two such identifiers? The ID name was originally intended to be a human-readable name that indicates the function of its sequence. In EMBL and GenBank the first two (or three) letters indicated the species and the rest indicated the function; for example, hsfau is the 'Homo Sapiens FAU pseudogene'. This naming scheme became a problem when the number of entries added each day grew so vast that people could not make up ID names fast enough. Instead, accession numbers started to be assigned as the ID name as well. You will therefore now find ID names, such as AF061303, that are the same as the accession number for that sequence in EMBL.
Most sequence formats include an identifier code in some form or another. Typically this is an accession number and/or identifier name (ID) and is given near the top of the entry. They uniquely identify an entry in the database.
For our EMBL entry, the accession number X56734 is given on the ID line and separately in the AC line:
ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
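To make the two identifier lines concrete, here is a small Python sketch that pulls the identifier from the ID line and the accession numbers from the AC line. The function names are hypothetical. The sketch assumes that identifiers are separated by semicolons and that the first accession listed on the AC line is the primary one, with any others being secondary.

```python
def parse_id_line(id_line: str) -> str:
    """Return the identifier given as the first field of an EMBL ID line."""
    fields = id_line.split()
    if len(fields) < 2 or fields[0] != "ID":
        raise ValueError("not an ID line")
    return fields[1].rstrip(";")


def parse_accessions(lines):
    """Return (primary, secondaries) collected from all AC lines in `lines`."""
    accessions = []
    for line in lines:
        if line[:2] != "AC" or not line[2:3].isspace():
            continue  # only lines carrying the AC line code are of interest
        accessions.extend(part.strip() for part in line[2:].split(";") if part.strip())
    if not accessions:
        raise ValueError("no AC lines found")
    return accessions[0], accessions[1:]


entry_lines = [
    "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.",
    "XX",
    "AC   X56734; S46826;",
    "XX",
]

entry_id = parse_id_line(entry_lines[0])
primary, secondary = parse_accessions(entry_lines)
print(entry_id, primary, secondary)
```

For this entry the ID name and the primary accession number are the same, as the text above explains is now common.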
In contrast, FASTA format often gives the ID as the first word of an informative title line:
>IDName An Informative comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt
Most sequence formats have records for bibliographic information such as literature references, experimental details, author contact information, cross-links to other databases, and much more besides. In the example below, the date of release (DT), a description (DE), keywords (KW), organism species (OS), organism classification (OC) and reference information (RN, RP, RX, RA, RT and RL) are given:
DT   12-SEP-1991 (Rel. 29, Created)
DT   25-NOV-2005 (Rel. 85, Last updated, Version 11)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   Hughes M.A., University of Newcastle Upon Tyne, Medical School, Newcastle
RL   Upon Tyne, NE2 4HH, UK
XX
Most sequence formats have records for descriptions, annotations and comments provided with the sequence. Molecular features associated with the sequence, such as protein secondary structure or molecular recognition sites, are kept in a feature table. These are marked up by FT records in the EMBL entry below.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /organism="Trifolium repens"
FT                   /mol_type="mRNA"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT                   /tissue_type="leaves"
FT                   /db_xref="taxon:3899"
FT   CDS             14..1495
FT                   /product="beta-glucosidase"
FT                   /EC_number="3.2.1.21"
FT                   /note="non-cyanogenic"
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="HSSP:P26205"
FT                   /db_xref="InterPro:IPR001360"
FT                   /db_xref="UniProtKB/Swiss-Prot:P26204"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
XX
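The structure of a feature table (a feature key with a location, followed by /qualifier=value lines that may continue over several lines) can be sketched in Python as follows. This is an illustrative reading of the layout, not EMBOSS's own feature parser, and the name parse_feature_table is hypothetical. It assumes the column layout of the EMBL flat file, where a feature key starts in column 6 and qualifier or continuation lines leave that column blank. Continuation lines are joined with a space, except for /translation, where the pieces are joined directly; a continuation line that itself begins with / would be misread as a new qualifier.

```python
def parse_feature_table(lines):
    """Collect features as dicts with 'key', 'location' and 'qualifiers'."""
    features = []
    for line in lines:
        if not line.startswith("FT"):
            continue
        if line[5:6].strip():
            # A non-blank column 6 starts a new feature: key, then location.
            parts = line[5:].split(None, 1)
            location = parts[1].strip() if len(parts) > 1 else ""
            features.append({"key": parts[0], "location": location, "qualifiers": []})
            continue
        rest = line[2:].strip()
        if not features or not rest:
            continue
        if rest.startswith("/"):
            name, _, value = rest[1:].partition("=")
            features[-1]["qualifiers"].append([name, value])
        elif features[-1]["qualifiers"]:
            qualifier = features[-1]["qualifiers"][-1]
            joiner = "" if qualifier[0] == "translation" else " "
            qualifier[1] += joiner + rest  # continuation of a qualifier value
        else:
            features[-1]["location"] += rest  # continuation of a long location
    return features


ft_lines = [
    "FT   source          1..1859",
    'FT                   /organism="Trifolium repens"',
    'FT                   /mol_type="mRNA"',
    "FT   mRNA            1..1859",
    'FT                   /experiment="experimental evidence, no additional details',
    'FT                   recorded"',
]

features = parse_feature_table(ft_lines)
for feature in features:
    print(feature["key"], feature["location"], feature["qualifiers"])
```

Qualifier values are kept exactly as written, including their quotation marks, so nothing is lost if the table is written out again.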
Further information on sequence features is available (Section A.2, “Supported Feature Formats”).
Sequences are usually represented in IUBMB standard one-letter codes (see http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html). There are exceptions, for example Staden format uses non-standard ambiguity codes. In the case of FASTA format the sequence is anything after the '>' line until the next entry starts. For other databases, records are used to delineate the sequence.
In EMBL entries, an SQ label is used to identify the sequence (the full sequence is not given):
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
     .
     .
     sequence omitted for brevity
     .
     aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc      1740
     agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac      1800
     tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa       1859
//
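The SQ line states the sequence length and base composition, and each sequence line ends with a running count of bases. A short Python sketch of reading these is given below. The function names are hypothetical, and the sketch assumes that only genuine sequence lines are passed to read_sequence_lines (the "omitted for brevity" placeholder above would not qualify).

```python
import re


def parse_sq_header(sq_line: str):
    """Return (declared_length, base_counts) from an EMBL SQ line."""
    match = re.search(r"(\d+) BP;", sq_line)
    if match is None:
        raise ValueError("no length found on SQ line")
    counts = {name: int(n) for n, name in re.findall(r"(\d+) ([A-Za-z]+);", sq_line)}
    counts.pop("BP", None)  # the length field also matches the pattern above
    return int(match.group(1)), counts


def read_sequence_lines(lines) -> str:
    """Join blocks of bases, dropping the running count at the end of each line."""
    blocks = []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[-1].isdigit():
            tokens = tokens[:-1]
        blocks.extend(tokens)
    return "".join(blocks)


sq_line = "SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;"
first_line = "     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60"

length, counts = parse_sq_header(sq_line)
fragment = read_sequence_lines([first_line])
print(length, counts, sum(counts.values()) == length)
print(len(fragment))
```

Comparing the summed base counts with the declared length is a cheap consistency check on an entry.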
Sequences are referred to on the EMBOSS command line by their Uniform Sequence Address or USA (Section 6.6, “The Uniform Sequence Address (USA)”). This is a standard sequence naming scheme used by all EMBOSS applications. A USA specifies one or more sequences that might be read from or written to a file or to a larger databank. Other sequence sources such as applications or web servers can also be specified.
There is also a set of command line qualifiers (Section 6.4, “Datatype-specific Command Line Qualifiers”) that are used for sequence input and output. These allow you to set such things as file format, sequence regions, database and entry names.
For example, the format of an output sequence may be set on the command line as follows:
seqret seq.in seq.out -osformat
... or by giving it in the USA of the output filename:
seqret seq.in embl::seq.out
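When EMBOSS is driven from a script, the two ways of naming the output format can be built up as argument lists. The Python sketch below only constructs and prints the commands; it does not run them. The helper names are hypothetical, and the format name embl passed to -osformat is an assumption carried over from the USA example above.

```python
import shlex


def format_usa(fmt: str, filename: str) -> str:
    """Prefix a filename with a format name to give a 'format::file' USA."""
    return f"{fmt}::{filename}"


def seqret_with_qualifier(infile: str, outfile: str, fmt: str):
    """Set the output format with the -osformat qualifier."""
    return ["seqret", infile, outfile, "-osformat", fmt]


def seqret_with_usa(infile: str, outfile: str, fmt: str):
    """Set the output format inside the USA of the output file."""
    return ["seqret", infile, format_usa(fmt, outfile)]


by_qualifier = seqret_with_qualifier("seq.in", "seq.out", "embl")
by_usa = seqret_with_usa("seq.in", "seq.out", "embl")

print(shlex.join(by_qualifier))
print(shlex.join(by_usa))
```

On a system with EMBOSS installed, either list could be passed to subprocess.run to execute the command.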
Most of the EMBOSS applications are for sequence manipulation. The generic sequence-handling applications are summarised in the table below.
Application | Description |
---|---|
backtranseq | Backtranslate a protein sequence |
compseq | Count composition of dimer/trimer/etc words in a sequence |
cutseq | Removes a specified section from a sequence |
degapseq | Removes gap characters from sequences |
descseq | Alter the name or description of a sequence |
diffseq | Find differences between nearly identical sequences |
extractseq | Extract regions from a sequence |
infoseq | Displays some simple information about sequences |
maskseq | Mask off regions of a sequence |
newseq | Type in a short new sequence |
notseq | Exclude a set of sequences and write out the remaining ones |
nthseq | Writes one sequence from a multiple set of sequences |
pasteseq | Insert one sequence into another |
prettyseq | Output sequence with translated ranges |
revseq | Reverse and complement a sequence |
seqmatchall | All-against-all comparison of a set of sequences |
seqret | Reads and writes (returns) sequences |
seqretsplit | Reads and writes (returns) sequences in individual files |
showseq | Display a sequence with features, translation etc |
shuffleseq | Shuffles a set of sequences maintaining composition |
skipseq | Reads and writes (returns) sequences, skipping first few |
transeq | Translate nucleic acid sequences |
trimseq | Trim ambiguous bits off the ends of sequences |