A.1. Supported Sequence Formats

A.1.1. ABI Trace

File format produced by ABI sequencing machines. It contains the 'trace data' which includes the probabilities of the four nucleotide bases along the sequencing run together with the sequence deduced from that data. The sequence information is what is normally read in and used by EMBOSS programs, although the trace data is available and may be utilised by some specialised EMBOSS programs. The code for parsing this format is heavily based on David Mathog's Fortran library, which is bundled with a description of ABI trace file format (abi.txt):

ftp://saf.bio.caltech.edu/pub/software/molbio/abitools.zip

ABI trace is a binary format so an example is not given here.

A.1.2. ACEDB

File format used by the AceDB database. AceDB is a genome database designed for the flexible handling of bioinformatic data. It includes tools designed to manipulate genomic data, but is increasingly also used for non-biological data. For further information see:

http://www.acedb.org/
DNA : "HSFAU"
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtc
gccaatatgcagctctttgtccgcgcccaggagctacacaccttcgaggt
gaccggccaggaaacggtcgcccagatcaaggctcatgtagcctcactgg
agggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctg
gaggatgaggccactctgggccagtgcggggtggaggccctgactaccct
ggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaag
aagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcg
ctttgtcaacgttgtgcccacctttggcaagaagaagggccccaatgcca
actcttaagtcttttgtaattctggctttctctaataaaaaagccactta
gttcagtcaaaaaaaaaa

A.1.3. ASN1

ASN.1, or Abstract Syntax Notation One, is an International Organization for Standardization (ISO) format used to achieve interoperability between platforms. ASN.1 is used for the storage and retrieval of data such as nucleotide and protein sequences, structures, genomes, and MEDLINE records. The EMBOSS format ASN1 is a subset of ASN.1 containing the entry name, accession number, description and sequence. It is similar to the current ASN.1 output of the readseq application. For further information see:

http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html
  seq {
    id { local id 1 },
    descr { title "" },
    inst {
      repr raw, mol aa, length 131, topology linear,
 {
      seq-data
        iupacaa "TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQATGGWKTCSGTCT
TSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAGSRPNRFAPTLMSSCITSTTGPPAWAGDRSHE"
      } } ,
  seq {
    id { local id 1 },
    descr { title "" },
    inst {
      repr raw, mol aa, length 131, topology linear,
 {
      seq-data
        iupacaa "TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQATGGWKTCSGTCT
TSTSTRHRGRSGW----------RASRKSMRAACSRSAGSRPNRFAPTLMSSCITSTTGPPAWAGDRSHE"
      } } ,
  seq {
    id { local id 1 },
    descr { title "" },
    inst {
      repr raw, mol aa, length 131, topology linear,
 {
      seq-data
        iupacaa "TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQATGGWKTCSGTCT
TSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR--GSRPPRFAPPLMSSCITSTTGPPPPAGDRSHE"
      } } ,
  seq {
    id { local id 1 },
    descr { title "" },
    inst {
      repr raw, mol aa, length 131, topology linear,
 {
      seq-data
        iupacaa "TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQATGGYKTCSGTCT
TSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR--GSRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE"
      } } ,

A.1.4. Asis

This is not a file format as such. It is a method used by EMBOSS as an alternative to specifying a filename or other sequence reference on the command line. The sequence is used directly (i.e. as-[it-]is), for example:

asis:"AALRNTY"

The method is part of the Uniform Sequence Address (USA) specification (Section 6.6, “The Uniform Sequence Address (USA)”) used to specify sequences on the command line.

asis::MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

A.1.5. Clustal

Clustal multiple sequence alignment format. This is also supported for input and output of non-aligned sequences. ClustalW is a widely used program for multiple sequence alignment. For further information see:

http://www.clustal.org/
CLUSTAL W(1.4) multiple sequence alignment


IXI_234         TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235         TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236         TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237         TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT
                                                                  

IXI_234         GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
IXI_235         GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAG
IXI_236         GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR--G
IXI_237         GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR--G
                                                                  

IXI_234         SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_235         SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_236         SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
IXI_237         SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE
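
As a rough illustration of how the interleaved blocks above fit together, the following Python sketch concatenates the chunks belonging to each sequence name. It is not the EMBOSS parser. The function name read_clustal is hypothetical, and the sketch assumes that names contain no spaces and that any line starting with whitespace is a consensus or spacer line.

```python
from collections import OrderedDict


def read_clustal(text):
    """Join the per-block chunks of each sequence in Clustal-style text.

    Hypothetical helper for illustration only, not EMBOSS code.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("missing CLUSTAL header line")
    seqs = OrderedDict()
    for line in lines[1:]:
        # Blank lines and consensus lines (leading whitespace) carry no residues.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        seqs[name] = seqs.get(name, "") + chunk
    return seqs
```

Calling read_clustal on the text of the example above would be expected to return four entries, keyed IXI_234 to IXI_237, each holding the full gapped sequence.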

A.1.6. CODATA

CODATA format used by various information systems and software tools at the Lawrence Berkeley National Laboratory (LBNL). For further information see:

http://merrill.olm.net/mdocs/seedis/codata.html

Sequence files in CODATA format may contain multiple sequences. A sequence entry begins with a line with the text ENTRY followed by the sequence ID. A line with SEQUENCE followed by a sequence numbering line marks the start of the sequence, given on the next and subsequent lines. A sequence entry ends with a line containing '///' only.

ENTRY           IXI_234 
SEQUENCE        
                 5        10        15        20        25        30
      1 T S P A S I R P P A G P S S R P A M V S S R R T R P S P P G
     31 P R R P T G R P C C S A A P R R P Q A T G G W K T C S G T C
     61 T T S T S T R H R G R S G W S A R T T T A A C L R A S R K S
     91 M R A A C S R S A G S R P N R F A P T L M S S C I T S T T G
    121 P P A W A G D R S H E
///
ENTRY           IXI_235 
SEQUENCE        
                 5        10        15        20        25        30
      1 T S P A S I R P P A G P S S R - - - - - - - - - R P S P P G
     31 P R R P T G R P C C S A A P R R P Q A T G G W K T C S G T C
     61 T T S T S T R H R G R S G W - - - - - - - - - - R A S R K S
     91 M R A A C S R S A G S R P N R F A P T L M S S C I T S T T G
    121 P P A W A G D R S H E
///
ENTRY           IXI_236 
SEQUENCE        
                 5        10        15        20        25        30
      1 T S P A S I R P P A G P S S R P A M V S S R - - R P S P P P
     31 P R R P P G R P C C S A A P P R P Q A T G G W K T C S G T C
     61 T T S T S T R H R G R S G W S A R T T T A A C L R A S R K S
     91 M R A A C S R - - G S R P P R F A P P L M S S C I T S T T G
    121 P P P P A G D R S H E
///
ENTRY           IXI_237 
SEQUENCE        
                 5        10        15        20        25        30
      1 T S P A S L R P P A G P S S R P A M V S S R R - R P S P P G
     31 P R R P T - - - - C S A A P R R P Q A T G G Y K T C S G T C
     61 T T S T S T R H R G R S G Y S A R T T T A A C L R A S R K S
     91 M R A A C S R - - G S R P N R F A P T L M S S C L T S T T G
    121 P P A Y A G D R S H E
///
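
The entry layout described above (ENTRY, SEQUENCE, a numbering line, residue lines, then '///') can be sketched in Python as follows. This is an illustrative reader only, not the code EMBOSS uses. The name read_codata is hypothetical, and the sketch assumes exactly one numbering line follows each SEQUENCE line.

```python
def read_codata(text):
    """Return (id, sequence) pairs from CODATA-style text.

    Illustrative sketch only; follows the ENTRY / SEQUENCE / '///' layout
    and assumes one numbering line after SEQUENCE.
    """
    entries = []
    seq_id, residues, state = None, [], "header"
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ENTRY"):
            parts = stripped.split()
            seq_id = parts[1] if len(parts) > 1 else ""
            residues, state = [], "header"
        elif stripped.startswith("SEQUENCE") and seq_id is not None:
            state = "numbering"
        elif stripped == "///":
            if seq_id is not None:
                entries.append((seq_id, "".join(residues)))
            seq_id, residues, state = None, [], "header"
        elif state == "numbering":
            state = "sequence"  # skip the column-numbering line
        elif state == "sequence":
            # Drop the leading position number; keep residues and gap characters.
            residues.extend(tok for tok in stripped.split() if not tok.isdigit())
    return entries
```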

A.1.7. DAS

DAS format is for output only. It conforms to the output of a Distributed Annotation System (DAS) version 1.53 annotation server. DAS is an XML format for sequence data. For more information see:

http://www.biodas.org/
<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASSEQUENCE SYSTEM "http://www.biodas.org/dtd/dassequence.dtd">
<DASSEQUENCE>
  <SEQUENCE id="X65923" start="1" stop="518"
               moltype="DNA" version="X65923.1">
      ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtc
      gccaatatgcagctctttgtccgcgcccaggagctacacaccttcgaggt
      gaccggccaggaaacggtcgcccagatcaaggctcatgtagcctcactgg
      agggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctg
      gaggatgaggccactctgggccagtgcggggtggaggccctgactaccct
      ggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
      gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaag
      aagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcg
      ctttgtcaacgttgtgcccacctttggcaagaagaagggccccaatgcca
      actcttaagtcttttgtaattctggctttctctaataaaaaagccactta
      gttcagtcaaaaaaaaaa
  </SEQUENCE>
</DASSEQUENCE>
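
Because the output is ordinary XML, it can be read back with a general-purpose XML library. The Python sketch below uses the standard xml.etree.ElementTree module. The element and attribute names are taken from the example above, and the helper name read_das_sequence is hypothetical. It is meant as a quick check on such files, not as a DAS client.

```python
import xml.etree.ElementTree as ET


def read_das_sequence(xml_text):
    """Extract id, declared length and residues from a DASSEQUENCE document.

    Sketch only: element and attribute names follow the example output;
    the helper name is hypothetical.
    """
    root = ET.fromstring(xml_text)
    seq = root.find("SEQUENCE")
    if seq is None:
        raise ValueError("no SEQUENCE element found")
    # Sequence text is wrapped over several indented lines; join them.
    residues = "".join((seq.text or "").split())
    start, stop = int(seq.get("start")), int(seq.get("stop"))
    return {
        "id": seq.get("id"),
        "moltype": seq.get("moltype"),
        "declared_length": stop - start + 1,
        "sequence": residues,
    }
```

Comparing declared_length with the length of the returned sequence is a simple way to spot a truncated file.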

A.1.8. DAS DNA

DASDNA format is for output only. It conforms to the output of a Distributed Annotation System (DAS) version 1.53 sequence server. DASDNA is an XML format for sequence data. For more information see:

http://www.biodas.org/
<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASDNA SYSTEM "http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
  <SEQUENCE id="X65923" start="1" stop="518" version="X65923.1">
    <DNA length="518">
      ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtc
      gccaatatgcagctctttgtccgcgcccaggagctacacaccttcgaggt
      gaccggccaggaaacggtcgcccagatcaaggctcatgtagcctcactgg
      agggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctg
      gaggatgaggccactctgggccagtgcggggtggaggccctgactaccct
      ggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
      gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaag
      aagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcg
      ctttgtcaacgttgtgcccacctttggcaagaagaagggccccaatgcca
      actcttaagtcttttgtaattctggctttctctaataaaaaagccactta
      gttcagtcaaaaaaaaaa
    </DNA>
  </SEQUENCE>
</DASDNA>

A.1.9. Debug

Debug format is for debugging purposes only. All elements in the internal data structures used to hold a sequence are printed out, although not all fields in the output file will contain data. The data generated depends very much on the input format used. No example is given below. For further information see the EMBOSS Developers Guide.

A.1.10. EMBL

EMBL entry format, or at least a minimal subset of the fields. EMBL is the public nucleotide sequence database and includes all publicly available DNA sequences with their annotation. EMBL is part of the International Nucleotide Sequence Database Collaboration, which comprises the European Molecular Biology Laboratory (EMBL), the DNA DataBank of Japan (DDBJ), and GenBank at the National Center for Biotechnology Information. Daily sharing of data is the basis of the collaboration. The Staden package and many others use EMBL or similar formats for sequence data.

An EMBL entry includes a code to identify the sequence, the sequence itself and a table of features of biological interest such as coding regions with their protein translations, repeats and functional sites. Bibliographic information, such as literature references, experimental details, author contact information and cross-links to other databases, is also included. For further information see:

http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html

Where EMBL format is used for input, the EMBL format will be kept in step with the latest version of the database, although data in older EMBL formats will still be parsable.

Where EMBL format is used for output, fields for which data are available will be completed and others with no information are omitted. Exactly what data will be present depends very much on the source of input sequences. The EMBOSS command line allows data, such as accession numbers, to be provided if they do not form part of the input sequence data (see Section 6.4, “Datatype-specific Command Line Qualifiers”).

ID   X65923; SV 1; linear; mRNA; STD; HUM; 518 BP.
XX
AC   X65923;
XX
DT   13-MAY-1992 (Rel. 31, Created)
DT   18-APR-2005 (Rel. 83, Last updated, Version 11)
XX
DE   H.sapiens fau mRNA
XX
KW   fau gene.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;
OC   Homo.
XX
RN   [1]
RP   1-518
RA   Michiels L.M.R.;
RT   ;
RL   Submitted (29-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   L.M.R. Michiels, University of Antwerp, Dept of Biochemistry,
RL   Universiteisplein 1, 2610 Wilrijk, BELGIUM
XX
RN   [2]
RP   1-518
RX   PUBMED; 8395683.
RA   Michiels L., Van der Rauwelaert E., Van Hasselt F., Kas K., Merregaert J.;
RT   "fau cDNA encodes a ubiquitin-like-S30 fusion protein and is expressed as
RT   an antisense sequence in the Finkel-Biskis-Reilly murine sarcoma virus";
RL   Oncogene 8(9):2537-2546(1993).
XX
DR   H-InvDB; HIT000322806.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..518
FT                   /organism="Homo sapiens"
FT                   /chromosome="11q"
FT                   /map="13"
FT                   /mol_type="mRNA"
FT                   /clone_lib="cDNA"
FT                   /clone="pUIA 631"
FT                   /tissue_type="placenta"
FT                   /db_xref="taxon:9606"
FT   misc_feature    57..278
FT                   /note="ubiquitin like part"
FT   CDS             57..458
FT                   /gene="fau"
FT                   /db_xref="GDB:135476"
FT                   /db_xref="GOA:P62861"
FT                   /db_xref="HGNC:3597"
FT                   /db_xref="HSSP:1GJZ"
FT                   /db_xref="InterPro:IPR006846"
FT                   /db_xref="UniProtKB/Swiss-Prot:P35544"
FT                   /protein_id="CAA46716.1"
FT                   /translation="MQLFVRAQELHTFEVTGQETVAQIKAHVASLEGIAPEDQVVLLAG
FT                   APLEDEATLGQCGVEALTTLEVAGRMLGGKVHGSLARAGKVRGQTPKVAKQEKKKKKTG
FT                   RAKRRMQYNRRFVNVVPTFGKKKGPNANS"
FT   misc_feature    98..102
FT                   /note="nucleolar localization signal"
FT   misc_feature    279..458
FT                   /note="S30 part"
FT   polyA_signal    484..489
FT   polyA_site      509
XX
SQ   Sequence 518 BP; 125 A; 139 C; 148 G; 106 T; 0 other;
     ttcctctttc tcgactccat cttcgcggta gctgggaccg ccgttcagtc gccaatatgc        60
     agctctttgt ccgcgcccag gagctacaca ccttcgaggt gaccggccag gaaacggtcg       120
     cccagatcaa ggctcatgta gcctcactgg agggcattgc cccggaagat caagtcgtgc       180
     tcctggcagg cgcgcccctg gaggatgagg ccactctggg ccagtgcggg gtggaggccc       240
     tgactaccct ggaagtagca ggccgcatgc ttggaggtaa agttcatggt tccctggccc       300
     gtgctggaaa agtgagaggt cagactccta aggtggccaa acaggagaag aagaagaaga       360
     agacaggtcg ggctaagcgg cggatgcagt acaaccggcg ctttgtcaac gttgtgccca       420
     cctttggcaa gaagaagggc cccaatgcca actcttaagt cttttgtaat tctggctttc       480
     tctaataaaa aagccactta gttcagtcaa aaaaaaaa                               518
//
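
In the example above every line starts with a two-letter code, the sequence follows the SQ line, and the entry ends with '//'. The Python sketch below uses only those conventions to pull out a few fields. It is a minimal illustration and not the EMBOSS parser: it reads only the ID, AC, DE and SQ lines, and the name read_embl_entry is hypothetical.

```python
def read_embl_entry(text):
    """Pull ID, accession, description and sequence from one EMBL-style entry.

    Minimal sketch, not the EMBOSS parser: it reads only the ID, AC, DE and
    SQ lines and ignores every other line type.
    """
    entry = {"id": None, "accession": None, "description": "", "sequence": ""}
    in_sequence = False
    chunks = []
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of entry
        if in_sequence:
            # Sequence lines end with a running base count; drop pure digits.
            chunks.extend(tok for tok in line.split() if not tok.isdigit())
            continue
        code, rest = line[:2], line[5:].strip()
        if code == "ID":
            entry["id"] = rest.split(";")[0].strip()
        elif code == "AC" and entry["accession"] is None:
            entry["accession"] = rest.split(";")[0].strip()
        elif code == "DE":
            entry["description"] = (entry["description"] + " " + rest).strip()
        elif code == "SQ":
            in_sequence = True
    entry["sequence"] = "".join(chunks)
    return entry
```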

A.1.11. Experiment (Staden)

The format used by the latest version of the Staden package. Staden is a package of programs for sequence handling and analysis that is particularly useful for analysis of sequence trace data and large scale sequence assembly projects. For further information see:

http://staden.sourceforge.net/
ID   xb63c7.s1
EN   xb63c7.s1
LN   xb63c7.s1.ztr
LT   ZTR
QR   440
AQ   29.960000
AV   42 42 42 42 42 42 42 25 27 31 34 33 38 39 40 40 40 38 37 37 37
AV        38 38 38 38 37 37 38 38 38 38 37 38 38 39 39 39 38 37 38
AV        38 39 39 39 39 39 39 40 40 40 39 39 39 40 39 39 39 39 39
AV        39 37 37 37 37 37 36 36 37 36 36 36 34 34 34 34 37 37 37
AV        37 37 37 37 37 36 34 36 36 36 36 37 34 32 31 32 31 33 33
AV        31 31 33 30 29 29 27 27 30 32 33 31 32 31 31 33 33 32 36
AV        36 36 36 36 36 33 33 34 34 36 34 33 34 32 32 33 31 31 32
AV        29 30 31 30 31 30 31 26 26 30 25 26 27 26 25 26 26 25 27
AV        27 22 22 26 26 25 27 27 27 30 27 29 30 30 27 27 29 29 29
AV        29 29 30 29 29 29 30 30 29 29 30 24 29 30 30 30 30 30 31
AV        30 31 32 30 33 34 34 34 36 34 34 33 33 33 33 34 27 26 27
AV        26 26 25 26 27 27 27 29 30 29 29 29 34 36 30 31 30 24 25
AV        25 24 24 25 25 25 22 22 22 22 27 27 27 33 34 34 34 33 33
AV        33 33 36 36 34 34 32 33 30 29 29 30 31 31 30 29 29 29 29
AV        29 29 29 29 32 34 34 32 29 27 30 30 30 29 29 29 29 30 29
AV        29 23 23 25 26 26 25 25 24 26 27 26 25 25 26 26 24 22 22
AV        23 24 24 24 25 24 24 24 24 25 21 21 25 27 27 27 27 27 23
AV        22 22 22 22 23 23 27 22 21 22 21 22 22 21 22 23 24 24 21
AV        22 22 17 21 22 21 21 19 20 17 19 18 18 18 19 19 16 19 16
AV        16 15 16 16 13 16 14 13 12 12 13 12 11 4 11 10 10 10 10
AV        4 4 11 11 11 10 10 10 12 12 11 11 10 9 9 8 8 8 4 4 8 9
AV        9 4 9 9 9 10 12 13 12 13 4 16 15 17 17 4 17 15 18 19 19
AV        20 17 19 19 20 16 4 39 4 39 39 39 4 39
SQ   
     CAGGTTCGAC TCTAGAGGAT CCCCTGAAAT ATTAAAACTA AAATGTGTAT AATAAAAATT
     GTATACCAAT TTCAGTGATA AATAATTTAT TTTATAGAAA AAAGAAGAAC AAAGCTGATG
     ATTAAAACTG AACTCGATTT TCTGATTGGA AGAACTTGTA CCAATCGATG ATATGAGATG
     TTAAAAACTG GAATTGATAT TTAACCGATT GAACCTGAAT GAAAAACAAC GGACCTGAAA
     ATTAAATTAT TATTTTAAAT TGACATTTTG AAAATTTCCC CCGTAATTTT ATTGCAATTT
     TAATTGAAAG TTTATTAATT GTGAAATGTG CTTTTTAAGA TGTTGCAAAC ACCTAATTAC
     TATTTTCACT TTTGAG-TAT GT--ATTTTC TAAATAACTT --GGT-TGAT TTCC-AATT-
     TAATTTTCAA AAG-CCA-G
//
SF   /home6/jkb/work/course/t/m13mp18.vector
CF   /home6/jkb/work/course/t/lorist6.vector
TN   xb63c7
PR   1
SC   6249
SP   41
SI   1400..2000
CH   0
QR   377
QL   0
SL   24
SR   440

A.1.12. FASTA

Standard FASTA format. FASTA is probably the most widely used of all sequence formats and may hold multiple sequences. It was originally defined as sequence query input to the FASTA homology search program. For further information see:

http://fasta.bioch.virginia.edu/

A sequence entry has a one-line header followed by one or more lines of sequence data. The header line begins with the '>' character. The next word on the line is interpreted as a sequence-specific code, such as a database identifier (ID) or accession number. The rest of the header line is (typically) a description of the sequence. Some versions use control-A to mark line breaks in the description. These are ignored by EMBOSS and most other applications. The sequence itself is given on the following lines, which are recommended to be shorter than 80 characters. Blank lines are ignored.

In practice there are many different styles of identifier and description in use. EMBOSS supports the following styles in which ID is the sequence identifier, Accession is the accession number, given (optionally) after the ID, and Description is the rest of the header line e.g.:

>ID Description
>ID Accession Description

Sequence output specified to be in fasta format will use the 'FASTA (with accession)' variant, shown below.
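
The basic record structure described in this section (a '>' header, an identifier word, a free-text description, then sequence lines with blank lines ignored) can be sketched in Python as below. This is an illustration only. The name read_fasta is hypothetical, and the sketch makes no attempt to recognise accession numbers in the way EMBOSS does, so anything after the first word is returned as the description.

```python
def read_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record.

    Illustrative sketch only. The first word after '>' is treated as the
    identifier and the rest of the line as the description.
    """
    ident, desc, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if ident is not None:
                yield ident, desc, "".join(chunks)
            header = line[1:].split(None, 1)
            ident = header[0] if header else ""
            desc = header[1] if len(header) > 1 else ""
            chunks = []
        elif ident is not None:
            chunks.append(line)
    if ident is not None:
        yield ident, desc, "".join(chunks)
```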

A.1.13. FASTA (GCG)

GCG-style FASTA format. An optional database name (Database) may be included as part of the sequence identifier:

>Database:ID Accession Description

>embl:X65923 X65923.1 H.sapiens fau mRNA
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa

A.1.14. FASTA (Pearson)

This is standard FASTA format where the identifier is taken as-is without any parsing. This allows users to keep the original identifier in cases where EMBOSS would normally interpret it using one of the other FASTA styles. To use this format, you must explicitly specify "pearson" as the sequence format.

>gnl|em|HSFAU X65923 H.sapiens fau mRNA
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa

A.1.15. FASTA (with accession)

This is standard FASTA format but with the accession number or sequence version included after the identifier. The first word of the description is accepted as an accession number and/or sequence version if it matches a recognized format, for example EMBL/GenBank or SwissProt accession numbers.

The format is detected with other FASTA formats on input, and the accession number is automatically added by EMBOSS where it is available.

>X65923 X65923.1 H.sapiens fau mRNA
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa

A.1.16. FASTA (database and identifier)

This is a derivative FASTA format with the database name given first in the description line, followed (optionally) by the accession number:

>Database Description
>Database Accession Description

The format is detected with other FASTA formats on input, but cannot be specified for output.

>embl:X65923 X65923.1 H.sapiens fau mRNA
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa

A.1.17. FASTA (GI style)

Same as FASTA (NCBI style) except that the sequence GI code is given instead of the entry ID in the description line. The description line contains the entry GI, database name, and optional accession number separated by pipe ('|') characters:

>gi|12345|gnl|Database|Accession Description

The format is detected with other FASTA formats on input, and can be specified as "gifasta" for output.

>gi|31302|gnl|genbank|X65923 (X65923.1) H.sapiens fau mRNA.
TTCCTCTTTCTCGACTCCATCTTCGCGGTAGCTGGGACCGCCGTTCAGTCGCCAATATGC
AGCTCTTTGTCCGCGCCCAGGAGCTACACACCTTCGAGGTGACCGGCCAGGAAACGGTCG
CCCAGATCAAGGCTCATGTAGCCTCACTGGAGGGCATTGCCCCGGAAGATCAAGTCGTGC
TCCTGGCAGGCGCGCCCCTGGAGGATGAGGCCACTCTGGGCCAGTGCGGGGTGGAGGCCC
TGACTACCCTGGAAGTAGCAGGCCGCATGCTTGGAGGTAAAGTTCATGGTTCCCTGGCCC
GTGCTGGAAAAGTGAGAGGTCAGACTCCTAAGGTGGCCAAACAGGAGAAGAAGAAGAAGA
AGACAGGTCGGGCTAAGCGGCGGATGCAGTACAACCGGCGCTTTGTCAACGTTGTGCCCA
CCTTTGGCAAGAAGAAGGGCCCCAATGCCAACTCTTAAGTCTTTTGTAATTCTGGCTTTC
TCTAATAAAAAAGCCACTTAGTTCAGTCAAAAAAAAAA

A.1.18. FASTA (NCBI style)

FASTA format with an NCBI-style description line in which the database name, entry ID and optional accession or sequence version number are separated by pipe ('|') characters. NCBI has a very limited vocabulary of approved database names, which can be used if exactly that database name was used or specified with the -osdbname qualifier.

>gnl|Database|Accession Description
>Database|Seqversion|ID Description

The format is detected with other FASTA formats on input, and can be specified as "ncbi" for output.

There are many variants on this theme. If you find one that does not appear to be supported please let the EMBOSS developers know.

>emb|X65923.1|X65923 H.sapiens fau mRNA
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
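
To make the two header patterns listed in this section concrete, the Python sketch below splits the first word of the header on pipe characters. It covers only those two patterns and is not how EMBOSS itself handles the many variants in use. The function name split_ncbi_header and the dictionary keys are hypothetical.

```python
def split_ncbi_header(header):
    """Split an NCBI-style FASTA header into its pipe-separated fields.

    Sketch only: handles just the two three-field patterns
    'gnl|Database|Accession' and 'Database|Seqversion|ID'.
    """
    if header.startswith(">"):
        header = header[1:]
    token, _, description = header.partition(" ")
    fields = token.split("|")
    if len(fields) != 3:
        raise ValueError("expected three pipe-separated fields")
    if fields[0] == "gnl":
        # >gnl|Database|Accession Description
        return {"database": fields[1], "accession": fields[2],
                "description": description}
    # >Database|Seqversion|ID Description
    return {"database": fields[0], "seqversion": fields[1],
            "id": fields[2], "description": description}
```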

A.1.19. Fastq

Fastq is a new format created for storing very large numbers of short reads from next-generation sequencing instruments. There are a number of variants of fastq format which have to be explicitly named on the command line as there is no reliable method to detect the precise format automatically. These formats are described in detail below. All the variations are in the formatting of the quality scores, so EMBOSS also supports fastq as a simple sequence format, ignoring the quality scores, on input. When used on output, fastq is an alias for the fastq-sanger format as this is the one most commonly used.

Sequence output specified to be in fastq format will use the 'FASTQ (Sanger) style' variant, shown below.

A.1.20. Fastq (Illumina)

Fastq-illumina supports the Illumina 1.3 standard for representing phred quality scores encoded as one character per base. This format must be specified explicitly on the command line as data files are also valid as fastq-sanger format which allows all possible characters as quality scores.

@FASTQ-ILL100R:1:2:3:4#0/1
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTA
+
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@

A.1.21. Fastq (Sanger)

Fastq-sanger supports the Sanger Institute standard for representing phred quality scores encoded as one character per base. This format must be specified explicitly on the command line as data files are also valid as fastq-illumina or fastq-solexa format which accept many of the same characters as quality scores.

As an output format this is the current EMBOSS default if "fastq" is specified.

@FASTQ-SAN100R:1:2:3:4#0/1
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAC
+
~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!
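
The "one character per base" encoding can be illustrated with a short Python sketch. The ASCII offsets used here (33 for fastq-sanger, 64 for fastq-illumina) are the commonly documented values and are an assumption of this sketch rather than something stated in the text above. The function names are hypothetical.

```python
# ASCII offsets commonly documented for these encodings (assumed here):
# Sanger uses 33, Illumina 1.3 uses 64.
OFFSETS = {"fastq-sanger": 33, "fastq-illumina": 64}


def decode_phred(quality, fmt="fastq-sanger"):
    """Convert a FASTQ quality string into a list of integer phred scores."""
    offset = OFFSETS[fmt]
    scores = [ord(ch) - offset for ch in quality]
    if any(q < 0 for q in scores):
        raise ValueError("character below the %s offset" % fmt)
    return scores


def read_fastq_record(lines):
    """Split a four-line FASTQ record into (id, sequence, quality string)."""
    title, seq, plus, qual = (ln.rstrip("\n") for ln in lines[:4])
    if not title.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return title[1:], seq, qual
```

Under these assumptions the same quality string decodes to different scores depending on the format named, which is one way to see why the variant cannot be detected reliably from the data alone.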

A.1.22. Fastq (Solexa)

Fastq-solexa supports the Illumina 1.0 or Solexa standard for representing solexa quality scores encoded as one character per base. This format must be specified explicitly on the command line as data files are also valid as fastq-sanger format which allows all possible characters as quality scores.

@FASTQ-SLX100R:1:2:3:4#0/1
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAC
+
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
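
Solexa quality scores are on a different scale from phred scores, so they need converting before the two can be compared. The Python sketch below applies the standard relation Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). The ASCII offset of 64 is the commonly documented value for this encoding and is an assumption here, as are the function names.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa quality score to the equivalent phred score.

    Uses the relation Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1).
    """
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def decode_solexa(quality, offset=64):
    """Return Solexa scores for a fastq-solexa quality string.

    The offset of 64 is the commonly documented value and is an assumption
    of this sketch.
    """
    return [ord(ch) - offset for ch in quality]
```

The two scales are nearly identical for high scores and diverge for low ones, where Solexa scores can be negative.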

A.1.23. Fitch

Fitch is an old format, not in common use, for phylogenetic analysis by Walter Fitch's programs. It is one of the formats originally adopted from the readseq program.

X65923, 518 bases
 ttc ctc ttt ctc gac tcc atc ttc gcg gta gct ggg acc gcc gtt cag tcg cca ata tgc
 agc tct ttg tcc gcg ccc agg agc tac aca cct tcg agg tga ccg gcc agg aaa cgg tcg
 ccc aga tca agg ctc atg tag cct cac tgg agg gca ttg ccc cgg aag atc aag tcg tgc
 tcc tgg cag gcg cgc ccc tgg agg atg agg cca ctc tgg gcc agt gcg ggg tgg agg ccc
 tga cta ccc tgg aag tag cag gcc gca tgc ttg gag gta aag ttc atg gtt ccc tgg ccc
 gtg ctg gaa aag tga gag gtc aga ctc cta agg tgg cca aac agg aga aga aga aga aga
 aga cag gtc ggg cta agc ggc gga tgc agt aca acc ggc gct ttg tca acg ttg tgc cca
 cct ttg gca aga aga agg gcc cca atg cca act ctt aag tct ttt gta att ctg gct ttc
 tct aat aaa aaa gcc act tag ttc agt caa aaa aaa aa

A.1.24. GCG 8, GCG 9.x and 10.x

GCG format was used by Accelrys GCG, formerly known as the GCG Wisconsin package. GCG was a commercial software package of programs and utilities for gene and protein analysis. For further information see:

http://www.accelrys.com/products/gcg/

A sequence file in GCG format must contain a single sequence only. Such files begin with one or more description lines with informative text about the file contents. The start of the sequence is marked by a line ending with two dot (..) characters. This line typically also gives the sequence length, the date when the file was created, the type of the sequence and the GCG checksum. The dots delimit the descriptive information from the sequence data that follows. In GCG 9.x and GCG 10.x formats, the format and sequence type are identified on the first line of the file. In GCG 8.x format, anything up to the first line containing .. is considered as heading, and the remainder is sequence data.

When GCG format is specified as the input, EMBOSS will first assume the later format (GCG 9.x and 10.x) before trying with the GCG 8 format. When GCG format is specified for output, the later format will be generated regardless of whether GCG or GCG8 was specified.

Latest format:

!!NA_SEQUENCE 1.0

H.sapiens fau mRNA

HSFAU  Length: 518  Type: N  Check: 2981 ..

   1 ttcctctttc tcgactccat cttcgcggta gctgggaccg ccgttcagtc

  51 gccaatatgc agctctttgt ccgcgcccag gagctacaca ccttcgaggt

 101 gaccggccag gaaacggtcg cccagatcaa ggctcatgta gcctcactgg

 151 agggcattgc cccggaagat caagtcgtgc tcctggcagg cgcgcccctg

 201 gaggatgagg ccactctggg ccagtgcggg gtggaggccc tgactaccct

 251 ggaagtagca ggccgcatgc ttggaggtaa agttcatggt tccctggccc

 301 gtgctggaaa agtgagaggt cagactccta aggtggccaa acaggagaag

 351 aagaagaaga agacaggtcg ggctaagcgg cggatgcagt acaaccggcg

 401 ctttgtcaac gttgtgccca cctttggcaa gaagaagggc cccaatgcca

 451 actcttaagt cttttgtaat tctggctttc tctaataaaa aagccactta

 501 gttcagtcaa aaaaaaaa

GCG 8 format:

H.sapiens fau mRNA

HSFAU  Length: 518  Type: N  Check: 2981 ..

   1 ttcctctttc tcgactccat cttcgcggta gctgggaccg ccgttcagtc

  51 gccaatatgc agctctttgt ccgcgcccag gagctacaca ccttcgaggt

 101 gaccggccag gaaacggtcg cccagatcaa ggctcatgta gcctcactgg

 151 agggcattgc cccggaagat caagtcgtgc tcctggcagg cgcgcccctg

 201 gaggatgagg ccactctggg ccagtgcggg gtggaggccc tgactaccct

 251 ggaagtagca ggccgcatgc ttggaggtaa agttcatggt tccctggccc

 301 gtgctggaaa agtgagaggt cagactccta aggtggccaa acaggagaag

 351 aagaagaaga agacaggtcg ggctaagcgg cggatgcagt acaaccggcg

 401 ctttgtcaac gttgtgccca cctttggcaa gaagaagggc cccaatgcca

 451 actcttaagt cttttgtaat tctggctttc tctaataaaa aagccactta

 501 gttcagtcaa aaaaaaaa
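
The role of the '..' line as the divider between heading and sequence can be sketched in Python as follows. This is an illustration, not the EMBOSS reader: it treats everything up to and including the first line ending in '..' as heading, strips position numbers and whitespace from what follows, and does not verify the checksum. The name read_gcg is hypothetical.

```python
def read_gcg(text):
    """Return (heading_lines, sequence) from single-sequence GCG-style text.

    Sketch only: everything up to and including the first line ending in
    '..' is treated as heading; digits and whitespace after it are dropped.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no line ending in '..' found")
    heading = lines[: index + 1]
    residues = []
    for line in lines[index + 1:]:
        residues.extend(
            ch for ch in line if not ch.isdigit() and not ch.isspace()
        )
    return heading, "".join(residues)
```

The same function handles both layouts shown above, since the only difference between them is the extra first line in the later format.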

A.1.25. GenBank

GenBank entry format supports all the fields in the latest database format. GenBank is part of the International Nucleotide Sequence Database Collaboration and uses the same content as the EMBL and DDBJ databases. The format is described in greater detail at:

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

Where GenBank format is used for output, fields for which data are available will be completed and others with no information will be omitted. Exactly what data will be present depends very much on the source of input sequences. The EMBOSS command line allows data, such as accession numbers, to be provided if they do not form part of the input sequence data (see Section 6.4, “Datatype-specific Command Line Qualifiers”).

LOCUS       X65923                   518 bp    mRNA    linear   PRI 18-APR-2005
DEFINITION  H.sapiens fau mRNA.
ACCESSION   X65923
VERSION     X65923.1  GI:31302
KEYWORDS    fau gene.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 518)
  AUTHORS   Michiels,L., Van der Rauwelaert,E., Van Hasselt,F., Kas,K. and
            Merregaert,J.
  TITLE     fau cDNA encodes a ubiquitin-like-S30 fusion protein and is
            expressed as an antisense sequence in the Finkel-Biskis-Reilly
            murine sarcoma virus
  JOURNAL   Oncogene 8 (9), 2537-2546 (1993)
   PUBMED   8395683
REFERENCE   2  (bases 1 to 518)
  AUTHORS   Michiels,L.M.R.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1992) L.M.R. Michiels, University of Antwerp,
            Dept of Biochemistry, Universiteisplein 1, 2610 Wilrijk, BELGIUM
FEATURES             Location/Qualifiers
     source          1..518
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="11q"
                     /map="13"
                     /clone="pUIA 631"
                     /tissue_type="placenta"
                     /clone_lib="cDNA"
     gene            1..518
                     /gene="fau"
     CDS             57..458
                     /gene="fau"
                     /codon_start=1
                     /protein_id="CAA46716.1"
                     /db_xref="GI:31303"
                     /db_xref="GDB:135476"
                     /db_xref="GOA:P35544"
                     /db_xref="GOA:P62861"
                     /db_xref="GOA:Q05472"
                     /db_xref="HGNC:3597"
                     /db_xref="UniProtKB/Swiss-Prot:P35544"
                     /db_xref="UniProtKB/Swiss-Prot:P62861"
                     /translation="MQLFVRAQELHTFEVTGQETVAQIKAHVASLEGIAPEDQVVLLA
                     GAPLEDEATLGQCGVEALTTLEVAGRMLGGKVHGSLARAGKVRGQTPKVAKQEKKKKK
                     TGRAKRRMQYNRRFVNVVPTFGKKKGPNANS"
     misc_feature    57..278
                     /gene="fau"
                     /note="ubiquitin like part"
     misc_feature    98..102
                     /gene="fau"
                     /note="nucleolar localization signal"
     misc_feature    279..458
                     /gene="fau"
                     /note="S30 part"
     polyA_signal    484..489
                     /gene="fau"
     polyA_site      509
                     /gene="fau"
ORIGIN
       1  TTCCTCTTTC TCGACTCCAT CTTCGCGGTA GCTGGGACCG CCGTTCAGTC GCCAATATGC
      61  AGCTCTTTGT CCGCGCCCAG GAGCTACACA CCTTCGAGGT GACCGGCCAG GAAACGGTCG
     121  CCCAGATCAA GGCTCATGTA GCCTCACTGG AGGGCATTGC CCCGGAAGAT CAAGTCGTGC
     181  TCCTGGCAGG CGCGCCCCTG GAGGATGAGG CCACTCTGGG CCAGTGCGGG GTGGAGGCCC
     241  TGACTACCCT GGAAGTAGCA GGCCGCATGC TTGGAGGTAA AGTTCATGGT TCCCTGGCCC
     301  GTGCTGGAAA AGTGAGAGGT CAGACTCCTA AGGTGGCCAA ACAGGAGAAG AAGAAGAAGA
     361  AGACAGGTCG GGCTAAGCGG CGGATGCAGT ACAACCGGCG CTTTGTCAAC GTTGTGCCCA
     421  CCTTTGGCAA GAAGAAGGGC CCCAATGCCA ACTCTTAAGT CTTTTGTAAT TCTGGCTTTC
     481  TCTAATAAAA AAGCCACTTA GTTCAGTCAA AAAAAAAA
//
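
In the entry above the sequence sits between the ORIGIN line and the terminating // line, with a position number at the start of each row and spaces between blocks of ten bases. The following Python sketch shows one way to recover the plain sequence and to read simple start..end feature locations. It is an illustration only and not the EMBOSS parser; the function names genbank_sequence and simple_location are hypothetical, and compound locations such as join(...) are deliberately not handled.

```python
def genbank_sequence(entry_text):
    """Return the residues found between ORIGIN and // in one entry."""
    parts = []
    in_origin = False
    for line in entry_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Keep letters only: this drops the row number and block spacing.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(parts)


def simple_location(location):
    """Parse 'start..end' or a single position into a (start, end) tuple."""
    cleaned = "".join(location.split())  # tolerate stray whitespace
    start, sep, end = cleaned.partition("..")
    return (int(start), int(end)) if sep else (int(start), int(start))


# Abbreviated fragment of the entry above, for illustration only.
fragment = """ORIGIN
        1  TTCCTCTTTC TCGACTCCAT
//
"""
print(genbank_sequence(fragment))
print(simple_location("57..458"))
```

A single position such as the polyA_site at 509 is returned as the pair (509, 509).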

A.1.26. GenPept

The GenPept entry format supports all the fields in the latest database format. GenPept is an automatic translation of the coding sequences in GenBank. The database is available from:

ftp://ftp.ncifcrf.gov/pub/genpept

Where GenPept format is used for output, fields for which data are available will be completed and others with no information will be omitted. Exactly what data will be present depends very much on the source of input sequences. The EMBOSS command line allows data, such as accession numbers, to be provided if they do not form part of the input sequence data (see Section 6.4, “Datatype-specific Command Line Qualifiers”).

GenPept currently uses the same parser as the closely related RefseqP format so these can be used interchangeably until the original formats diverge.

LOCUS       CAA46716                 133 aa            linear   PRI 18-APR-2005
DEFINITION  fau [Homo sapiens].
ACCESSION   CAA46716
VERSION     CAA46716.1  GI:31303
DBSOURCE    embl accession X65923.1
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (residues 1 to 133)
  AUTHORS   Michiels,L., Van der Rauwelaert,E., Van Hasselt,F., Kas,K. and
            Merregaert,J.
  TITLE     fau cDNA encodes a ubiquitin-like-S30 fusion protein and is
            expressed as an antisense sequence in the Finkel-Biskis-Reilly
            murine sarcoma virus
  JOURNAL   Oncogene 8 (9), 2537-2546 (1993)
   PUBMED   8395683
REFERENCE   2  (residues 1 to 133)
  AUTHORS   Michiels,L.M.R.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1992) L.M.R. Michiels, University of Antwerp,
            Dept of Biochemistry, Universiteisplein 1, 2610 Wilrijk, BELGIUM
FEATURES             Location/Qualifiers
     source          1..133
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="11q"
                     /map="13"
                     /clone="pUIA 631"
                     /tissue_type="placenta"
                     /clone_lib="cDNA"
     Protein         1..133
                     /name="fau"
     Region          1..74
                     /region_name="Fubi"
                     /note="Fubi is a ubiquitin-like protein encoded by the fau
                     gene which has an  N-terminal ubiquitin-like domain (also
                     referred to as FUBI) fused to the ribosomal protein S30.
                     Fubi is thought to be a tumor suppressor protein and the
                     FUBI domain may act as a...; cd01793"
                     /db_xref="CDD:29195"
     Region          74..133
                     /region_name="Ribosomal_S30"
                     /note="Ribosomal protein S30; cl02062"
                     /db_xref="CDD:141357"
     CDS             1..133
                     /gene="fau"
                     /coded_by="X65923.1:57..458"
                     /db_xref="GDB:135476"
                     /db_xref="GOA:P35544"
                     /db_xref="GOA:P62861"
                     /db_xref="GOA:Q05472"
                     /db_xref="HGNC:3597"
                     /db_xref="UniProtKB/Swiss-Prot:P35544"
                     /db_xref="UniProtKB/Swiss-Prot:P62861"
ORIGIN      
        1 mqlfvraqel htfevtgqet vaqikahvas legiapedqv vllagapled eatlgqcgve
       61 alttlevagr mlggkvhgsl aragkvrgqt pkvakqekkk kktgrakrrm qynrrfvnvv
      121 ptfgkkkgpn ans
//
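
The /coded_by qualifier of the CDS feature points back to the nucleotide entry and the region that was translated. As a rough worked check, the Python sketch below derives the expected protein length from that region. It assumes the nucleotide span includes the stop codon, which holds for this example but is an assumption in general; the helper names are hypothetical.

```python
def parse_coded_by(value):
    """Split a simple 'accession:start..end' value into its three parts."""
    accession, _, span = value.partition(":")
    start, _, end = span.partition("..")
    return accession, int(start), int(end)


def expected_protein_length(start, end, includes_stop=True):
    """Number of residues implied by a coding region of start..end."""
    bases = end - start + 1
    if bases % 3 != 0:
        raise ValueError("coding region is not a whole number of codons")
    codons = bases // 3
    # Assumption: the final codon is a stop codon and is not translated.
    return codons - 1 if includes_stop else codons


accession, start, end = parse_coded_by("X65923.1:57..458")
print(accession, expected_protein_length(start, end))
```

For the example entry the region 57..458 covers 402 bases, or 134 codons, giving 133 residues once the stop codon is excluded. This agrees with the 133 aa on the LOCUS line.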

A.1.27. GFF3

The General Feature Format version 3 (GFF3) is an extension by Lincoln Stein and the Sequence Ontology project of the GFF2 format developed at the Sanger Institute for describing genes and other features associated with DNA, RNA and protein sequences (see Section A.2, “Supported Feature Formats”). GFF format is normally used to hold pure feature information only, but can hold the sequence immediately after the feature table. A complete specification of the format is available at:

http://www.sequenceontology.org/gff3.shtml

##gff-version 3
##sequence-region HSFAU 1 518
#!Date 2009-07-29
#!Type DNA
#!Source-version EMBOSS 6.1.0
##FASTA
>HSFAU X65923 H.sapiens fau mRNA
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
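
In the example above the sequence follows the ##FASTA directive in ordinary FASTA layout. The Python sketch below collects every sequence found after that directive. It is a minimal illustration and not the EMBOSS reader; the function name gff3_fasta_sequences is hypothetical, and feature lines before the directive are simply skipped.

```python
def gff3_fasta_sequences(text):
    """Return {identifier: sequence} for records after the ##FASTA line."""
    sequences = {}
    current = None
    in_fasta = False
    for line in text.splitlines():
        if line.startswith("##FASTA"):
            in_fasta = True
        elif not in_fasta:
            continue  # feature table and other directives are ignored here
        elif line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            sequences[current] = []
        elif current is not None and line.strip():
            sequences[current].append(line.strip())
    return {name: "".join(parts) for name, parts in sequences.items()}


# Abbreviated version of the example above, for illustration only.
sample = """##gff-version 3
##sequence-region HSFAU 1 518
##FASTA
>HSFAU X65923 H.sapiens fau mRNA
ttcctctttc
tcgactccat
"""
print(gff3_fasta_sequences(sample))
```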

A.1.28. GFF2

The General Feature Format version 2 (GFF) is a format developed at the Sanger Institute for describing genes and other features associated with DNA, RNA and protein sequences (see Section A.2, “Supported Feature Formats”). GFF format is normally used to hold pure feature information only, but can hold the sequence as part of the structured header. A complete specification of the format is available at:

http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

##gff-version 2
##source-version EMBOSS 6.1.0
##date 2009-07-29
##DNA HSFAU
##ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
##agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
##cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
##tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
##tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
##gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
##agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
##cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
##tctaataaaaaagccacttagttcagtcaaaaaaaaaa
##end-DNA
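
In the example above the sequence is held in the header between the ##DNA and ##end-DNA lines, with each sequence row prefixed by ##. The Python sketch below extracts the name and sequence from such a block. It is an illustration under those assumptions rather than the EMBOSS reader, and the function name gff2_header_sequence is hypothetical.

```python
def gff2_header_sequence(text):
    """Return (name, sequence) from a ##DNA ... ##end-DNA header block.

    Returns None when no complete block is found.
    """
    name = None
    parts = []
    for line in text.splitlines():
        if name is None:
            # Still looking for the start of the block.
            if line.startswith("##DNA"):
                fields = line.split()
                name = fields[1] if len(fields) > 1 else ""
        elif line.startswith("##end-DNA"):
            return name, "".join(parts)
        elif line.startswith("##"):
            parts.append(line[2:].strip())  # strip the ## prefix
    return None


# Abbreviated version of the example above, for illustration only.
sample = """##gff-version 2
##DNA HSFAU
##ttcctctttc
##tcgactccat
##end-DNA
"""
print(gff2_header_sequence(sample))
```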

A.1.29. Hennig86

Hennig86 is used by the Hennig86 package for interactive phylogenetic analysis. It allows phylogenetic trees to be read, written, drawn and analysed, and the most parsimonious trees to be calculated. For further information see:

http://www.cladistics.org/education/hennig86.html

xread
' Written by EMBOSS 29/07/09 '
518 1
X65923
11331311131320313301311323221023122203323321130213233001012302313111213323233302202310303033113202212033223302200032213233302013002231301210233130312202223011233332200201300213212313312230223232333312202201202233031312223302123222212202233312031033312200210230223323012311220221000211301221133312233321231220000212020221302031331002212233000302202002002002002002030221322231002322322012302103003322323111213003211212333033111223002002002223333001233003131100213111121001131223111313100100000023303110211302130000000000
;
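
This section does not describe how the bases are coded in the xread block. Comparing the start of the example with the HSFAU sequence shown for other formats suggests the mapping a=0, t=1, g=2, c=3. The Python sketch below assumes that inferred mapping; it is not taken from the Hennig86 or EMBOSS documentation, and the function name is hypothetical.

```python
# Assumption: mapping inferred from the example above, not from a specification.
DIGIT_TO_BASE = {"0": "a", "1": "t", "2": "g", "3": "c"}


def decode_hennig86_row(digits):
    """Translate a row of digit codes back into nucleotide letters.

    Raises KeyError for any character outside the assumed mapping.
    """
    return "".join(DIGIT_TO_BASE[digit] for digit in digits)


# First ten characters of the data row in the example above.
print(decode_hennig86_row("1133131113"))
```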

A.1.30. Intelligenetics

Intelligenetics format is used by Intelligenetics, an old sequence analysis package which is no longer under development. It is also used by various other packages as it is a relatively simple format. Its drawback is that non-sequence files can sometimes look like Intelligenetics files. EMBOSS has a format "igstrict" which requires full compliance with this format. Less strict files, for example files lacking a 1 at the end of the sequence, can be read only if "ig" is specified as the format on the command line.

;H.sapiens fau mRNA, 518 bases
HSFAU
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtc
gccaatatgcagctctttgtccgcgcccaggagctacacaccttcgaggt
gaccggccaggaaacggtcgcccagatcaaggctcatgtagcctcactgg
agggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctg
gaggatgaggccactctgggccagtgcggggtggaggccctgactaccct
ggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaag
aagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcg
ctttgtcaacgttgtgcccacctttggcaagaagaagggccccaatgcca
actcttaagtcttttgtaattctggctttctctaataaaaaagccactta
gttcagtcaaaaaaaaaa1
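
The layout shown above (comment lines starting with ';', an identifier line, then sequence lines ending in 1) can be read with a few lines of Python. The sketch below is an illustration only: the strict flag loosely mirrors the difference between "igstrict" and "ig" by insisting on the terminating 1, but the checks EMBOSS actually applies are not described here, and the function name parse_ig is hypothetical.

```python
def parse_ig(text, strict=True):
    """Return (name, comments, sequence) for one Intelligenetics entry."""
    comments = []
    name = None
    parts = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if name is None and line.startswith(";"):
            comments.append(line[1:].strip())
        elif name is None:
            name = line.strip()  # first non-comment line is the identifier
        else:
            parts.append(line.replace(" ", ""))
    sequence = "".join(parts)
    if sequence.endswith("1"):
        sequence = sequence[:-1]  # drop the end-of-sequence marker
    elif strict:
        raise ValueError("sequence does not end with the terminating 1")
    return name, comments, sequence


# Abbreviated version of the example above, for illustration only.
sample = """;H.sapiens fau mRNA, 518 bases
HSFAU
ttcctc
tttc1
"""
print(parse_ig(sample))
```

Calling parse_ig with strict=False accepts an entry that lacks the final 1.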

A.1.31. Jackknifer

Jackknifer format is used by the Parsimony Jackknifer (JAC) or PHYSYS phylogenetics package. The format was among those adopted from the readseq package. For further information see:

http://evolution.genetics.washington.edu/phylip/software.old.html

' Written by EMBOSS 29/07/09 
(IXI_234)            TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
(IXI_235)            TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
(IXI_236)            TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
(IXI_237)            TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT
(IXI_234)            GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
(IXI_235)            GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAG
(IXI_236)            GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR--G
(IXI_237)            GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR--G
(IXI_234)            SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
(IXI_235)            SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
(IXI_236)            SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
(IXI_237)            SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE
;

A.1.32. Jackknifer (non-interleaved)

Jackknifernon format is the non-interleaved version of the format used by the Parsimony Jackknifer (JAC) or PHYSYS package for jackknifed sequences. The format was among those adopted from the readseq package. For further information see:

http://evolution.genetics.washington.edu/phylip/software.old.html

' Written by EMBOSS 29/07/09 
(IXI_234)            TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
(IXI_235)            TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAG
SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
(IXI_236)            TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR--G
SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
(IXI_237)            TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT
GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR--G
SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE
;

A.1.33. MASE

MASE format is used by the SeaView multiple alignment editor. For further information see:

http://pbil.univ-lyon1.fr/software/seaview.html

This style is not automatically detected by EMBOSS as it may accept non-sequence data. However, most MASE format data is accepted using the "igstrict" format.

;;Written by EMBOSS on Wed 29 Jul 2009 10:48:43
;H.sapiens fau mRNA
X65923
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa

A.1.34. MEGA

Mega is the format used by the MEGA package. MEGA is an integrated tool for automatic and manual sequence alignment, inferring phylogenetic trees, mining web-based databases, estimating rates of molecular evolution, and testing evolutionary hypotheses. For further information see:

http://www.megasoftware.net/

#mega
!Title: Written by EMBOSS 29/07/09;
!Format
    DataType=Protein DataFormat=Interleaved
    Identical=. Indel=- Missing=?
    ;




#IXI_234 TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
#IXI_235 ...............---------..........................
#IXI_236 ......................--.....P....P.........P.....
#IXI_237 .....L.................-...........----...........

#IXI_234 GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
#IXI_235 ........................----------................
#IXI_236 ...............................................--.
#IXI_237 ..Y....................Y.......................--.

#IXI_234 SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
#IXI_235 ...............................
#IXI_236 ...P....P.............PP.......
#IXI_237 ..............L........Y.......
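
The !Format statement in the example above declares Identical=. so a '.' in a row stands for the residue found at the same column in the first sequence. The Python sketch below expands such a row back to explicit residues. It is an illustration only; the function name expand_identical is hypothetical, and the indel and missing-data symbols are passed through unchanged.

```python
def expand_identical(reference, row, identical="."):
    """Replace each identity symbol in row with the reference residue."""
    if len(reference) != len(row):
        raise ValueError("reference and row must have the same length")
    return "".join(
        ref if symbol == identical else symbol
        for ref, symbol in zip(reference, row)
    )


# First ten columns of IXI_234 and IXI_237 from the example above.
reference = "TSPASIRPPA"
row_237 = ".....L...."
print(expand_identical(reference, row_237))
```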

A.1.35. MEGA (non-interleaved)

Meganon is the non-interleaved version of the format used by the MEGA package. For further information see:

http://www.megasoftware.net/

#mega
!Title: Written by EMBOSS 29/07/09;
!Format
    DataType=Protein
    Identical=. Indel=- Missing=?
    ;

#IXI_234
TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQATGGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAGSRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
#IXI_235
...............---------..................................................----------...............................................
#IXI_236
......................--.....P....P.........P....................................................--....P....P.............PP.......
#IXI_237
.....L.................-...........----.............Y....................Y.......................--...............L........Y.......

A.1.36. MSF

MSF is the format used for multiple sequences by Accelrys GCG, formerly known as the GCG Wisconsin Package. GCG was a commercial software package of programs and utilities for gene and protein analysis. For further information see:

http://www.accelrys.com/products/gcg/

!!AA_MULTIPLE_ALIGNMENT 1.0

  msf MSF:  131 Type: P 22/01/02 CompCheck: 3003 ..

  Name: IXI_234 Len: 131  Check: 6808 Weight: 1.00
  Name: IXI_235 Len: 131  Check: 4032 Weight: 1.00
  Name: IXI_236 Len: 131  Check: 2744 Weight: 1.00
  Name: IXI_237 Len: 131  Check: 9419 Weight: 1.00

//

           1                                               50
IXI_234    TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235    TSPASIRPPAGPSSR.........RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236    TSPASIRPPAGPSSRPAMVSSR..RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237    TSPASLRPPAGPSSRPAMVSSRR.RPSPPGPRRPT....CSAAPRRPQAT

           51                                             100
IXI_234    GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
IXI_235    GGWKTCSGTCTTSTSTRHRGRSGW..........RASRKSMRAACSRSAG
IXI_236    GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR..G
IXI_237    GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR..G

           101                         131
IXI_234    SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_235    SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_236    SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
IXI_237    SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE

A.1.37. NBRF / PIR

The NBRF or PIR format is that used in the PIR database, the integrated protein informatics resource for genomic and proteomic research. For further information see:

http://pir.georgetown.edu/pirwww

A file in PIR format may contain multiple sequences. The first line of each entry is a header beginning with '>', followed by a two-letter code describing the sequence type, a semicolon, and the sequence database ID code. The type codes are P1 for a complete protein sequence, F1 for a protein fragment, DL for a linear DNA sequence, DC for a circular DNA sequence, RL for a linear RNA sequence, RC for a circular RNA sequence, and XX for an unknown type. The second line contains a textual description of the sequence, followed by one or more lines with the sequence itself. The end of the sequence is marked by a '*' (asterisk) character, which does not imply a stop codon.

>P1;104K_THEAN
Example protein sequence in NBRF format. The final '*' is ignored.
MKFLVLLFNI LCLFPILGAD ELVMSPIPTT DVQPKVTFDI NSEVSSGPLY LNPVEMAGVK
YLQLQRQPGV QVHKVVEGDI VIWENEEMPL YTCAIVTQNE VPYMAYVELL EDPDLIFFLK
EGDQWAPIPE DQYLARLQQL RQQIHTESFF SLNLSFQHEN YKYEMVSSFQ HSIKMVVFTP
KNGHICKMVY DKNIRIFKAL YNEYVTSVIG FFRGLKLLLL NIFVIDDRGM IGNKYFQLLD
DKYAPISVQG YVATIPKLKD FAEPYHPIIL DISDIDYVNF YLGDATYHDP GFKIVPKTPQ
CITKVVDGNE VIYESSNPSV ECVYKVTYYD KKNESMLRLD LNHSPPSYTS YYAKREGVWV
TSTYIDLEEK IEELQDHRST ELDVMFMSDK DLNVVPLTNG NLEYFMVTPK PHRDIIIVFD
GSEVLWYYEG LENHLVCTWI YVTEGAPRLV HLRVKDRIPQ NTDIYMVKFG EYWVRISKTQ
YTQEIKKLIK KSKKKLPSIE EEDSDKHGGP PKGPEPPTGP GHSSSESKEH EDSKESKEPK
EHGSPKETKE GEVTKKPGPA KEHKPSKIPV YTKRPEFPKK SKSPKRPESP KSPKRPVSPQ
RPVSPKSPKR PESLDIPKSP KRPESPKSPK RPVSPQRPVS PRRPESPKSP KSPKSPKSPK
VPFDPKFKEK LYDSYLDKAA KTKETVTLPP VLPTDESFTH TPIGEPTAEQ PDDIEPIEES
VFIKETGILT EEVKTEDIHS ETGEPEEPKR PDSPTKHSPK PTGTHPSMPK KRRRSDGLAL
STTDLESEAG RILRDPTGKI VTMKRSKSFD DLTTVREKEH MGAEIRKIVV DDDGTEADDE
DTHPSKEKHL STVRRRRPRP KKSSKSSKPR KPDSAFVPSI GIL*
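
The entry layout described above (header line, description line, then sequence lines ending in '*') can be read with a short Python sketch. This is an illustration only and not the EMBOSS reader; the function name parse_pir and the dictionary keys are hypothetical, and the type code and ID are assumed to be separated by a semicolon as in the example.

```python
def parse_pir(text):
    """Return a list of records parsed from NBRF/PIR formatted text."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue
        type_code, _, entry_id = line[1:].partition(";")
        description = next(lines, "").strip()
        parts = []
        for seq_line in lines:
            cleaned = "".join(seq_line.split())  # drop spaces between blocks
            body, star, _ = cleaned.partition("*")
            parts.append(body)
            if star:
                break  # '*' marks the end of this sequence
        records.append({
            "type": type_code,
            "id": entry_id.strip(),
            "description": description,
            "sequence": "".join(parts),
        })
    return records


# Two made-up entries, for illustration only.
sample = """>P1;TEST1
First example
MKFLV LLFNI
LCLF*
>F1;TEST2
Second example
ACDE*
"""
for record in parse_pir(sample):
    print(record["type"], record["id"], record["sequence"])
```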

A.1.38. NEXUS / PAUP (interleaved)

The Nexus or PAUP format is used by the PAUP package of tools for inferring and interpreting phylogenetic trees. For further information see:

http://paup.csit.fsu.edu/

#NEXUS
[TITLE: Written by EMBOSS 29/07/09]

begin data;
dimensions ntax=4 nchar=131;
format interleave datatype=protein missing=X gap=-;

matrix
IXI_234              TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235              TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236              TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237              TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT

IXI_234              GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
IXI_235              GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAG
IXI_236              GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR--G
IXI_237              GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR--G

IXI_234              SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_235              SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_236              SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
IXI_237              SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE
;

end;
begin assumptions;
options deftype=unord;
end;

A.1.39. NEXUS / PAUP (non-interleaved)

The NEXUSNON or PAUPNON format is the non-interleaved version of the format used by the PAUP package of tools for inferring and interpreting phylogenetic trees. For further information see:

http://paup.csit.fsu.edu/

#NEXUS
[TITLE: Written by EMBOSS 29/07/09]

begin data;
dimensions ntax=4 nchar=131;
format datatype=protein missing=X gap=-;

matrix
IXI_234
TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQATGGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAGSRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_235
TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQATGGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAGSRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_236
TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQATGGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR--GSRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
IXI_237
TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQATGGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR--GSRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE
;

end;
begin assumptions;
options deftype=unord;
end;

A.1.40. PDB

The Protein Data Bank is a collection of 3D structures. PDB files contain protein sequences for one or more chains in a structure. The sequences appear in two forms, either as sequence residues in the SEQRES records (format pdbseq) or as structural elements in the ATOM records (format pdb). EMBOSS reads both versions. If the structure is incomplete, the SEQRES records will have the full sequence but parts will be missing from the ATOM records. For further information see:

http://www.wwpdb.org/

HEADER    HORMONE                                 10-MAY-82   2INS      2INS   3
COMPND    DES-*PHE B1 INSULIN                                           2INS   4
SOURCE    BOVINE (BOS TAURUS)                                           2INS   5
AUTHOR    G.D.SMITH,W.L.DUAX,E.J.DODSON,G.G.DODSON,R.A.G.$DE *GRAAF,    2INSB  1
AUTHOR   1 C.D.REYNOLDS                                                 2INS   7
REVDAT   7   31-MAY-84 2INSF   1       REMARK                           2INSF  1
REVDAT   6   31-JAN-84 2INSE   1       REMARK                           2INSE  1
REVDAT   5   27-OCT-83 2INSD   1       REMARK                           2INSD  1
REVDAT   4   30-SEP-83 2INSC   1       REVDAT                           2INSC  1
REVDAT   3   13-JUN-83 2INSB   1       AUTHOR JRNL                      2INSC  2
REVDAT   2   07-MAR-83 2INSA   3       JRNL   REMARK MTRIX              2INSC  3
REVDAT   1   05-AUG-82 2INS    0                                        2INSC  4
JRNL        AUTH   G.D.SMITH,W.L.DUAX,E.J.DODSON,G.G.DODSON,            2INS   8
JRNL        AUTH 2 R.A.G.$DE *GRAAF,C.D.REYNOLDS                        2INSB  2
JRNL        TITL   THE STRUCTURE OF DES-*PHE B1 BOVINE INSULIN          2INS  10
JRNL        REF    ACTA CRYSTALLOGR.,SECT.B      V.  38  3028 1982      2INSA  1
JRNL        REFN   ASTM ACBCAR  DK ISSN 0567-7408                  107  2INSA  2
REMARK   1                                                              2INS  13
REMARK   1 REFERENCE 1                                                  2INSD  2
REMARK   1  AUTH   J.BORDAS,G.G.DODSON,H.GREWE,M.H.J.KOCH,B.KREBS,      2INSD  3
REMARK   1  AUTH 2 J.RANDALL                                            2INSD  4
REMARK   1  TITL   A COMPARATIVE ASSESSMENT OF THE ZINC-PROTEIN         2INSD  5
REMARK   1  TITL 2 COORDINATION IN 2*ZN-INSULIN AS DETERMINED BY X-RAY  2INSD  6
REMARK   1  TITL 3 ABSORPTION FINE STRUCTURE (/EXAFS$) AND X-RAY        2INSD  7
REMARK   1  TITL 4 CRYSTALLOGRAPHY                                      2INSD  8
REMARK   1  REF    PROC.R.SOC.LONDON,SER.B       V. 219    21 1983      2INSD  9
REMARK   1  REFN   ASTM PRLBA4  UK ISSN 0080-4649                  338  2INSE  2

... < data omitted for brevity >

REMARK   1 REFERENCE 14                                                 2INSD 23
REMARK   1  EDIT   M.O.DAYHOFF                                          2INS  89
REMARK   1  REF    ATLAS OF PROTEIN SEQUENCE     V.   5   187 1972      2INS  90
REMARK   1  REF  2 AND STRUCTURE (DATA SECTION)                         2INS  91
REMARK   1  PUBL   NATIONAL BIOMEDICAL RESEARCH FOUNDATION,             2INS  92
REMARK   1  PUBL 2 SILVER SPRING,MD.                                    2INS  93
REMARK   1  REFN                   ISBN 0-912466-02-2              435  2INS  94
REMARK   2                                                              2INS  95
REMARK   2 RESOLUTION. 2.5 ANGSTROMS.                                   2INS  96
REMARK   3                                                              2INS  97
REMARK   3 REFINEMENT. FAST FOURIER LEAST-SQUARES REFINEMENT FOLLOWED   2INS  98
REMARK   3  BY EXAMINATION OF DIFFERENCE MAPS USING THE *MMS-X*         2INS  99
REMARK   3  GRAPHICS SYSTEM.  THE FINAL R IS 0.18 FOR 2128              2INS 100
REMARK   3  REFLECTIONS. THE RMS SHIFT AFTER REGULARIZATION IS 0.16     2INS 101
REMARK   3  ANGSTROMS AND THE DEVIATION OF A BOND FROM ITS IDEAL        2INS 102
REMARK   3  VALUE IS 0.20 ANGSTROMS. MODEL FIT PARAMETERS ARE SIGMA     2INS 103
REMARK   3  (BOND) = 0.02 ANGSTROMS AND SIGMA(ANGLE) = 3.0 DEGREES.     2INS 104
REMARK   4                                                              2INS 105

... < data omitted for brevity >

REMARK   7 BECAUSE THE COORDINATES OF THE SYMMETRY-RELATED ATOMS ARE    2INS 140
REMARK   7 NOT INCLUDED IN THIS ENTRY THE COMPLETE CONNECTIVITY OF      2INS 141
REMARK   7 ATOMS ZN1 AND ZN2 CANNOT BE SPECIFIED.  PARTIAL              2INS 142
REMARK   7 CONNECTIVITY IS GIVEN BY                                     2INS 143
REMARK   7          CONECT  229  227  228  791                          2INS 144
REMARK   7          CONECT  624  622  623  792                          2INS 145
REMARK   7          CONECT  791  247  826  ...  ...  ...  ...           2INS 146
REMARK   7          CONECT  792  624  942  ...  ...  ...  ...           2INS 147
REMARK   7          CONECT  826  791                                    2INS 148
REMARK   7          CONECT  942  792                                    2INS 149
REMARK   7            .                                                 2INS 150
REMARK   7            .                                                 2INS 151
REMARK   7            .                                                 2INS 152
REMARK   8                                                              2INS 153
REMARK   8 NO DENSITY WAS OBSERVED FOR TYR C 14 INDICATING THAT THIS    2INS 154
REMARK   8 SIDE CHAIN WAS MOVING FREELY.                                2INS 155
REMARK   9                                                              2INSA 20
REMARK   9 CORRECTION. CORRECT JOURNAL NAME FOR REFERENCES 2 AND 4.     2INSA 21
REMARK   9  UPDATE JRNL REFERENCE TO REFLECT PUBLICATION.  CORRECT      2INSA 22
REMARK   9  MTRIX TRANSFORMATION.  REVISE REMARKS 5 AND 7.  07-MAR-83.  2INSA 23
REMARK  10                                                              2INSB  3
REMARK  10 CORRECTION. INSERT TYPESETTING CODES.  13-JUN-83.            2INSB  4
REMARK  11                                                              2INSC  5
REMARK  11 CORRECTION. INSERT REVDAT RECORDS. 30-SEP-83.                2INSC  6
REMARK  12                                                              2INSD 24
REMARK  12 CORRECTION. ADD NEW PUBLICATION AS REFERENCE 1 AND           2INSD 25
REMARK  12  RENUMBER THE OTHERS.  27-OCT-83.                            2INSD 26
REMARK  13                                                              2INSE  3
REMARK  13 CORRECTION. INSERT MISSING CODEN FOR REFERENCE 1.            2INSE  4
REMARK  13  31-JAN-84.                                                  2INSE  5
REMARK  14                                                              2INSF  3
REMARK  14 CORRECTION. CORRECT ISSN FOR REFERENCE 9.  31-MAY-84.        2INSF  4
SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS ALA SER VAL CYS SER LEU  2INS 156
SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN                      2INS 157
SEQRES   1 B   29  VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU ALA  2INS 158
SEQRES   2 B   29  LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR THR  2INS 159
SEQRES   3 B   29  PRO LYS ALA                                          2INS 160
SEQRES   1 C   21  GLY ILE VAL GLU GLN CYS CYS ALA SER VAL CYS SER LEU  2INS 161
SEQRES   2 C   21  TYR GLN LEU GLU ASN TYR CYS ASN                      2INS 162
SEQRES   1 D   29  VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU ALA  2INS 163
SEQRES   2 D   29  LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR THR  2INS 164
SEQRES   3 D   29  PRO LYS ALA                                          2INS 165
FTNOTE   1                                                              2INS 166
FTNOTE   1 THE QUASI-TWO-FOLD SYMMETRY BREAKS DOWN MOST SERIOUSLY AT    2INS 167
FTNOTE   1 RESIDUES                                                     2INS 168
FTNOTE   1    GLY A  1 TO GLN A  5   AND   GLY C  1 TO GLN C  5         2INS 169
FTNOTE   1    HIS B  5               AND   HIS D  5                     2INS 170
FTNOTE   1    PHE B 25               AND   PHE D 25                     2INS 171
FTNOTE   2                                                              2INS 172
FTNOTE   2 THE FOLLOWING RESIDUES ARE DISORDERED - ARG B 22,            2INS 173
FTNOTE   2 LYS D 29.                                                    2INS 174
FTNOTE   3                                                              2INS 175
FTNOTE   3 SEE REMARK 8.                                                2INS 176
HET    ZN1      1       1     ZINC ION ON 3-FOLD CRYSTAL AXIS           2INS 177
HET    ZN2      2       1     ZINC ION ON 3-FOLD CRYSTAL AXIS           2INS 178
FORMUL   5  ZN1    ZN1 ++                                               2INS 179
FORMUL   6  ZN2    ZN1 ++                                               2INS 180
FORMUL   7  HOH   *184(H2 01)                                           2INS 181
HELIX    1 A11 GLY A    1  VAL A   10  1 NOT IDEAL ALPH,SOME PI CNTCTS  2INS 182
HELIX    2 A12 SER A   12  GLU A   17  5 NOT IDEAL 3(10)                2INS 183
HELIX    3 B11 SER B    9  GLY B   20  1 NOT IDEAL ALPH,3(10) CONTCTS   2INS 184
HELIX    4 A21 GLY C    1  VAL C   10  1 NOT IDEAL ALPH,SOME PI CNTCTS  2INS 185
HELIX    5 A22 SER C   12  GLU C   17  5 NOT IDEAL 3(10)                2INS 186
HELIX    6 B21 SER D    9  GLY D   20  1 NOT IDEAL ALPH,3(10) CONTCTS   2INS 187
SHEET    1   B 2 PHE B  24  TYR B  26  0                                2INS 188
SHEET    2   B 2 PHE D  24  TYR D  26 -1  O  TYR D  26   N  PHE B  24   2INS 189
TURN     1 1B1 CYS B  19  ARG B  22                                     2INS 190
TURN     2 1B2 GLY B  20  GLY B  23                                     2INS 191
TURN     3 2B1 CYS D  19  ARG D  22                                     2INS 192
TURN     4 2B2 GLY D  20  GLY D  23                                     2INS 193
SSBOND   1 CYS A    6    CYS A   11                                     2INS 194
SSBOND   2 CYS C    6    CYS C   11                                     2INS 195
SSBOND   3 CYS A    7    CYS B    7                                     2INS 196
SSBOND   4 CYS A   20    CYS B   19                                     2INS 197
SSBOND   5 CYS C    7    CYS D    7                                     2INS 198
SSBOND   6 CYS C   20    CYS D   19                                     2INS 199
SITE     1  D1  5 VAL B  12  TYR B  16  PHE B  24  PHE B  25            2INS 200
SITE     2  D1  5 TYR B  26                                             2INS 201
SITE     1  D2  5 VAL D  12  TYR D  16  PHE D  24  PHE D  25            2INS 202
SITE     2  D2  5 TYR D  26                                             2INS 203
SITE     1  H1  6 LEU A  13  TYR A  14  GLU B  13  ALA B  14            2INS 204
SITE     2  H1  6 LEU B  17  VAL B  18                                  2INS 205
SITE     1  H2  6 LEU C  13  TYR C  14  GLU D  13  ALA D  14            2INS 206
SITE     2  H2  6 LEU D  17  VAL D  18                                  2INS 207
SITE     1 SI1  7 GLY A   1  GLU A   4  GLN A   5  CYS A   7            2INS 208
SITE     2 SI1  7 TYR A  19  ASN A  21  CYS B   7                       2INS 209
SITE     1 SI2  7 GLY C   1  GLU C   4  GLN C   5  CYS C   7            2INS 210
SITE     2 SI2  7 TYR C  19  ASN C  21  CYS D   7                       2INS 211
CRYST1   81.600   81.600   34.000  90.00  90.00 120.00 R 3          18  2INS 212
ORIGX1       .012255   .007075  0.000000        0.00000                 2INS 213
ORIGX2      0.000000   .014151  0.000000        0.00000                 2INS 214
ORIGX3      0.000000  0.000000   .029412        0.00000                 2INS 215
SCALE1       .012255   .007075  0.000000        0.00000                 2INS 216
SCALE2      0.000000   .014151  0.000000        0.00000                 2INS 217
SCALE3      0.000000  0.000000   .029412        0.00000                 2INS 218
MTRIX1   1  -.880000  -.480000   .020000        0.00000    1            2INSA 24
MTRIX2   1  -.480000   .880000  -.020000        0.00000    1            2INSA 25
MTRIX3   1  -.010000  -.030000 -1.000000        0.00000    1            2INSA 26
END

A.1.41. PDB (nucleotide)

The Protein Data Bank is a collection of 3D structures. PDB files may contain nucleotide sequences for one or more nucleic acid molecules. Each sequence is available in two forms: as residues listed in the SEQRES records (format pdbnucseq) or as structural elements in the ATOM records (format pdbnuc). EMBOSS reads both forms. If the structure is incomplete, the SEQRES records will hold the full sequence but parts will be missing from the ATOM records. For further information see:

http://www.wwpdb.org/
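
As a rough illustration of the SEQRES form described above, the following Python sketch collects the nucleotide chains from the SEQRES records of a PDB file. It is not the EMBOSS parser: the function name seqres_nucleotides and the NUC_CODES table are hypothetical, introduced only for this example. It assumes the fixed-column SEQRES layout (chain identifier in column 12, residue names in columns 20 to 70).

```python
from collections import OrderedDict

# Assumed mapping from PDB residue names to one-letter bases
# (DNA residues are written DA/DC/DG/DT, RNA residues A/C/G/U).
NUC_CODES = {"DA": "a", "DC": "c", "DG": "g", "DT": "t",
             "A": "a", "C": "c", "G": "g", "U": "u"}


def seqres_nucleotides(lines):
    """Return {chain_id: sequence} for chains made only of nucleotides.

    Hypothetical helper for illustration; it reads SEQRES records only
    and ignores ATOM records entirely.
    """
    chains = OrderedDict()
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        chain_id = line[11]               # column 12: chain identifier
        residues = line[19:70].split()    # columns 20-70: residue names
        chains.setdefault(chain_id, []).extend(residues)

    result = OrderedDict()
    for chain_id, residues in chains.items():
        # Skip protein chains (MET, LYS, ...) and anything unrecognised.
        if residues and all(r in NUC_CODES for r in residues):
            result[chain_id] = "".join(NUC_CODES[r] for r in residues)
    return result
```

Given the SEQRES records of the 1A02 entry shown in this section, such a function would keep the two DNA chains A and B and skip the protein chains N, F and J. A sequence built from ATOM records instead would have to group atoms by chain and residue number, and would omit any residues not located in the experiment.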
HEADER    TRANSCRIPTION/DNA                       08-DEC-97   1A02              
TITLE     STRUCTURE OF THE DNA BINDING DOMAINS OF NFAT, FOS AND JUN             
TITLE    2 BOUND TO DNA                                                         
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: DNA (5'-                                                   
COMPND   3 D(*DTP*DTP*DGP*DGP*DAP*DAP*DAP*DAP*DTP*DTP*DTP*DGP*DTP*DTP*          
COMPND   4 DTP*DCP*DAP*DTP*DAP*DG)-3');                                         
COMPND   5 CHAIN: A;                                                            
COMPND   6 ENGINEERED: YES;                                                     
COMPND   7 MOL_ID: 2;                                                           

... < data omitted for brevity >

COMPND  26 MOLECULE: AP-1 FRAGMENT JUN;                                         
COMPND  27 CHAIN: J;                                                            
COMPND  28 FRAGMENT: JUN;                                                       
COMPND  29 SYNONYM: JUN;                                                        
COMPND  30 ENGINEERED: YES;                                                     
COMPND  31 MUTATION: YES                                                        
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 SYNTHETIC: YES;                                                      
SOURCE   3 MOL_ID: 2;                                                           
SOURCE   4 SYNTHETIC: YES;                                                      
SOURCE   5 MOL_ID: 3;                                                           
SOURCE   6 ORGANISM_SCIENTIFIC: HOMO SAPIENS;                                   
SOURCE   7 ORGANISM_COMMON: HUMAN;                                              

... < data omitted for brevity >

SOURCE  24 ORGANISM_TAXID: 9606;                                                
SOURCE  25 EXPRESSION_SYSTEM: ESCHERICHIA COLI;                                 
SOURCE  26 EXPRESSION_SYSTEM_TAXID: 562                                         
KEYWDS    TRANSCRIPTION FACTOR, NFAT, NF-AT, AP-1, FOS-JUN,                     
KEYWDS   2 QUATERNARY PROTEIN-DNA COMPLEX, TRANSCRIPTION SYNERGY,               
KEYWDS   3 COMBINATORIAL GENE REGULATION, TRANSCRIPTION/DNA COMPLEX             
EXPDTA    X-RAY DIFFRACTION                                                     
AUTHOR    L.CHEN,J.N.M.GLOVER,P.G.HOGAN,A.RAO,S.C.HARRISON                      
REVDAT   2   24-FEB-09 1A02    1       VERSN                                    
REVDAT   1   27-MAY-98 1A02    0                                                
JRNL        AUTH   L.CHEN,J.N.GLOVER,P.G.HOGAN,A.RAO,S.C.HARRISON               
JRNL        TITL   STRUCTURE OF THE DNA-BINDING DOMAINS FROM NFAT,              
JRNL        TITL 2 FOS AND JUN BOUND SPECIFICALLY TO DNA.                       
JRNL        REF    NATURE                        V. 392    42 1998              
JRNL        REFN                   ISSN 0028-0836                               
JRNL        PMID   9510247                                                      
JRNL        DOI    10.1038/32100                                                
REMARK   1                                                                      
REMARK   2                                                                      
REMARK   2 RESOLUTION.    2.70 ANGSTROMS.                                       
REMARK   3                                                                      
REMARK   3 REFINEMENT.                                                          
REMARK   3   PROGRAM     : X-PLOR 3.1                                           
REMARK   3   AUTHORS     : BRUNGER                                              
REMARK   3                                                                      
REMARK   3  DATA USED IN REFINEMENT.                                            
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 2.70                           
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 10.00                          
REMARK   3   DATA CUTOFF            (SIGMA(F)) : 2.000                          
REMARK   3   DATA CUTOFF HIGH         (ABS(F)) : 10000000.000                   
REMARK   3   DATA CUTOFF LOW          (ABS(F)) : 0.1000                         
REMARK   3   COMPLETENESS (WORKING+TEST)   (%) : 90.1                           
REMARK   3   NUMBER OF REFLECTIONS             : 21643                          
REMARK   3                                                                      
REMARK   3  FIT TO DATA USED IN REFINEMENT.                                     
REMARK   3   CROSS-VALIDATION METHOD          : THROUGHOUT                      
REMARK   3   FREE R VALUE TEST SET SELECTION  : RANDOM                          
REMARK   3   R VALUE            (WORKING SET) : 0.246                           
REMARK   3   FREE R VALUE                     : 0.303                           
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 7.500                           
REMARK   3   FREE R VALUE TEST SET COUNT      : 1671                            
REMARK   3   ESTIMATED ERROR OF FREE R VALUE  : 0.010                           
REMARK   3                                                                      
REMARK   3  FIT IN THE HIGHEST RESOLUTION BIN.                                  
REMARK   3   TOTAL NUMBER OF BINS USED           : 8                            
REMARK   3   BIN RESOLUTION RANGE HIGH       (A) : 2.70                         
REMARK   3   BIN RESOLUTION RANGE LOW        (A) : 2.82                         
REMARK   3   BIN COMPLETENESS (WORKING+TEST) (%) : 69.40                        
REMARK   3   REFLECTIONS IN BIN    (WORKING SET) : 1784                         
REMARK   3   BIN R VALUE           (WORKING SET) : 0.3690                       
REMARK   3   BIN FREE R VALUE                    : 0.3690                       
REMARK   3   BIN FREE R VALUE TEST SET SIZE  (%) : 6.60                         
REMARK   3   BIN FREE R VALUE TEST SET COUNT     : 188                          
REMARK   3   ESTIMATED ERROR OF BIN FREE R VALUE : 0.020                        
REMARK   3                                                                      
REMARK   3  NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.                    
REMARK   3   PROTEIN ATOMS            : 3073                                    
REMARK   3   NUCLEIC ACID ATOMS       : 814                                     
REMARK   3   HETEROGEN ATOMS          : 0                                       
REMARK   3   SOLVENT ATOMS            : 88                                      
REMARK   3                                                                      
REMARK   3  B VALUES.                                                           
REMARK   3   FROM WILSON PLOT           (A**2) : 61.00                          
REMARK   3   MEAN B VALUE      (OVERALL, A**2) : 51.00                          
REMARK   3   OVERALL ANISOTROPIC B VALUE.                                       
REMARK   3    B11 (A**2) : NULL                                                 
REMARK   3    B22 (A**2) : NULL                                                 
REMARK   3    B33 (A**2) : NULL                                                 
REMARK   3    B12 (A**2) : NULL                                                 
REMARK   3    B13 (A**2) : NULL                                                 
REMARK   3    B23 (A**2) : NULL                                                 
REMARK   3                                                                      
REMARK   3  ESTIMATED COORDINATE ERROR.                                         
REMARK   3   ESD FROM LUZZATI PLOT        (A) : NULL                            
REMARK   3   ESD FROM SIGMAA              (A) : NULL                            
REMARK   3   LOW RESOLUTION CUTOFF        (A) : 10.00                           
REMARK   3                                                                      
REMARK   3  CROSS-VALIDATED ESTIMATED COORDINATE ERROR.                         
REMARK   3   ESD FROM C-V LUZZATI PLOT    (A) : NULL                            
REMARK   3   ESD FROM C-V SIGMAA          (A) : NULL                            
REMARK   3                                                                      
REMARK   3  RMS DEVIATIONS FROM IDEAL VALUES.                                   
REMARK   3   BOND LENGTHS                 (A) : NULL                            
REMARK   3   BOND ANGLES            (DEGREES) : NULL                            
REMARK   3   DIHEDRAL ANGLES        (DEGREES) : NULL                            
REMARK   3   IMPROPER ANGLES        (DEGREES) : NULL                            
REMARK   3                                                                      
REMARK   3  ISOTROPIC THERMAL MODEL : RESTRAINED                                
REMARK   3                                                                      
REMARK   3  ISOTROPIC THERMAL FACTOR RESTRAINTS.    RMS    SIGMA                
REMARK   3   MAIN-CHAIN BOND              (A**2) : 1.500 ; 1.500                
REMARK   3   MAIN-CHAIN ANGLE             (A**2) : 2.000 ; 2.000                
REMARK   3   SIDE-CHAIN BOND              (A**2) : 2.000 ; 2.000                
REMARK   3   SIDE-CHAIN ANGLE             (A**2) : 2.500 ; 2.500                
REMARK   3                                                                      
REMARK   3  NCS MODEL : NULL                                                    
REMARK   3                                                                      
REMARK   3  NCS RESTRAINTS.                         RMS   SIGMA/WEIGHT          
REMARK   3   GROUP  1  POSITIONAL            (A) : NULL  ; NULL                 
REMARK   3   GROUP  1  B-FACTOR           (A**2) : NULL  ; NULL                 
REMARK   3                                                                      
REMARK   3  PARAMETER FILE  1  : PARHCSDX.PRO                                   
REMARK   3  PARAMETER FILE  2  : PARAM_NDBX.DNA                                 
REMARK   3  PARAMETER FILE  3  : PARAM_NDBX.INT                                 
REMARK   3  PARAMETER FILE  4  : PARAM19.SOL                                    
REMARK   3  PARAMETER FILE  5  : NULL                                           
REMARK   3  TOPOLOGY FILE  1   : TOPHCSDX.PRO                                   
REMARK   3  TOPOLOGY FILE  2   : TOP_NDBX.DNA                                   
REMARK   3  TOPOLOGY FILE  3   : TOPH19.SOL                                     
REMARK   3  TOPOLOGY FILE  4   : NULL                                           
REMARK   3  TOPOLOGY FILE  5   : NULL                                           
REMARK   3                                                                      
REMARK   3  OTHER REFINEMENT REMARKS: RESIDUES N 478 - N 485 AND N 628 - N      
REMARK   3  634 ARE DISORDERED                                                  
REMARK   4                                                                      
REMARK   4 1A02 COMPLIES WITH FORMAT V. 3.15, 01-DEC-08                         
REMARK 100                                                                      
REMARK 100 THIS ENTRY HAS BEEN PROCESSED BY BNL.                                
REMARK 200                                                                      
REMARK 200 EXPERIMENTAL DETAILS                                                 
REMARK 200  EXPERIMENT TYPE                : X-RAY DIFFRACTION                  
REMARK 200  DATE OF DATA COLLECTION        : 16-SEP-96                          
REMARK 200  TEMPERATURE           (KELVIN) : 100.00                             
REMARK 200  PH                             : 7.5                                
REMARK 200  NUMBER OF CRYSTALS USED        : 1                                  
REMARK 200                                                                      
REMARK 200  SYNCHROTRON              (Y/N) : Y                                  
REMARK 200  RADIATION SOURCE               : NSLS                               
REMARK 200  BEAMLINE                       : X25                                
REMARK 200  X-RAY GENERATOR MODEL          : NULL                               
REMARK 200  MONOCHROMATIC OR LAUE    (M/L) : M                                  
REMARK 200  WAVELENGTH OR RANGE        (A) : NULL                               
REMARK 200  MONOCHROMATOR                  : NULL                               
REMARK 200  OPTICS                         : NULL                               
REMARK 200                                                                      
REMARK 200  DETECTOR TYPE                  : IMAGE PLATE                        
REMARK 200  DETECTOR MANUFACTURER          : MARRESEARCH                        
REMARK 200  INTENSITY-INTEGRATION SOFTWARE : DENZO                              
REMARK 200  DATA SCALING SOFTWARE          : SCALEPACK                          
REMARK 200                                                                      
REMARK 200  NUMBER OF UNIQUE REFLECTIONS   : 22079                              
REMARK 200  RESOLUTION RANGE HIGH      (A) : 2.700                              
REMARK 200  RESOLUTION RANGE LOW       (A) : 20.000                             
REMARK 200  REJECTION CRITERIA  (SIGMA(I)) : 0.000                              
REMARK 200                                                                      
REMARK 200 OVERALL.                                                             
REMARK 200  COMPLETENESS FOR RANGE     (%) : 98.3                               
REMARK 200  DATA REDUNDANCY                : 3.100                              
REMARK 200  R MERGE                    (I) : NULL                               
REMARK 200  R SYM                      (I) : 0.08000                            
REMARK 200  <I/SIGMA(I)> FOR THE DATA SET  : NULL                               
REMARK 200                                                                      
REMARK 200 IN THE HIGHEST RESOLUTION SHELL.                                     
REMARK 200  HIGHEST RESOLUTION SHELL, RANGE HIGH (A) : 2.70                     
REMARK 200  HIGHEST RESOLUTION SHELL, RANGE LOW  (A) : 2.80                     
REMARK 200  COMPLETENESS FOR SHELL     (%) : 93.3                               
REMARK 200  DATA REDUNDANCY IN SHELL       : 2.70                               
REMARK 200  R MERGE FOR SHELL          (I) : NULL                               
REMARK 200  R SYM FOR SHELL            (I) : 0.43000                            
REMARK 200  <I/SIGMA(I)> FOR SHELL         : NULL                               
REMARK 200                                                                      
REMARK 200 DIFFRACTION PROTOCOL: SINGLE WAVELENGTH                              
REMARK 200 METHOD USED TO DETERMINE THE STRUCTURE: MIR/MAD                      
REMARK 200 SOFTWARE USED: CCP4, X-PLOR                                          
REMARK 200 STARTING MODEL: NULL                                                 
REMARK 200                                                                      
REMARK 200 REMARK: NULL                                                         
REMARK 280                                                                      
REMARK 280 CRYSTAL                                                              
REMARK 280 SOLVENT CONTENT, VS   (%): 68.00                                     
REMARK 280 MATTHEWS COEFFICIENT, VM (ANGSTROMS**3/DA): 3.54                     
REMARK 280                                                                      
REMARK 280 CRYSTALLIZATION CONDITIONS: THE COMPLEX WAS CRYSTALLIZED IN          
REMARK 280  300-400 MM AMMONIUM ACETATE SALT, PH 7.5 (10 MM)., VAPOR            
REMARK 280  DIFFUSION, HANGING DROP                                             
REMARK 290                                                                      
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY                                            
REMARK 290 SYMMETRY OPERATORS FOR SPACE GROUP: P 1 21 1                         
REMARK 290                                                                      
REMARK 290      SYMOP   SYMMETRY                                                
REMARK 290     NNNMMM   OPERATOR                                                
REMARK 290       1555   X,Y,Z                                                   
REMARK 290       2555   -X,Y+1/2,-Z                                             
REMARK 290                                                                      
REMARK 290     WHERE NNN -> OPERATOR NUMBER                                     
REMARK 290           MMM -> TRANSLATION VECTOR                                  
REMARK 290                                                                      
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY TRANSFORMATIONS                            
REMARK 290 THE FOLLOWING TRANSFORMATIONS OPERATE ON THE ATOM/HETATM             
REMARK 290 RECORDS IN THIS ENTRY TO PRODUCE CRYSTALLOGRAPHICALLY                
REMARK 290 RELATED MOLECULES.                                                   
REMARK 290   SMTRY1   1  1.000000  0.000000  0.000000        0.00000            
REMARK 290   SMTRY2   1  0.000000  1.000000  0.000000        0.00000            
REMARK 290   SMTRY3   1  0.000000  0.000000  1.000000        0.00000            
REMARK 290   SMTRY1   2 -1.000000  0.000000  0.000000        0.00000            
REMARK 290   SMTRY2   2  0.000000  1.000000  0.000000       42.73000            
REMARK 290   SMTRY3   2  0.000000  0.000000 -1.000000        0.00000            
REMARK 290                                                                      
REMARK 290 REMARK: NULL                                                         
REMARK 300                                                                      
REMARK 300 BIOMOLECULE: 1                                                       
REMARK 300 SEE REMARK 350 FOR THE AUTHOR PROVIDED AND/OR PROGRAM                
REMARK 300 GENERATED ASSEMBLY INFORMATION FOR THE STRUCTURE IN                  
REMARK 300 THIS ENTRY. THE REMARK MAY ALSO PROVIDE INFORMATION ON               
REMARK 300 BURIED SURFACE AREA.                                                 
REMARK 350                                                                      
REMARK 350 COORDINATES FOR A COMPLETE MULTIMER REPRESENTING THE KNOWN           
REMARK 350 BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE                
REMARK 350 MOLECULE CAN BE GENERATED BY APPLYING BIOMT TRANSFORMATIONS          
REMARK 350 GIVEN BELOW.  BOTH NON-CRYSTALLOGRAPHIC AND                          
REMARK 350 CRYSTALLOGRAPHIC OPERATIONS ARE GIVEN.                               
REMARK 350                                                                      
REMARK 350 BIOMOLECULE: 1                                                       
REMARK 350 AUTHOR DETERMINED BIOLOGICAL UNIT: PENTAMERIC                        
REMARK 350 SOFTWARE DETERMINED QUATERNARY STRUCTURE: PENTAMERIC                 
REMARK 350 SOFTWARE USED: PISA                                                  
REMARK 350 TOTAL BURIED SURFACE AREA: 9500 ANGSTROM**2                          
REMARK 350 SURFACE AREA OF THE COMPLEX: 26430 ANGSTROM**2                       
REMARK 350 CHANGE IN SOLVENT FREE ENERGY: -43.0 KCAL/MOL                        
REMARK 350 APPLY THE FOLLOWING TO CHAINS: N, A, B, F, J                         
REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000            
REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000            
REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000            
REMARK 465                                                                      
REMARK 465 MISSING RESIDUES                                                     
REMARK 465 THE FOLLOWING RESIDUES WERE NOT LOCATED IN THE                       
REMARK 465 EXPERIMENT. (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN               
REMARK 465 IDENTIFIER; SSSEQ=SEQUENCE NUMBER; I=INSERTION CODE.)                
REMARK 465                                                                      
REMARK 465   M RES C SSSEQI                                                     
REMARK 465     MET N   378                                                      
REMARK 465     ARG N   379                                                      
REMARK 465     GLY N   380                                                      
REMARK 465     SER N   381                                                      
REMARK 465     HIS N   382                                                      
REMARK 465     HIS N   383                                                      
REMARK 465     HIS N   384                                                      
REMARK 465     HIS N   385                                                      
REMARK 465     HIS N   386                                                      
REMARK 465     HIS N   387                                                      
REMARK 465     THR N   388                                                      
REMARK 465     ASP N   389                                                      
REMARK 465     PRO N   390                                                      
REMARK 465     HIS N   391                                                      
REMARK 465     ALA N   392                                                      
REMARK 465     SER N   393                                                      
REMARK 465     SER N   394                                                      
REMARK 465     VAL N   395                                                      
REMARK 465     PRO N   396                                                      
REMARK 465     LEU N   397                                                      
REMARK 465     GLU N   398                                                      
REMARK 465     MET F   138                                                      
REMARK 465     LYS F   139                                                      
REMARK 465     LEU F   193                                                      
REMARK 465     MET J   263                                                      
REMARK 465     LYS J   264                                                      
REMARK 465     ALA J   265                                                      
REMARK 465     GLU J   266                                                      
REMARK 470                                                                      
REMARK 470 MISSING ATOM                                                         
REMARK 470 THE FOLLOWING RESIDUES HAVE MISSING ATOMS(M=MODEL NUMBER;            
REMARK 470 RES=RESIDUE NAME; C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER;          
REMARK 470 I=INSERTION CODE):                                                   
REMARK 470   M RES CSSEQI  ATOMS                                                
REMARK 470     ARG N 478    CG   CD   NE   CZ   NH1  NH2                        
REMARK 470     ILE N 479    CG1  CG2  CD1                                       
REMARK 470     THR N 480    OG1  CG2                                            
REMARK 470     THR N 483    OG1  CG2                                            
REMARK 470     VAL N 484    CG1  CG2                                            
REMARK 470     THR N 485    OG1  CG2                                            
REMARK 470     ASP N 629    CG   OD1  OD2                                       
REMARK 470     LYS N 630    CG   CD   CE   NZ                                   
REMARK 470     ASP N 631    CG   OD1  OD2                                       
REMARK 470     LYS N 632    CG   CD   CE   NZ                                   
REMARK 470     SER N 633    OG                                                  
REMARK 470     GLN N 634    CG   CD   OE1  NE2                                  
REMARK 500                                                                      
REMARK 500 GEOMETRY AND STEREOCHEMISTRY                                         
REMARK 500 SUBTOPIC: COVALENT BOND ANGLES                                       
REMARK 500                                                                      
REMARK 500 THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES              
REMARK 500 HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE               
REMARK 500 THAN 6*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN               
REMARK 500 IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).                 
REMARK 500                                                                      
REMARK 500 STANDARD TABLE:                                                      
REMARK 500 FORMAT: (10X,I3,1X,A3,1X,A1,I4,A1,3(1X,A4,2X),12X,F5.1)              
REMARK 500                                                                      
REMARK 500 EXPECTED VALUES PROTEIN: ENGH AND HUBER, 1999                        
REMARK 500 EXPECTED VALUES NUCLEIC ACID: CLOWNEY ET AL 1996                     
REMARK 500                                                                      
REMARK 500  M RES CSSEQI ATM1   ATM2   ATM3                                     
REMARK 500     DA A4005   O4' -  C1' -  N9  ANGL. DEV. =   2.1 DEGREES          
REMARK 500     DG A4020   N9  -  C1' -  C2' ANGL. DEV. =   8.6 DEGREES          
REMARK 500     DA B5008   O4' -  C1' -  N9  ANGL. DEV. =   1.9 DEGREES          
REMARK 500     DC B5011   C3' -  C2' -  C1' ANGL. DEV. =  -5.1 DEGREES          
REMARK 500     DC B5011   N1  -  C1' -  C2' ANGL. DEV. =   9.7 DEGREES          
REMARK 500     DC B5011   O4' -  C1' -  N1  ANGL. DEV. =   3.0 DEGREES          
REMARK 500    ARG N 411   NE  -  CZ  -  NH2 ANGL. DEV. =   3.6 DEGREES          
REMARK 500    ARG N 466   NE  -  CZ  -  NH2 ANGL. DEV. =   3.3 DEGREES          
REMARK 500    ARG J 282   NE  -  CZ  -  NH2 ANGL. DEV. =   3.6 DEGREES          
REMARK 500                                                                      
REMARK 500 REMARK: NULL                                                         
REMARK 500                                                                      
REMARK 500 GEOMETRY AND STEREOCHEMISTRY                                         
REMARK 500 SUBTOPIC: TORSION ANGLES                                             
REMARK 500                                                                      
REMARK 500 TORSION ANGLES OUTSIDE THE EXPECTED RAMACHANDRAN REGIONS:            
REMARK 500 (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN IDENTIFIER;               
REMARK 500 SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).                             
REMARK 500                                                                      
REMARK 500 STANDARD TABLE:                                                      
REMARK 500 FORMAT:(10X,I3,1X,A3,1X,A1,I4,A1,4X,F7.2,3X,F7.2)                    
REMARK 500                                                                      
REMARK 500 EXPECTED VALUES: GJ KLEYWEGT AND TA JONES (1996). PHI/PSI-           
REMARK 500 CHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400            
REMARK 500                                                                      
REMARK 500  M RES CSSEQI        PSI       PHI                                   
REMARK 500    SER N 405      100.91    -51.63                                   
REMARK 500    GLU N 409       90.38     70.97                                   
REMARK 500    HIS N 420       86.75   -161.88                                   
REMARK 500    ASN N 451       53.96   -146.42                                   
REMARK 500    THR N 462     -172.45    -63.42                                   
REMARK 500    ALA N 463     -146.20   -158.78                                   
REMARK 500    ASP N 464     -162.81     60.49                                   
REMARK 500    THR N 480       -0.98    -52.59                                   
REMARK 500    LYS N 482     -162.09   -110.34                                   
REMARK 500    ASN N 495       -8.57     86.88                                   
REMARK 500    ARG N 537      104.93    -41.22                                   
REMARK 500    CYS N 588      169.94    174.80                                   
REMARK 500    VAL N 590      -30.50    -32.58                                   
REMARK 500    THR N 604     -141.10    -89.57                                   
REMARK 500    ASP N 631      -53.08   -172.49                                   
REMARK 500    SER N 633     -104.65   -168.34                                   
REMARK 500    PRO N 635       13.91    -61.60                                   
REMARK 500    ASN N 636       31.15   -155.42                                   
REMARK 500    LYS N 664      -42.18   -147.02                                   
REMARK 500                                                                      
REMARK 500 REMARK: NULL                                                         
DBREF  1A02 N  396   678  UNP    Q13469   NFAC2_HUMAN    396    678             
DBREF  1A02 F  138   193  UNP    P01100   FOS_HUMAN      138    193             
DBREF  1A02 J  263   318  UNP    P05412   AP1_HUMAN      253    308             
DBREF  1A02 A 4001  4020  PDB    1A02     1A02          4001   4020             
DBREF  1A02 B 5001  5020  PDB    1A02     1A02          5001   5020             
SEQADV 1A02 MET F  138  UNP  P01100    GLU   138 ENGINEERED                     
SEQADV 1A02 SER F  154  UNP  P01100    CYS   154 ENGINEERED                     
SEQADV 1A02 MET J  263  UNP  P05412    ILE   253 ENGINEERED                     
SEQADV 1A02 SER J  279  UNP  P05412    CYS   269 ENGINEERED                     
SEQRES   1 A   20   DT  DT  DG  DG  DA  DA  DA  DA  DT  DT  DT  DG  DT          
SEQRES   2 A   20   DT  DT  DC  DA  DT  DA  DG                                  
SEQRES   1 B   20   DA  DA  DC  DT  DA  DT  DG  DA  DA  DA  DC  DA  DA          
SEQRES   2 B   20   DA  DT  DT  DT  DT  DC  DC                                  
SEQRES   1 N  301  MET ARG GLY SER HIS HIS HIS HIS HIS HIS THR ASP PRO          
SEQRES   2 N  301  HIS ALA SER SER VAL PRO LEU GLU TRP PRO LEU SER SER          
SEQRES   3 N  301  GLN SER GLY SER TYR GLU LEU ARG ILE GLU VAL GLN PRO          
SEQRES   4 N  301  LYS PRO HIS HIS ARG ALA HIS TYR GLU THR GLU GLY SER          
SEQRES   5 N  301  ARG GLY ALA VAL LYS ALA PRO THR GLY GLY HIS PRO VAL          
SEQRES   6 N  301  VAL GLN LEU HIS GLY TYR MET GLU ASN LYS PRO LEU GLY          
SEQRES   7 N  301  LEU GLN ILE PHE ILE GLY THR ALA ASP GLU ARG ILE LEU          
SEQRES   8 N  301  LYS PRO HIS ALA PHE TYR GLN VAL HIS ARG ILE THR GLY          
SEQRES   9 N  301  LYS THR VAL THR THR THR SER TYR GLU LYS ILE VAL GLY          
SEQRES  10 N  301  ASN THR LYS VAL LEU GLU ILE PRO LEU GLU PRO LYS ASN          
SEQRES  11 N  301  ASN MET ARG ALA THR ILE ASP CYS ALA GLY ILE LEU LYS          
SEQRES  12 N  301  LEU ARG ASN ALA ASP ILE GLU LEU ARG LYS GLY GLU THR          
SEQRES  13 N  301  ASP ILE GLY ARG LYS ASN THR ARG VAL ARG LEU VAL PHE          
SEQRES  14 N  301  ARG VAL HIS ILE PRO GLU SER SER GLY ARG ILE VAL SER          
SEQRES  15 N  301  LEU GLN THR ALA SER ASN PRO ILE GLU CYS SER GLN ARG          
SEQRES  16 N  301  SER ALA HIS GLU LEU PRO MET VAL GLU ARG GLN ASP THR          
SEQRES  17 N  301  ASP SER CYS LEU VAL TYR GLY GLY GLN GLN MET ILE LEU          
SEQRES  18 N  301  THR GLY GLN ASN PHE THR SER GLU SER LYS VAL VAL PHE          
SEQRES  19 N  301  THR GLU LYS THR THR ASP GLY GLN GLN ILE TRP GLU MET          
SEQRES  20 N  301  GLU ALA THR VAL ASP LYS ASP LYS SER GLN PRO ASN MET          
SEQRES  21 N  301  LEU PHE VAL GLU ILE PRO GLU TYR ARG ASN LYS HIS ILE          
SEQRES  22 N  301  ARG THR PRO VAL LYS VAL ASN PHE TYR VAL ILE ASN GLY          
SEQRES  23 N  301  LYS ARG LYS ARG SER GLN PRO GLN HIS PHE THR TYR HIS          
SEQRES  24 N  301  PRO VAL                                                      
SEQRES   1 F   56  MET LYS ARG ARG ILE ARG ARG GLU ARG ASN LYS MET ALA          
SEQRES   2 F   56  ALA ALA LYS SER ARG ASN ARG ARG ARG GLU LEU THR ASP          
SEQRES   3 F   56  THR LEU GLN ALA GLU THR ASP GLN LEU GLU ASP GLU LYS          
SEQRES   4 F   56  SER ALA LEU GLN THR GLU ILE ALA ASN LEU LEU LYS GLU          
SEQRES   5 F   56  LYS GLU LYS LEU                                              
SEQRES   1 J   56  MET LYS ALA GLU ARG LYS ARG MET ARG ASN ARG ILE ALA          
SEQRES   2 J   56  ALA SER LYS SER ARG LYS ARG LYS LEU GLU ARG ILE ALA          
SEQRES   3 J   56  ARG LEU GLU GLU LYS VAL LYS THR LEU LYS ALA GLN ASN          
SEQRES   4 J   56  SER GLU LEU ALA SER THR ALA ASN MET LEU ARG GLU GLN          
SEQRES   5 J   56  VAL ALA GLN LEU                                              
FORMUL   6  HOH   *88(H2 O)                                                     
HELIX    1   1 ARG N  522  GLU N  527  1                                   6    
HELIX    2   2 SER N  570  LEU N  577  1                                   8    
HELIX    3   3 ARG F  140  LYS F  192  1                                  53    
HELIX    4   4 ARG J  267  GLN J  317  1                                  51    
SHEET    1   A 3 LEU N 410  VAL N 414  0                                        
SHEET    2   A 3 VAL N 442  LEU N 445 -1  O  VAL N 442   N  GLU N 413           
SHEET    3   A 3 ARG N 510  THR N 512 -1  O  ALA N 511   N  VAL N 443           
SHEET    1   B 5 GLU N 490  ILE N 492  0                                        
SHEET    2   B 5 VAL N 498  LEU N 503 -1  N  VAL N 498   O  ILE N 492           
SHEET    3   B 5 LEU N 454  GLY N 461 -1  O  LEU N 454   N  LEU N 503           
SHEET    4   B 5 ARG N 541  PRO N 551 -1  O  ARG N 543   N  GLY N 461           
SHEET    5   B 5 ILE N 557  ALA N 563 -1  O  VAL N 558   N  ILE N 550           
SHEET    1   C 5 GLU N 490  ILE N 492  0                                        
SHEET    2   C 5 VAL N 498  LEU N 503 -1  N  VAL N 498   O  ILE N 492           
SHEET    3   C 5 LEU N 454  GLY N 461 -1  O  LEU N 454   N  LEU N 503           
SHEET    4   C 5 ARG N 541  PRO N 551 -1  O  ARG N 543   N  GLY N 461           
SHEET    5   C 5 ILE N 567  GLU N 568 -1  N  ILE N 567   O  VAL N 542           
SHEET    1   D 2 TYR N 474  ARG N 478  0                                        
SHEET    2   D 2 ALA N 516  LYS N 520 -1  N  GLY N 517   O  HIS N 477           
SHEET    1   E 4 MET N 579  GLN N 583  0                                        
SHEET    2   E 4 GLN N 595  GLN N 601 -1  O  THR N 599   N  GLU N 581           
SHEET    3   E 4 MET N 637  GLU N 641 -1  O  LEU N 638   N  LEU N 598           
SHEET    4   E 4 VAL N 628  GLN N 634 -1  N  ASP N 629   O  PHE N 639           
SHEET    1   F 4 GLN N 620  ALA N 626  0                                        
SHEET    2   F 4 LYS N 608  LYS N 614 -1  O  VAL N 609   N  ALA N 626           
SHEET    3   F 4 VAL N 654  ASN N 662 -1  O  ASN N 657   N  THR N 612           
SHEET    4   F 4 LYS N 666  ARG N 667 -1  O  LYS N 666   N  ASN N 662           
SHEET    1   G 5 GLN N 620  ALA N 626  0                                        
SHEET    2   G 5 LYS N 608  LYS N 614 -1  O  VAL N 609   N  ALA N 626           
SHEET    3   G 5 VAL N 654  ASN N 662 -1  O  ASN N 657   N  THR N 612           
SHEET    4   G 5 GLN N 671  HIS N 676 -1  N  GLN N 671   O  PHE N 658           
SHEET    5   G 5 CYS N 588  LEU N 589  1  O  CYS N 588   N  HIS N 676           
CRYST1   64.660   85.460   83.370  90.00 112.03  90.00 P 1 21 1      2          
ORIGX1      1.000000  0.000000  0.000000        0.00000                         
ORIGX2      0.000000  1.000000  0.000000        0.00000                         
ORIGX3      0.000000  0.000000  1.000000        0.00000                         
SCALE1      0.015466  0.000000  0.006258        0.00000                         
SCALE2      0.000000  0.011701  0.000000        0.00000                         
SCALE3      0.000000  0.000000  0.012939        0.00000                         
ATOM      1  O5'  DT A4001       4.203  37.609  50.803  1.00 52.72           O  
ATOM      2  C5'  DT A4001       3.376  36.889  51.712  1.00 52.47           C  
ATOM      3  C4'  DT A4001       2.606  35.733  51.110  1.00 53.36           C  
ATOM      4  O4'  DT A4001       2.221  36.073  49.751  1.00 52.48           O  
ATOM      5  C3'  DT A4001       3.476  34.488  50.971  1.00 52.32           C  
ATOM      6  O3'  DT A4001       2.688  33.294  51.069  1.00 54.15           O  
ATOM      7  C2'  DT A4001       4.091  34.642  49.589  1.00 52.45           C  
ATOM      8  C1'  DT A4001       2.948  35.269  48.813  1.00 50.40           C  
ATOM      9  N1   DT A4001       3.295  36.105  47.620  1.00 46.82           N  
ATOM     10  C2   DT A4001       3.315  35.475  46.389  1.00 44.10           C  
ATOM     11  O2   DT A4001       3.079  34.285  46.233  1.00 42.08           O  
ATOM     12  N3   DT A4001       3.626  36.299  45.334  1.00 41.51           N  
ATOM     13  C4   DT A4001       3.919  37.648  45.372  1.00 43.02           C  
ATOM     14  O4   DT A4001       4.181  38.250  44.332  1.00 44.08           O  
ATOM     15  C5   DT A4001       3.891  38.244  46.680  1.00 43.79           C  
ATOM     16  C7   DT A4001       4.204  39.702  46.805  1.00 42.62           C  
ATOM     17  C6   DT A4001       3.581  37.455  47.732  1.00 45.67           C  
ATOM     18  P    DT A4002       3.429  31.863  51.098  1.00 57.02           P  
ATOM     19  OP1  DT A4002       2.467  30.721  51.156  1.00 54.83           O  

... < data omitted for brevity >

ATOM   3884  C   LEU J 318      27.065  26.434 104.638  1.00 69.24           C  
ATOM   3885  O   LEU J 318      25.978  26.660 105.215  1.00 71.58           O  
ATOM   3886  CB  LEU J 318      26.876  28.109 102.776  1.00 63.64           C  
ATOM   3887  CG  LEU J 318      27.060  29.547 102.286  1.00 64.34           C  
ATOM   3888  CD1 LEU J 318      25.692  30.140 101.986  1.00 63.18           C  
ATOM   3889  CD2 LEU J 318      27.769  30.403 103.324  1.00 64.33           C  
ATOM   3890  OXT LEU J 318      27.636  25.317 104.661  1.00 68.15           O  
TER    3891      LEU J 318                                                      
HETATM 3892  O   HOH A6001      15.403  36.729  37.482  1.00 32.21           O  
HETATM 3893  O   HOH A6002      27.779  37.839  46.346  1.00 14.74           O  
HETATM 3894  O   HOH A6003      24.852  40.609  44.995  1.00 19.97           O  
HETATM 3895  O   HOH A6011      63.157  40.356  39.359  1.00 65.68           O  
HETATM 3896  O   HOH A6013      56.916  34.826  37.981  1.00 49.34           O  

... < data omitted for brevity >

MASTER      322    0    0    4   28    0    0    6 3974    5    0   38          
END                                                                             

A.1.42. Pfam/Stockholm

Pfam multiple sequence alignment format (also known as Stockholm format) is supported for input of sequences only. Pfam is a collection of multiple sequence alignments and hidden Markov models that covers many common protein domains and families. For further information see:

http://www.sanger.ac.uk/Software/Pfam/
# STOCKHOLM 1.0
#=GF ID   14-3-3
#=GF AC   PF00244
#=GF DE   14-3-3 protein
#=GF AU   Finn RD
#=GF AL   Clustalw
#=GF SE   Prosite
#=GF GA   25 25
#=GF TC   35.40 35.40
#=GF NC   19.10 19.10
#=GF TP   Domain
#=GF BM   hmmbuild -f HMM SEED
#=GF BM   hmmcalibrate --seed 0 HMM
#=GF RN   [1]
#=GF RM   95327195
#=GF RT   Structure of a 14-3-3 protein and implications for
#=GF RT   coordination of multiple signalling pathways. 
#=GF RA   Xiao B, Smerdon SJ, Jones DH, Dodson GG, Soneji Y, Aitken
#=GF RA   A, Gamblin SJ; 
#=GF RL   Nature 1995;376:188-191.
#=GF RN   [2]
#=GF RM   95327196
#=GF RT   Crystal structure of the zeta isoform of the 14-3-3
#=GF RT   protein. 
#=GF RA   Liu D, Bienkowska J, Petosa C, Collier RJ, Fu H, Liddington
#=GF RA   R; 
#=GF RL   Nature 1995;376:191-194.
#=GF RN   [3]
#=GF RM   96182649
#=GF RT   Interaction of 14-3-3 with signaling proteins is mediated
#=GF RT   by the recognition of phosphoserine. 
#=GF RA   Muslin AJ, Tanner JW, Allen PM, Shaw AS; 
#=GF RL   Cell 1996;84:889-897.
#=GF RN   [4]
#=GF RM   97424374
#=GF RT   The 14-3-3 protein binds its target proteins with a common
#=GF RT   site located towards the C-terminus. 
#=GF RA   Ichimura T, Ito M, Itagaki C, Takahashi M, Horigome T,
#=GF RA   Omata S, Ohno S, Isobe T 
#=GF RL   FEBS Lett 1997;413:273-276.
#=GF RN   [5]
#=GF RM   96394689
#=GF RT   Molecular evolution of the 14-3-3 protein family. 
#=GF RA   Wang W, Shakes DC 
#=GF RL   J Mol Evol 1996;43:384-398.
#=GF RN   [6]
#=GF RM   96300316
#=GF RT   Function of 14-3-3 proteins. 
#=GF RA   Jin DY, Lyu MS, Kozak CA, Jeang KT 
#=GF RL   Nature 1996;382:308-308.
#=GF DR   PROSITE; PDOC00633;
#=GF DR   SMART; 14_3_3;
#=GF DR   PRINTS; PR00305;
#=GF DR   SCOP; 1a4o; fa;
#=GF DR   INTERPRO; IPR000308;
#=GF DR   PDB; 1a37 A; 3; 228;
#=GF DR   PDB; 1a37 B; 3; 228;
#=GF DR   PDB; 1a38 A; 3; 228;
#=GF DR   PDB; 1a38 B; 3; 228;
#=GF DR   PDB; 1a4o A; 3; 228;
#=GF DR   PDB; 1a4o B; 3; 228;
#=GF DR   PDB; 1a4o C; 3; 228;
#=GF DR   PDB; 1a4o D; 3; 228;
#=GF DR   PDB; 1qja B; 3; 229;
#=GF DR   PDB; 1qja A; 3; 230;
#=GF DR   PDB; 1qjb A; 3; 232;
#=GF DR   PDB; 1qjb B; 3; 232;
#=GF SQ   148
#=GS O61131/11-251      AC O61131
#=GS 143L_ARATH/7-245   AC P48349
#=GS O49082/8-245       AC O49082

... < data omitted for brevity >

#=GS RA24_SCHPO/6-241   AC P42656
#=GS 143B_HORVU/9-246   AC Q43470
#=GS 143N_ARATH/5-242   AC Q96300
#=GS 143Z_HUMAN/3-236 DR PDB; 1a37 A; 3; 228;
#=GS 143Z_HUMAN/3-236 DR PDB; 1a37 B; 3; 228;
#=GS 143Z_HUMAN/3-236 DR PDB; 1a38 A; 3; 228;
#=GS 143Z_HUMAN/3-236 DR PDB; 1a38 B; 3; 228;
#=GS 143Z_HUMAN/3-236 DR PDB; 1a4o A; 3; 228;
#=GS 143Z_HUMAN/3-236 DR PDB; 1a4o B; 3; 228;
#=GS 143Z_HUMAN/3-236 DR PDB; 1a4o C; 3; 228;
#=GS 143Z_HUMAN/3-236 DR PDB; 1a4o D; 3; 228;
#=GS 143Z_HUMAN/3-236 DR PDB; 1qja B; 3; 229;
#=GS 143Z_HUMAN/3-236 DR PDB; 1qja A; 3; 230;
#=GS 143Z_HUMAN/3-236 DR PDB; 1qjb A; 3; 232;
#=GS 143Z_HUMAN/3-236 DR PDB; 1qjb B; 3; 232;
O61131/11-251                RSDCTYRSKLAEQAERYDEMADAMRTLVEQCVnn.......dkdELTVEERNLLSVAYKNAVGARRASWRIISSVEQKEMSKA.NVHNKNIAATYRKKVEEELNNIC.QDILN.LLTKKLIPNT..SESESKVFYYKMKGDYYRYISEFS.CDE.GKKEASNFAQEAYQKATDIAENELPSTHPIRLGLALNYSVFFY..EILNQPHQACEMAKRAF...DDAITEFDNV..SEDS..YKDSTLI.MQLLRDNLTLWTSDLQGDQ
O61132/1-232                 ---------LAEQAERYDEMADAMRTLVEQCVnn.......dkdELTVEERNLLSVAYKNAVGARRASWRIISSVEQKEMSKA.NVHNKNVAATYRKKVEEELNNIC.QDILN.LLTKKLIPNT..SESESKVFYYKMKGDYYRYISEFS.CDE.GKKEASNFAQEAYQKDTDIAENELPSTHPIRLGLALNYSVFFY..EILNQLHQACEMAKRAF...DDAITEFDNV..SEDS..YKDSTLI.MQLLRDNLTLWTSDLQGDQ
O96436/9-256                 REEHVYRAKLAEQAERYDEMAEAMKNLVENCLdqnnsppgakgdELTVEERNLLSVAYKNAVGARRASWRIISSVEQKEANRN.HMANKALAASYRQKVENELNKIC.QEILT.LLTDKLLPRT..TDSESRVFYFKMKGDYYRYISEFS.NEE.GKKASAEQAEESYKRATDTAEAELPSTHPIRLGLALNYSVFYY..EILNQPQKACEMAKLAF...DDAITEFDSV..SEDS..YKDSTLI.MQLLRDNLTLWTSDLQTQE
O60955/9-251                 RDEYVYKAKLAEQAERYDEMAEAMKNLVENCLdeq.....qpkdELSVEERNLLSVAYKNAVGARRASWRIISSVEQKELSKQ.HMQNKALAAEYRQKVEEELNKIC.HDILQ.LLTDKLIPKT..SDSESKVFYYKMKGDYYRYISEFS.GEE.GKKQAADQAQESYQKATETAEAELPSTHPIRLGLALNYSVFFY..EILNLPQQACEMAKRAF...DDAITEFDNV..SEDS..YKDSTLI.MQLLRDNLTLWTSDLQADQ
1433_NEOCA/9-251             RDEYVYKAKLAEQAERYDEMAEAMKNLVENCLdeq.....qpkdELSVEERNLLSVAYKNAVGARRASWRIISSVEQKELSKQ.HMQNKALAAEYRQKVEEELNKIC.HDILQ.LLTDKLIPKT..SDSESKVFYYKMKGDYYRYISEFS.GEE.GKKQAADQAQESYQKATETAEGHSPATHPIRLGLALNYSVFFY..EILNLPQQACEMAKRAF...DDAITEFDNV..SEDS..YKDSTLI.MQLLRDNLTLWTSDLQADQ
Q21539/1-97                  --------------------------------............---------------------------------------.-----------------------.-----.----------..-----------MVADHFRYLVQYD.-DI.NREEHAHKSRIAYQEALGIAKDKMQPTHPIRLGLALNASALNF..DVLNLPKEANEIAQSAL...DSAHRELEKMksSLDS..YDISNL-.-------------------
O65165/1-21                  --------------------------------............---------------------------------------.-----------------------.-----.----------..------------------------.---.-------------------------------------------..-----------------...----------..----..-----LI.MQLLRDNLTLWTSDMQEDG
Q9U491/4-239                 REALVYRAKLAEQLERYDEMVDAMKEVVEMAE............ELTVEERNLLSVAYKNVIGSRRSSWRVFSAVEQTEGNRG.NAEKQACAKKFREVLESELDRVS.KDILE.LIDKYLIKSA..TKSDSKVFYLKMKGDYFRYMAEFS.VDP.QRKKAAEESNKAYQEASEIAATQLFPTHPIRLGLALNYSVYFY..EIMNDPDEACRLAQAAF...DDAIAKLDQL..SEES..YKDSTLI.MQLLRDNLTLWTSDPERDD
1433_XENLA/1-227             -------AKLSEQAERYDDMAASMKAVTELGA............ELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTEG--.NDKRQQMAREYREKVETELQDIC.KDVLD.LLDRFLVPNA..TPPESKVFYLKMKGDYYRYLSEVA.SGD.SKQETVASSQQAYQEAFEISKSEMQPTHPIRLGLALNFSVFYY..EILNSPEKACSLAKSAF...DEAIRELDTL..NEES..YKDSTLI.MQLLRDNLTLWTSENQGEE
143B_BOVIN/4-237             KSELVQKAKLAEQAERYDDMAAAMKAVTEQGH............ELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTER--.NEKKQQMGKEYREKIEAELQDIC.NDVLQ.LLDKYLIPNA..TQPESKVFYLKMKGDYFRYLSEVA.SGD.NKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYY..EILNSPEKACSLAKTAF...DEAIAELDTL..NEES..YKDSTLI.MQLLRDNLTLWTSENQGDE

... < data omitted for brevity >

Q42058/7-112                 RDTFVYLAKLSEXAERYEEMVESMKSVAKLNV............DLTVEERNLLSVGYKNVIGSRRASWRIFSSIEQKEAVKG.NDXNVKRIKEYMEKVELELSNIC.IDIMS.VLDEHLI---..------------------------.---.-------------------------------------------..-----------------...----------..----..-------.-------------------
1433_ENTHI/1-236             SEDCVFLSKLAEQSERYDEMVQYMKQVAALNT............ELSVEERNLLSVAYKNVIGSRRASWRIITSLEQKEQAKG.NDKHVEIIKGYRAKIEDELAKYC.DDVLK.VIKENLLPNA..STSESKVFYKKMEGDYYRYYAEFT.VDE.KRQEVADKSLAAYTEATEISNADLAPTHPIRLGLALNFSVFYY..EIMNDADKACQLAKQAF...DDSIAKLDEV..PESS..YKDSTLI.MQLLRDNLTLWTSDTADEE
1431_ENTHI/4-239             REDCVYTAKLAEQSERYDEMVQCMKQVAEMEA............ELSIEERNLLSVAYKNVIGAKRASWRIISSLEQKEQAKG.NDKHVEIIKGYRAKIEKELSTCC.DDVLK.VIQENLLPKA..STSESKVFFKKMEGDYYRYFAEFT.VDE.KRKEVADKSLAAYTEATEISNAELAPTHPIRLGLALNFSVFYF..EIMNDADKACQLAKQAF...DDAIAKLDEV..PENM..YKDSTLI.MQLLRDNLTLWTSDACDEE
1432_ENTHI/4-238             REDLVYLSKLAEQSERYEEMVQYMKQVAEMGT............ELSVEERNLISVAYKNVVGSRRASWRIISSLEQKEQAKG.NTQRVELIKTYRAKIEQELSQKC.DDVLK.IITEFLLKNS..TSIESKVFFKKMEGDYYRYYAEFT.VDE.KRKEVADKSLAAYQEATDTA-ASLVPTHPIRLGLALNFSVFYY..QIMNDADKACQLAKEAF...DEAIQKLDEV..PEES..YKESTLI.MQLLRDNLTLWTSDMGDDE
Q9UAH0/5-239                 REELIYMAKIAEQTERFEDMLEYMKKVVQTGQ............ELSVEERNLLSVAYKNTVGSRRSAWRSISAIQQKEESKG.S-KHLDLLTNQKKKIETELNLYC.DDILK.LLNDFLIKNA..TNAEAQVFFLKMKGDYYRYIAEYA.QGD.EHKKAADGALDSYNKACEIANSELRPTHPIRLGLALNFSVFHY..EVLNDPSKACTLAKTAF...DEAIGDIERI..QEDQ..YKDATTI.MQLIRDNLTLWTSEFQDDA
Q9XYW8/1-233                 -----YLAMLAEQCSRYKEMVQFLEDMVKQRD...........kDLNSDERNLLSIAYKNSISGGRSAVRTIMAYEAKEKKKE.NSTFLPYITEYKKQVEDELTKLC.QGVLK.TTDEQLLKKA..EDDEAKVFYIKMKGDYNRYIAEYA.EGD.LKKQVSDDALKAYDEATEIA-KTLPVLNPIALGLALNFSVFYY..EVINDHKKAIEIAKAAV...EKADKELPNI..DEDAdeNRDTVSI.YNLLKENLDMWVSEEEGDQ
O61173/2-194                 -EKNVYLAMLAEQCSRYKEMVQFLEDMVKQRD...........kDLNSDERNLLSIAYKNSISGGRSAVRTIMAYEAKEKKKE.NSTFLPYITEYKKQVEDELTKLC.QGVLK.TTDEQLLKKA..EDDEAKVFYIKMKGDYNRYIAEYA.EGD.LKKQVSDDALKAYDEATEIA-KTLPVLNPIALGLALNFSVFYY..EVINDHKKAIEIAKA--...----------..----..-------.-------------------
Q9XYW7/4-229                 -EKQVYLAMLAEQCSRYEDMMTFLEDMVKAKA...........eDLSSDERNLLSIAYKNTISLDRQAIRTLLAYESKEAKKA.ESPYLDYIKEYKAKVQKELEDLC.NKINR.TIDDNLLPKA..TTDEAKVFYHKMKGDYCRYIAENV.DGD.TKKKYSDEGLAAYNAALEAA-KNIDYKNPVKLGLALNLSVFYY..EVVGNKDEACKLAEDTLsksKEALNGADE-..EEDE..VKDAMSI.VNLLEENL-----------
Q9SFK4/1-214                 -----------------------MRKVCELDI............ELSEEERDLLTTGYKNVMEAKRVSLRVISSIEKMEDSKG.NDQNVKLIKGQQEMVKYEFFNVC.NDILS.LIDSHLIPST.tTNVESIVLFNRVKGDYFRYMAEFG.SDA.ERKENADNSLDAYKVAMEMAENSLAPTNMVRLGLALNFSIFNY..EIHKSIESACKLVKKAY...DEAITELDGL..DKNI..CEESMYI.IEMLKYNLSTWTSGDGNGN
Q9XZV0/2-235                 KEELLNRCKLNDLIENYGEMFEYLKELSHIKI............DLQPDELDLITRCTKCYIGHKRGQYRKILTLIDKDKIVD.NQKNSALLEILRKKLSEEILLLC.NSTIE.LSQNFLNNNV..FPKKTQLFFTKIIADHYRYIYEIN.GKE.DIKLKAKEYYE--KGLQTIKTCKYNSTETAYLTFYLNYSVFLH..DTMRNTEESIKVSKACL...YEALKDTEDI..VDNS..QKDIVLL.CQMLKDNISLWKTETNEDN
#=GC SS_cons                 HHHHHHHHHHHHHTTCHHHHHHHHHHHHTTSC............CCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCTTT--.CCHHHHHHHHHHHHHHHHHHHHH.HHHHH.HHHHTTTTCC..CSCHHHHHHHHHHHHHHHHHHHHC.CSC.HHHHHHHHHHHHHHHHHHHHHCHCCTTCHCHHHHHHHHHHHHC..HTSCCHHHCHHHHHHHH...HHHHTTCGGC..CTTT..HHHHHHH.HHHHHHHHHHCTCCCXXXX
#=GC SA_cons                 26310320300350512510050022003352............4045500400120033002310402420152179179--.38752510440144014203510.43002.0035201642..754403000010100011100201.867.7465125302500340252067635113122100001001127..31372485135106412...5415867932..3994..6651462.142043126627759XXXX
//
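
The layout of the example above can be read with a few lines of Python. The following is only a minimal sketch, not the parser used by EMBOSS. It assumes that lines starting with '#' are markup, that '//' ends the alignment, and that every other non-blank line holds a sequence name followed by aligned residues. The function name parse_stockholm is hypothetical.

```python
def parse_stockholm(text):
    """Return (annotations, sequences) for one Stockholm alignment.

    annotations maps each #=GF tag to the list of its values;
    sequences maps each sequence name to its aligned residues.
    """
    annotations = {}
    sequences = {}
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line == "//":
            break  # end of this alignment
        if line.startswith("#=GF"):
            parts = line.split(None, 2)  # "#=GF", tag, value
            if len(parts) == 3:
                annotations.setdefault(parts[1], []).append(parts[2])
            continue
        if line.startswith("#"):
            continue  # header, #=GS and #=GC lines are skipped in this sketch
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue  # a name with no residues: ignored here
        name, residues = fields
        # A name may recur if the alignment is split into blocks, so append.
        sequences[name] = sequences.get(name, "") + residues.replace(" ", "")
    return annotations, sequences
```

The "data omitted for brevity" markers in the printed example are not part of the format and would have to be removed before the text is passed to this function.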

A.1.43. PHYLIP (interleaved)

PHYLIP multiple sequence alignment format (interleaved) is the format used in the PHYLIP package for phylogenetic analysis. Sequence names are limited to 10 characters, which can cause problems when converting data from formats that allow longer names. For further information see:

http://evolution.genetics.washington.edu/phylip.html
 4 131
IXI_234   TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_235   TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_236   TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT
IXI_237   TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT

          GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSRSAG
          GGWKTCSGTC TTSTSTRHRG RSGW------ ----RASRKS MRAACSRSAG
          GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSR--G
          GGYKTCSGTC TTSTSTRHRG RSGYSARTTT AACLRASRKS MRAACSR--G

          SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E
          SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E
          SRPPRFAPPL MSSCITSTTG PPPPAGDRSH E
          SRPNRFAPTL MSSCLTSTTG PPAYAGDRSH E
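
One practical consequence of the 10-character name limit is that longer names from other formats may stop being unique once they are shortened. The sketch below illustrates a check that could be run before conversion. It assumes names are simply cut at 10 characters, which may not match what a particular program does, and the function name truncate_names is hypothetical.

```python
def truncate_names(names, width=10):
    """Cut each name to `width` characters and report any clashes.

    Returns (short_names, clashes), where each clash is a tuple
    (first_name, later_name, shared_short_name).
    """
    short = [name[:width] for name in names]
    first_owner = {}
    clashes = []
    for original, cut in zip(names, short):
        # Remember which full name first produced this short name.
        owner = first_owner.setdefault(cut, original)
        if owner != original:
            clashes.append((owner, original, cut))
    return short, clashes
```

An empty clash list means the shortened names can be used as they are; otherwise the affected sequences need to be renamed by hand or by some numbering scheme before conversion.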

A.1.44. PHYLIP (non-interleaved)

PHYLIP multiple sequence alignment format (non-interleaved) is the format used in older versions of the PHYLIP package for phylogenetic analysis. Sequence names are limited to 10 characters, which can cause problems when converting data from formats that allow longer names. For further information see:

http://evolution.genetics.washington.edu/phylip.html

The non-interleaved format was used in PHYLIP version 3.2. It is also called phylip3 for backward compatibility with earlier EMBOSS versions.

4 131
IXI_234   TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT
          GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSRSAG
          SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E
IXI_235   TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT
          GGWKTCSGTC TTSTSTRHRG RSGW------ ----RASRKS MRAACSRSAG
          SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E
IXI_236   TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT
          GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSR--G
          SRPPRFAPPL MSSCITSTTG PPPPAGDRSH E
IXI_237   TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT
          GGYKTCSGTC TTSTSTRHRG RSGYSARTTT AACLRASRKS MRAACSR--G
          SRPNRFAPTL MSSCLTSTTG PPAYAGDRSH E
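
The non-interleaved layout shown above can be read one sequence at a time. The following is a minimal sketch, not the EMBOSS implementation. It assumes the header line gives the number of sequences and the alignment length, that the name occupies the first 10 columns of the first line of each sequence, and that continuation lines carry residues only. The function name read_phylip_sequential is hypothetical.

```python
def read_phylip_sequential(text):
    """Read a non-interleaved PHYLIP alignment into a {name: residues} dict."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    nseq, length = (int(x) for x in lines[0].split())
    records = {}
    pos = 1
    for _ in range(nseq):
        # The name field is the first 10 columns; residues follow it.
        name = lines[pos][:10].strip()
        residues = lines[pos][10:].replace(" ", "")
        pos += 1
        # Keep reading continuation lines until the stated length is reached.
        while len(residues) < length:
            residues += lines[pos].replace(" ", "")
            pos += 1
        records[name] = residues
    return records
```

The alignment length in the header is what tells the reader where one sequence ends and the next name begins, which is why the sketch relies on it rather than on indentation. A truncated file makes this sketch fail with an IndexError.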

A.1.45. Raw

Raw format is similar to "text/plain" format but stricter. Whitespace (spaces, TAB characters and line breaks) and digits are removed. Alphabetic characters are accepted, as are gap characters ('-') and translated STOP codon characters ('*'). Any other character (for example, a punctuation mark) causes the file to be rejected as erroneous. It is therefore generally safer to use this format than "text/plain" format.

ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtc
gccaatatgcagctctttgtccgcgcccaggagctacacaccttcgaggt
gaccggccaggaaacggtcgcccagatcaaggctcatgtagcctcactgg
agggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctg
gaggatgaggccactctgggccagtgcggggtggaggccctgactaccct
ggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaag
aagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcg
ctttgtcaacgttgtgcccacctttggcaagaagaagggccccaatgcca
actcttaagtcttttgtaattctggctttctctaataaaaaagccactta
gttcagtcaaaaaaaaaa
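
The rules for raw format can be illustrated with a short Python sketch. This is an illustration of the rules as described in this section, not the EMBOSS code. The function name read_raw is hypothetical, and restricting "alphabetic" to ASCII letters is an assumption.

```python
import string


def read_raw(text):
    """Apply the raw-format rules to `text` and return the sequence.

    Whitespace and digits are dropped, letters plus '-' and '*' are kept,
    and any other character raises ValueError.
    """
    allowed = set(string.ascii_letters) | {"-", "*"}
    kept = []
    for ch in text:
        if ch.isspace() or ch in string.digits:
            continue  # silently removed
        if ch not in allowed:
            raise ValueError("illegal character in raw sequence: %r" % ch)
        kept.append(ch)
    return "".join(kept)
```

Rejecting the whole input on the first illegal character, rather than skipping it, is what makes this format safer than "text/plain": a file that is not a bare sequence is unlikely to be read by mistake.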

A.1.46. RefseqP

RefseqP entry format supports all the fields in the latest NCBI protein reference sequence database format. RefSeq is a database maintained by NCBI. For more information see:

http://www.ncbi.nlm.nih.gov/RefSeq/

Where RefseqP format is used for output, fields for which data are available are completed and fields with no information are omitted. Exactly which data are present depends on the source of the input sequences. The EMBOSS command line allows data, such as accession numbers, to be provided if they do not form part of the input sequence data (see Section 6.4, “Datatype-specific Command Line Qualifiers”).

GenPept currently uses the same parser as the closely related RefseqP format so these can be used interchangeably until the original formats diverge.

LOCUS       NP_001988                133 aa            linear   PRI 29-MAR-2009
DEFINITION  ubiquitin-like protein fubi and ribosomal protein S30 precursor
            [Homo sapiens].
ACCESSION   NP_001988
VERSION     NP_001988.1  GI:4503659
DBSOURCE    REFSEQ: accession NM_001997.3
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (residues 1 to 133)
  AUTHORS   Yu,Y., Ji,H., Doudna,J.A. and Leary,J.A.
  TITLE     Mass spectrometric analysis of the human 40S ribosomal subunit:
            native and HCV IRES-bound complexes
  JOURNAL   Protein Sci. 14 (6), 1438-1446 (2005)
   PUBMED   15883184
REFERENCE   2  (residues 1 to 133)
  AUTHORS   Andersen,J.S., Lam,Y.W., Leung,A.K., Ong,S.E., Lyon,C.E.,
            Lamond,A.I. and Mann,M.
  TITLE     Nucleolar proteome dynamics
  JOURNAL   Nature 433 (7021), 77-83 (2005)
   PUBMED   15635413
REFERENCE   3  (residues 1 to 133)
  AUTHORS   Mourtada-Maarabouni,M., Kirkham,L., Farzaneh,F. and Williams,G.T.
  TITLE     Regulation of apoptosis by fau revealed by functional expression
            cloning and antisense expression
  JOURNAL   Oncogene 23 (58), 9419-9426 (2004)
   PUBMED   15543234
  REMARK    GeneRIF: Overexpression of fau in the sense orientation induces
            cell death, which is inhibited both by Bcl-2 and by inhibition of
            caspases, in line with its proposed role in apoptosis
REFERENCE   4  (residues 1 to 133)
  AUTHORS   Kapp,L.D. and Lorsch,J.R.
  TITLE     The molecular mechanics of eukaryotic translation
  JOURNAL   Annu. Rev. Biochem. 73, 657-704 (2004)
   PUBMED   15189156
  REMARK    Review article
REFERENCE   5  (residues 1 to 133)
  AUTHORS   Rossman,T.G., Visalli,M.A. and Komissarova,E.V.
  TITLE     fau and its ubiquitin-like domain (FUBI) transforms human
            osteogenic sarcoma (HOS) cells to anchorage-independence
  JOURNAL   Oncogene 22 (12), 1817-1821 (2003)
   PUBMED   12660817
  REMARK    GeneRIF: role in transforming osteogenic sarcoma cells to
            anchorage-independence
REFERENCE   6  (residues 1 to 133)
  AUTHORS   Vladimirov,S.N., Ivanov,A.V., Karpova,G.G., Musolyamov,A.K.,
            Egorov,T.A., Thiede,B., Wittmann-Liebold,B. and Otto,A.
  TITLE     Characterization of the human small-ribosomal-subunit proteins by
            N-terminal and internal sequencing, and mass spectrometry
  JOURNAL   Eur. J. Biochem. 239 (1), 144-149 (1996)
   PUBMED   8706699
REFERENCE   7  (residues 1 to 133)
  AUTHORS   Wool,I.G., Chan,Y.L. and Gluck,A.
  TITLE     Structure and evolution of mammalian ribosomal proteins
  JOURNAL   Biochem. Cell Biol. 73 (11-12), 933-947 (1995)
   PUBMED   8722009
  REMARK    Review article
REFERENCE   8  (residues 1 to 133)
  AUTHORS   Michiels,L., Van der Rauwelaert,E., Van Hasselt,F., Kas,K. and
            Merregaert,J.
  TITLE     fau cDNA encodes a ubiquitin-like-S30 fusion protein and is
            expressed as an antisense sequence in the Finkel-Biskis-Reilly
            murine sarcoma virus
  JOURNAL   Oncogene 8 (9), 2537-2546 (1993)
   PUBMED   8395683
REFERENCE   9  (residues 1 to 133)
  AUTHORS   Kas,K., Schoenmakers,E., van de Ven,W., Weber,G., Nordenskjold,M.,
            Michiels,L., Merregaert,J. and Larsson,C.
  TITLE     Assignment of the human FAU gene to a subregion of chromosome 11q13
  JOURNAL   Genomics 17 (2), 387-392 (1993)
   PUBMED   8406491
REFERENCE   10 (residues 1 to 133)
  AUTHORS   Kas,K., Michiels,L. and Merregaert,J.
  TITLE     Genomic structure and expression of the human fau gene: encoding
            the ribosomal protein S30 fused to a ubiquitin-like protein
  JOURNAL   Biochem. Biophys. Res. Commun. 187 (2), 927-933 (1992)
   PUBMED   1326960
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff in
            collaboration with Francesco Amaldi. The reference sequence was
            derived from BP296770.1, X65923.1 and AK026639.1.
            
            Summary: This gene is the cellular homolog of the fox sequence in
            the Finkel-Biskis-Reilly murine sarcoma virus (FBR-MuSV). It
            encodes a fusion protein consisting of the ubiquitin-like protein
            fubi at the N terminus and ribosomal protein S30 at the C terminus.
            It has been proposed that the fusion protein is
            post-translationally processed to generate free fubi and free
            ribosomal protein S30. Fubi is a member of the ubiquitin family,
            and ribosomal protein S30 belongs to the S30E family of ribosomal
            proteins. Whereas the function of fubi is currently unknown,
            ribosomal protein S30 is a component of the 40S subunit of the
            cytoplasmic ribosome. Pseudogenes derived from this gene are
            present in the genome. Similar to ribosomal protein S30, ribosomal
            proteins S27a and L40 are synthesized as fusion proteins with
            ubiquitin. [provided by RefSeq].
            
            Publication Note:  This RefSeq record includes a subset of the
            publications that are available for this gene. Please see the
            Entrez Gene record to access additional publications.
FEATURES             Location/Qualifiers
     source          1..133
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="11"
                     /map="11q13"
     Protein         1..133
                     /product="ubiquitin-like protein fubi and ribosomal
                     protein S30 precursor"
                     /note="FBR-MuSV-associated ubiquitously expressed;
                     ubiquitin-like-S30 fusion protein; 40S ribosomal protein
                     S30; ubiquitin-like protein fubi; FAU-encoded
                     ubiquitin-like protein; Monoclonal nonspecific suppressor
                     factor beta; Finkel-Biskis-Reilly murine sarcoma virus
                     (FBR-MuSV) ubiquitously expressed (fox derived)"
                     /calculated_mol_wt=14259
     Region          1..74
                     /region_name="Fubi"
                     /note="Fubi is a ubiquitin-like protein encoded by the fau
                     gene which has an  N-terminal ubiquitin-like domain (also
                     referred to as FUBI) fused to the ribosomal protein S30.
                     Fubi is thought to be a tumor suppressor protein and the
                     FUBI domain may act as a...; cd01793"
                     /db_xref="CDD:29195"
     mat_peptide     1..74
                     /product="ubiquitin-like protein fubi"
                     /calculated_mol_wt=7760
     Region          74..133
                     /region_name="Ribosomal_S30"
                     /note="Ribosomal protein S30; cl02062"
                     /db_xref="CDD:141357"
     mat_peptide     75..133
                     /product="ribosomal protein S30"
                     /calculated_mol_wt=6648
     CDS             1..133
                     /gene="FAU"
                     /gene_synonym="asr1; FAU1; FLJ22986; Fub1; Fubi; MNSFbeta;
                     RPS30"
                     /coded_by="NM_001997.3:108..509"
                     /db_xref="CCDS:CCDS8095.1"
                     /db_xref="GeneID:2197"
                     /db_xref="HGNC:3597"
                     /db_xref="HPRD:00002"
                     /db_xref="MIM:134690"
ORIGIN      
        1 mqlfvraqel htfevtgqet vaqikahvas legiapedqv vllagapled eatlgqcgve
       61 alttlevagr mlggkvhgsl aragkvrgqt pkvakqekkk kktgrakrrm qynrrfvnvv
      121 ptfgkkkgpn ans
//
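
In the entry above the sequence itself sits between the ORIGIN line and the terminating '//', with position numbers and spaces mixed in for readability. The following minimal sketch shows one way to recover the bare sequence. It is not the EMBOSS parser, it ignores every other field, and the function name read_origin_sequence is hypothetical.

```python
def read_origin_sequence(entry):
    """Return the residues found between ORIGIN and // in one entry."""
    in_origin = False
    parts = []
    for line in entry.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break  # end of the entry
        if in_origin:
            # Drop the position numbers and the spaces between blocks.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(parts)
```

A simple consistency check is to compare the length of the returned string with the length given on the LOCUS line.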

A.1.47. SELEX

SELEX is an interleaved multiple sequence alignment format used by Sean Eddy's HMMER package. HMMER is a freely distributable implementation of profile HMM software for protein sequence analysis. SELEX format can store RNA secondary structure as part of the sequence annotation. For further information see:

http://hmmer.janelia.org/
#=SQ IXI_234 1.00 - - 0..0:0 -
#=SQ IXI_235 1.00 - - 0..0:0 -
#=SQ IXI_236 1.00 - - 0..0:0 -
#=SQ IXI_237 1.00 - - 0..0:0 -

IXI_234 TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235 TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236 TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237 TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT

IXI_234 GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
IXI_235 GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAG
IXI_236 GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR--G
IXI_237 GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR--G

IXI_234 SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_235 SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_236 SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
IXI_237 SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE
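
Because SELEX is interleaved, each sequence is spread over several blocks that have to be joined by name. The sketch below shows the idea under simplifying assumptions: lines starting with '#' are treated as markup and skipped, gaps are written as '-' as in the example above, and any structure annotation is ignored. It is not the EMBOSS or HMMER reader, and the function name read_selex is hypothetical.

```python
def read_selex(text):
    """Join the blocks of an interleaved SELEX alignment by sequence name."""
    sequences = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank separator or markup line
        fields = line.split(None, 1)
        if len(fields) < 2:
            continue  # a name with no residues in this block
        name, residues = fields
        sequences[name] = sequences.get(name, "") + residues.strip()
    return sequences
```

Since the name is repeated in every block, the blocks can be joined without knowing the number of sequences in advance, unlike the interleaved PHYLIP layout where only the first block carries names.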

A.1.48. Staden (obsolete)

The format used by older versions of the Staden package. Staden is a package of programs for sequence handling and analysis that is particularly useful for analysis of sequence trace data and large scale sequence assembly projects. For further information see:

http://staden.sourceforge.net/

Staden stores single sequencing experiment reads in a format derived from EMBL. All EMBL tags are allowed, plus many extras. Unusually, the extra tags are allowed to continue beyond the // line which only marks the end of the sequence. The EX experiment line is used to create a sequence description. Accuracy values are stored, or at least the largest value for each sequence position. Optional comments may be inserted at any position within the sequence. When EMBOSS reads Staden format, it recognizes a comment at the top of the sequence but considers comments inside the sequence as part of the sequence. In addition, some alternative nucleotide ambiguity codes are used and must be converted.

Staden format is now obsolete: the latest version of the Staden package does not support it. EMBOSS retains it to accept old data files. Use the "Staden experiment" format (see below) with the latest Staden version.

<X65923---->
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa

A.1.49. Strider

Format used by DNA Strider, which is a molecular sequence editor with integrated tools for DNA and protein sequence analysis. For further information see:

http://nar.oxfordjournals.org/cgi/reprint/16/5/1829
; ### from DNA Strider ;-)
; DNA sequence  HSFAU, 518 bases
;
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtc
gccaatatgcagctctttgtccgcgcccaggagctacacaccttcgaggt
gaccggccaggaaacggtcgcccagatcaaggctcatgtagcctcactgg
agggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctg
gaggatgaggccactctgggccagtgcggggtggaggccctgactaccct
ggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaag
aagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcg
ctttgtcaacgttgtgcccacctttggcaagaagaagggccccaatgcca
actcttaagtcttttgtaattctggctttctctaataaaaaagccactta
gttcagtcaaaaaaaaaa
//
; ### from DNA Strider ;-)
; DNA sequence  HSFAU1, 2016 bases
;
ctaccattttccctctcgattctatatgtacactcgggacaagttctcct
gatcgaaaacggcaaaactaaggccccaagtaggaatgccttagttttcg
gggttaacaatgattaacactgagcctcacacccacgcgatgccctcagc
tcctcgctcagcgctctcaccaacagccgtagcccgcagccccgctggac
accggttctccatccccgcagcgtagcccggaacatggtagctgccatct
ttacctgctacgccagccttctgtgcgcgcaactgtctggtcccgccccg
tcctgcgcgagctgctgcccaggcaggttcgccggtgcgagcgtaaaggg
gcggagctaggactgccttgggcggtacaaatagcagggaaccgcgcggt
cgctcagcagtgacgtgacacgcagcccacggtctgtactgacgcgccct
cgcttcttcctctttctcgactccatcttcgcggtagctgggaccgccgt
tcaggtaagaatggggccttggctggatccgaagggcttgtagcaggttg
gctgcggggtcagaaggcgcggggggaaccgaagaacggggcctgctccg
tggccctgctccagtccctatccgaactccttgggaggcactggccttcc
gcacgtgagccgccgcgaccaccatcccgtcgcgatcgtttctggaccgc
tttccactcccaaatctcctttatcccagagcatttcttggcttctctta
caagccgtcttttctttactcagtcgccaatatgcagctctttgtccgcg
cccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcccag
atcaaggtaaggctgcttggtgcgccctgggttccattttcttgtgctct
tcactctcgcggcccgagggaacgcttacgagccttatctttccctgtag
gctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgct
cctggcaggcgcgcccctggaggatgaggccactctgggccagtgcgggg
tggaggccctgactaccctggaagtagcaggccgcatgcttggaggtgag
tgagagaggaatgttctttgaagtaccggtaagcgtctagtgagtgtggg
gtgcatagtcctgacagctgagtgtcacacctatggtaatagagtacttc
tcactgtcttcagttcagagtgattcttcctgtttacatccctcatgttg
aacacagacgtccatgggagactgagccagagtgtagttgtatttcagtc
acatcacgagatcctagtctggttatcagcttccacactaaaaattaggt
cagaccaggccccaaagtgctctataaattagaagctggaagatcctgaa
atgaaacttaagatttcaaggtcaaatatctgcaactttgttctcattac
ctattgggcgcagcttctctttaaaggcttgaattgagaaaagaggggtt
ctgctgggtggcaccttcttgctcttacctgctggtgccttcctttccca
ctacaggtaaagtccatggttccctggcccgtgctggaaaagtgagaggt
cagactcctaaggtgagtgagagtattagtggtcatggtgttaggacttt
ttttcctttcacagctaaaccaagtccctgggctcttactcggtttgcct
tctccctccctggagatgagcctgagggaagggatgctaggtgtggaaga
caggaaccagggcctgattaaccttcccttctccaggtggccaaacagga
gaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaacc
ggcgctttgtcaacgttgtgcccacctttggcaagaagaagggccccaat
gccaactcttaagtcttttgtaattctggctttctctaataaaaaagcca
cttagttcagtcatcgcattgtttcatctttacttgcaaggcctcaggga
gaggtgtgcttctcgg
//
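
The layout of the example above (comment lines starting with ";", sequence lines, and a "//" terminator) can be read with a few lines of Python. This is only a sketch based on that example, not the EMBOSS implementation: the function name read_strider is hypothetical, and the pattern used to pull the name out of the "; DNA sequence  NAME, N bases" comment is an assumption about how that line is laid out.

```python
import re

def read_strider(text):
    """Return a list of (name, sequence) tuples from Strider-style text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(";"):
            # Assumed header layout: "; DNA sequence  HSFAU, 518 bases"
            m = re.match(r";\s*\w+ sequence\s+(\S+?),", line)
            if m:
                name = m.group(1)
        elif line == "//":
            # End of the current record.
            records.append((name, "".join(chunks)))
            name, chunks = None, []
        elif line:
            chunks.append(line)
    if chunks:
        # Tolerate a final record with no closing "//".
        records.append((name, "".join(chunks)))
    return records
```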

A.1.50. SwissProt

SwissProt entry format, including all fields in the latest format. UniProtKB/SwissProt is a curated protein sequence database with high quality annotation of protein function, domain structure, post-translational modifications, variants, etc. It has a minimal level of redundancy and a high level of integration with other databases. For further information see:

http://www.expasy.org/sprot/

Where SwissProt format is used for output, fields for which data are available will be completed and fields with no information will be omitted. Exactly what data are present depends very much on the source of the input sequences. The EMBOSS command line allows data, such as accession numbers, to be provided if they do not form part of the input sequence data (see Section 6.4, “Datatype-specific Command Line Qualifiers”).

ID   UBR5_RAT                Reviewed;         920 AA.
AC   Q62671;
DT   01-NOV-1997, integrated into UniProtKB/Swiss-Prot.
DT   05-MAR-2002, sequence version 2.
DT   16-JUN-2009, entry version 67.
DE   RecName: Full=E3 ubiquitin-protein ligase UBR5;
DE            EC=6.3.2.-;
DE   AltName: Full=E3 ubiquitin-protein ligase, HECT domain-containing 1;
DE   AltName: Full=Hyperplastic discs protein homolog;
DE   AltName: Full=100 kDa protein;
DE   Flags: Fragment;
GN   Name=Ubr5; Synonyms=Dd5, Edd, Edd1, Hyd;
OS   Rattus norvegicus (Rat).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi;
OC   Muroidea; Muridae; Murinae; Rattus.
OX   NCBI_TaxID=10116;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA].
RC   STRAIN=Wistar; TISSUE=Testis;
RX   MEDLINE=92253337; PubMed=1533713; DOI=10.1093/nar/20.7.1471;
RA   Mueller D., Rehbein M., Baumeister H., Richter D.;
RT   "Molecular characterization of a novel rat protein structurally
RT   related to poly(A) binding proteins and the 70K protein of the U1
RT   small nuclear ribonucleoprotein particle (snRNP).";
RL   Nucleic Acids Res. 20:1471-1475(1992).
RN   [2]
RP   ERRATUM.
RA   Mueller D., Rehbein M., Baumeister H., Richter D.;
RL   Nucleic Acids Res. 20:2624-2624(1992).
RN   [3]
RP   IDENTIFICATION OF PROBABLE FRAMESHIFT.
RX   MEDLINE=99153743; PubMed=10030672; DOI=10.1038/sj.onc.1202249;
RA   Callaghan M.J., Russell A.J., Woollatt E., Sutherland G.R.,
RA   Sutherland R.L., Watts C.K.W.;
RT   "Identification of a human HECT family protein with homology to the
RT   Drosophila tumor suppressor gene hyperplastic discs.";
RL   Oncogene 17:3479-3491(1998).
RN   [4]
RP   TISSUE SPECIFICITY, AND DEVELOPMENTAL STAGE.
RX   PubMed=12239083; DOI=10.1210/en.2002-220262;
RA   Oughtred R., Bedard N., Adegoke O.A.J., Morales C.R., Trasler J.,
RA   Rajapurohitam V., Wing S.S.;
RT   "Characterization of rat100, a 300-kilodalton ubiquitin-protein ligase
RT   induced in germ cells of the rat testis and similar to the Drosophila
RT   hyperplastic discs gene.";
RL   Endocrinology 143:3740-3747(2002).
CC   -!- FUNCTION: E3 ubiquitin-protein ligase which is a component of the
CC       N-end rule pathway. Recognizes and binds to proteins bearing
CC       specific amino-terminal residues that are destabilizing according
CC       to the N-end rule, leading to their ubiquitination and subsequent
CC       degradation (By similarity). May be involved in maturation and/or
CC       post-transcriptional regulation of mRNA. May play a role in
CC       control of cell cycle progression. May have tumor suppressor
CC       function. Regulates DNA topoisomerase II binding protein (TopBP1)
CC       for the DNA damage response. Plays an essential role in
CC       extraembryonic development (By similarity).
CC   -!- PATHWAY: Protein modification; protein ubiquitination.
CC   -!- SUBUNIT: Binds TOPBP1 (By similarity).
CC   -!- SUBCELLULAR LOCATION: Nucleus (By similarity).
CC   -!- TISSUE SPECIFICITY: Highest levels found in testis. Also present
CC       in liver, kidney, lung and brain.
CC   -!- DEVELOPMENTAL STAGE: In early postnatal life, expression in the
CC       testis increases to reach a maximum around day 28.
CC   -!- PTM: Phosphorylated upon DNA damage, probably by ATM or ATR (By
CC       similarity).
CC   -!- MISCELLANEOUS: A cysteine residue is required for ubiquitin-
CC       thioester formation.
CC   -!- SIMILARITY: Contains 1 HECT (E6AP-type E3 ubiquitin-protein
CC       ligase) domain.
CC   -!- SIMILARITY: Contains 1 PABC domain.
CC   -!- SEQUENCE CAUTION:
CC       Sequence=CAA45756.1; Type=Frameshift; Positions=30;
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; X64411; CAA45756.1; ALT_FRAME; mRNA.
DR   IPI; IPI00207158; -.
DR   PIR; S22659; S22659.
DR   UniGene; Rn.54812; -.
DR   HSSP; O95071; 1I2T.
DR   SMR; Q62671; 515-575.
DR   PhosphoSite; Q62671; -.
DR   Ensembl; ENSRNOG00000006816; Rattus norvegicus.
DR   RGD; 621236; Dd5.
DR   HOVERGEN; Q62671; -.
DR   ArrayExpress; Q62671; -.
DR   GermOnline; ENSRNOG00000006816; Rattus norvegicus.
DR   GO; GO:0005634; C:nucleus; IEA:UniProtKB-SubCell.
DR   GO; GO:0003723; F:RNA binding; IEA:InterPro.
DR   GO; GO:0004842; F:ubiquitin-protein ligase activity; TAS:RGD.
DR   GO; GO:0019941; P:modification-dependent protein catabolic pr...; IEA:UniProtKB-KW.
DR   GO; GO:0016567; P:protein ubiquitination; TAS:RGD.
DR   InterPro; IPR000569; HECT.
DR   InterPro; IPR002004; PABP_HYD.
DR   Gene3D; G3DSA:1.10.1900.10; PABP_HYD; 1.
DR   Pfam; PF00632; HECT; 1.
DR   Pfam; PF00658; PABP; 1.
DR   SMART; SM00119; HECTc; 1.
DR   SMART; SM00517; PolyA; 1.
DR   PROSITE; PS50237; HECT; 1.
DR   PROSITE; PS51309; PABC; 1.
PE   2: Evidence at transcript level;
KW   Ligase; Nucleus; Phosphoprotein; Ubl conjugation pathway.
FT   CHAIN        <1    920       E3 ubiquitin-protein ligase UBR5.
FT                                /FTId=PRO_0000086933.
FT   DOMAIN      499    576       PABC.
FT   DOMAIN      583    920       HECT.
FT   COMPBIAS    108    119       Asp/Glu-rich (acidic).
FT   COMPBIAS    158    181       Pro-rich.
FT   COMPBIAS    451    470       Arg/Glu-rich (mixed charge).
FT   COMPBIAS    479    488       Arg/Asp-rich (mixed charge).
FT   COMPBIAS    610    621       Asp/Glu-rich (acidic).
FT   COMPBIAS    858    878       Pro-rich.
FT   ACT_SITE    889    889       Glycyl thioester intermediate (By
FT                                similarity).
FT   MOD_RES      91     91       Phosphothreonine (By similarity).
FT   MOD_RES     193    193       Phosphoserine (By similarity).
FT   MOD_RES     607    607       Phosphoserine (By similarity).
FT   NON_TER       1      1
SQ   SEQUENCE   920 AA;  103950 MW;  465771084536C3AA CRC64;
     ARRERMTARE EASLRTLEGR RRATLLSARQ GMMSARGDFL NYALSLMRSH NDEHSDVLPV
     LDVCSLKHVA YVFQALIYWI KAMNQQTTLD TPQLERKRTR ELLELGIDNE DSEHENDDDT
     SQSATLNDKD DESLPAETGQ NHPFFRRSDS MTFLGCIPPN PFEVPLAEAI PLADQPHLLQ
     PNARKEDLFG RPSQGLYSSS AGSGKCLVEV TMDRNCLEVL PTKMSYAANL KNVMNMQNRQ
     KKAGEDQSML AEEADSSKPG PSAHDVAAQL KSSLLAEIGL TESEGPPLTS FRPQCSFMGM
     VISHDMLLGR WRLSLELFGR VFMEDVGAEP GSILTELGGF EVKESKFRRE MEKLRNQQSR
     DLSLEVDRDR DLLIQQTMRQ LNNHFGRRCA TTPMAVHRVK VTFKDEPGEG SGVARSFYTA
     IAQAFLSNEK LPNLDCIQNA NKGTHTSLMQ RLRNRGERDR EREREREMRR SSGLRAGSRR
     DRDRDFRRQL SIDTRPFRPA SEGNPSDDPD PLPAHRQALG ERLYPRVQAM QPAFASKITG
     MLLELSPAQL LLLLASEDSL RARVEEAMEL IVAHGRENGA DSILDLGLLD SSEKVQENRK
     RHGSSRSVVD MDLDDTDDGD DNAPLFYQPG KRGFYTPRPG KNTEARLNCF RNIGRILGLC
     LLQNELCPIT LNRHVIKVLL GRKVNWHDFA FFDPVMYESL RQLILASQSS DADAVFSAMD
     LAFAVDLCKE EGGGQVELIP NGVNIPVTPQ NVYEYVRKYA EHRMLVVAEQ PLHAMRKGLL
     DVLPKNSLED LTAEDFRLLV NGCGEVNVQM LISFTSFNDE SGENAEKLLQ FKRWFWSIVE
     RMSMTERQDL VYFWTSSPSL PASEEGFQPM PSITIRPPDD QHLPTANTCI SRLYVPLYSS
     KQILKQKLLL AIKTKNFGFV
//
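
Each line of the entry above begins with a two-letter line code, and the residues follow the SQ line in space-separated groups until the "//" terminator. As a rough illustration, the Python sketch below pulls out the entry name, the accession numbers and the sequence. It is not the EMBOSS parser: the function name read_swissprot is hypothetical, it reads only the first entry in the text, and it ignores every line code other than ID, AC and SQ.

```python
def read_swissprot(text):
    """Return (entry_name, accessions, sequence) for the first entry."""
    entry_name, accessions, residues = None, [], []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of entry
        if in_sequence:
            # Sequence lines are indented and split into groups by spaces.
            residues.append("".join(line.split()))
        elif line.startswith("ID"):
            entry_name = line[5:].split()[0]
        elif line.startswith("AC"):
            # Accession numbers are separated and terminated by ';'.
            accessions += [a.strip() for a in line[5:].split(";") if a.strip()]
        elif line.startswith("SQ"):
            in_sequence = True
    return entry_name, accessions, "".join(residues)
```

Applied to the example entry, this returns the name UBR5_RAT and the single accession Q62671, together with the residues as one unbroken string.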

A.1.51. Text/Plain

Plain is the "no format" format: the entire file contents are read in as a sequence, so the file must contain no annotation, comments or heading lines. Nothing is rejected in this format: any character will be included in the sequence, even digits and punctuation. This format is not detected automatically. Specify -sformat text only when you are sure that the input sequence file is correct and contains only what you want to be considered as your sequence. The safer "raw" format reads only sequence characters and rejects input with other data. The raw format can be detected automatically.

ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtc
gccaatatgcagctctttgtccgcgcccaggagctacacaccttcgaggt
gaccggccaggaaacggtcgcccagatcaaggctcatgtagcctcactgg
agggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctg
gaggatgaggccactctgggccagtgcggggtggaggccctgactaccct
ggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaag
aagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcg
ctttgtcaacgttgtgcccacctttggcaagaagaagggccccaatgcca
actcttaagtcttttgtaattctggctttctctaataaaaaagccactta
gttcagtcaaaaaaaaaa
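
The practical difference between the two behaviours described above can be sketched in Python. This is an illustration only and does not reproduce the EMBOSS code: the function names are hypothetical, the set of characters treated as valid sequence characters (letters, "*" and "-") is an assumption, and so is the choice to drop only line breaks in the plain reader.

```python
import string

# Assumed set of acceptable sequence characters for the strict reader.
SEQUENCE_CHARS = set(string.ascii_letters + "*-")

def read_plain(text):
    """Accept everything: only line breaks are removed."""
    return "".join(text.splitlines())

def read_raw(text):
    """Accept sequence characters only; reject anything else."""
    seq = "".join(text.split())  # drop all whitespace
    unexpected = set(seq) - SEQUENCE_CHARS
    if unexpected:
        raise ValueError("non-sequence characters: " + "".join(sorted(unexpected)))
    return seq
```

With input such as a numbered listing, the plain reader would silently keep the digits as part of the sequence, whereas the strict reader would stop with an error.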

A.1.52. Treecon

Format used by the Treecon package for the construction and drawing of evolutionary distance trees. For further information see:

http://bioinformatics.psb.ugent.be/software/details/TREECON
2016
HSFAU
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccctgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgcccacctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttctctaataaaaaagccacttagttcagtcaaaaaaaaaa
HSFAU1
ctaccattttccctctcgattctatatgtacactcgggacaagttctcctgatcgaaaacggcaaaactaaggccccaagtaggaatgccttagttttcggggttaacaatgattaacactgagcctcacacccacgcgatgccctcagctcctcgctcagcgctctcaccaacagccgtagcccgcagccccgctggacaccggttctccatccccgcagcgtagcccggaacatggtagctgccatctttacctgctacgccagccttctgtgcgcgcaactgtctggtcccgccccgtcctgcgcgagctgctgcccaggcaggttcgccggtgcgagcgtaaaggggcggagctaggactgccttgggcggtacaaatagcagggaaccgcgcggtcgctcagcagtgacgtgacacgcagcccacggtctgtactgacgcgccctcgcttcttcctctttctcgactccatcttcgcggtagctgggaccgccgttcaggtaagaatggggccttggctggatccgaagggcttgtagcaggttggctgcggggtcagaaggcgcggggggaaccgaagaacggggcctgctccgtggccctgctccagtccctatccgaactccttgggaggcactggccttccgcacgtgagccgccgcgaccaccatcccgtcgcgatcgtttctggaccgctttccactcccaaatctcctttatcccagagcatttcttggcttctcttacaagccgtcttttctttactcagtcgccaatatgcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcccagatcaaggtaaggctgcttggtgcgccctgggttccattttcttgtgctcttcactctcgcggcccgagggaacgcttacgagccttatctttccctgtaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccctgactaccctggaagtagcaggccgcatgcttggaggtgagtgagagaggaatgttctttgaagtaccggtaagcgtctagtgagtgtggggtgcatagtcctgacagctgagtgtcacacctatggtaatagagtacttctcactgtcttcagttcagagtgattcttcctgtttacatccctcatgttgaacacagacgtccatgggagactgagccagagtgtagttgtatttcagtcacatcacgagatcctagtctggttatcagcttccacactaaaaattaggtcagaccaggccccaaagtgctctataaattagaagctggaagatcctgaaatgaaacttaagatttcaaggtcaaatatctgcaactttgttctcattacctattgggcgcagcttctctttaaaggcttgaattgagaaaagaggggttctgctgggtggcaccttcttgctcttacctgctggtgccttcctttcccactacaggtaaagtccatggttccctggcccgtgctggaaaagtgagaggtcagactcctaaggtgagtgagagtattagtggtcatggtgttaggactttttttcctttcacagctaaaccaagtccctgggctcttactcggtttgccttctccctccctggagatgagcctgagggaagggatgctaggtgtggaagacaggaaccagggcctgattaaccttcccttctccaggtggccaaacaggagaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgcccacctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttctctaataaaaaagccacttagttcagtcatcgcattgtttcatctttacttgcaaggcctcaggga
gaggtgtgcttctcgg