[EMBOSS] Problems with GenBank indexing

Peter Rice pmr at ebi.ac.uk
Mon Apr 10 06:44:47 EDT 2006


Natalia Jimenez Lozano wrote:

> I was looking for an explanation to this behaviour and I've found that 
> skipped IDs correspond to CDS from genomic sequences and have this format:
> 
>  >gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]
> MELPDVPVWRRVIVSAFFEALTFNIDIEEERSEIMMKTGAVVSNPRSRVKWDAFLSFQRDTSHNFTDRLY...
>  >gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]
> MSVVLQITKDWVQALLGFLLLSFANISTRTNHKHFPHGSCSSIMAGFWIYMYIYSYLFITLKIIDLTS...

As Jon says, dbxfasta is a solution.

However, that is only a partial solution. The real problem is that these FASTA 
format sequences do indeed have duplicate IDs: both headers quoted above end 
in AC000348_16.
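
To see which entries collide before indexing, something like the following 
sketch could be used. It assumes the ID is taken from the last "|"-separated 
field of the first word of each header, which is what the skipped IDs in the 
quoted headers suggest; the function names are made up for illustration and 
are not part of EMBOSS.

```python
from collections import Counter


def last_field_id(header):
    """Return the last '|'-separated field of the first word of a FASTA header.

    Assumption: this is the field being used as the entry ID.
    """
    first_word = header.lstrip(">").split()[0]
    return first_word.split("|")[-1]


def duplicate_ids(headers):
    """Return the sorted list of IDs that occur in more than one header."""
    counts = Counter(last_field_id(h) for h in headers)
    return sorted(entry_id for entry_id, n in counts.items() if n > 1)


# The two headers quoted in the original message.
headers = [
    ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]",
    ">gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]",
]

print(duplicate_ids(headers))
```

For the two quoted headers this reports AC000348_16 as the only duplicated ID.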

This is protein sequence data, so it is not GenBank - was this GenPept or some 
other database?

GenPept and other databases have been known to report "gb" or "emb" as the 
database for protein sequences!!!

A possible solution is to add a new ID format to dbifasta and dbxfasta that 
uses AAG13419 and AAF79863 as the IDs and ignores the AC000348_16 part.
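
As a rough illustration of what such an ID format would extract, here is a 
small sketch. It is not EMBOSS code: the function name is hypothetical, and it 
assumes headers laid out as gi|number|database|accession.version|locus, as in 
the two quoted examples.

```python
def accession_id(header):
    """Extract the accession (without version) from an NCBI-style FASTA header.

    Assumed layout of the first word: gi|<number>|<db>|<accession.version>|<locus>
    The trailing locus field (e.g. AC000348_16) is ignored.
    """
    fields = header.lstrip(">").split()[0].split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("unexpected header layout: " + header)
    # Drop the ".1"-style version suffix from the accession.
    return fields[3].split(".")[0]


# The two headers quoted in the original message.
headers = [
    ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]",
    ">gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]",
]

for h in headers:
    print(accession_id(h))
```

With the quoted headers this yields two distinct IDs, AAG13419 and AAF79863, 
so the entries no longer collide.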

Hope this helps,

Peter



