[EMBOSS] Problems with GenBank indexing

Thu Apr 6 03:56:06 EDT 2006

Hi everybody,

I was trying to retrieve FASTA protein sequences from GenBank by ID 
using seqret, but it did not work for every ID. Retrieval by GI, 
however, does work.

In addition, during indexing (dbifasta) I got some warnings like this 
one:

Warning: Duplicate ID skipped: 'AC000348_16' All hits will point to first ID found

Looking for an explanation for this behaviour, I found that the skipped 
IDs correspond to CDSs from genomic sequences and have this format:

 >gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]
MELPDVPVWRRVIVSAFFEALTFNIDIEEERSEIMMKTGAVVSNPRSRVKWDAFLSFQRDTSHNFTDRLY...
 >gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]
MSVVLQITKDWVQALLGFLLLSFANISTRTNHKHFPHGSCSSIMAGFWIYMYIYSYLFITLKIIDLTS...

For the entries above, retrieving by the first identifier (the GI 
number) works for both of them. Retrieving by the last identifier 
(AC000348_16) returns only the first entry. Retrieving by the second 
identifier (AAG13419.1 or AAF79863.1) is not possible at all.
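
To check how widespread this is, here is a small Python sketch (my own 
helper, not part of EMBOSS; the function names are made up). It assumes 
the duplicated ID in the warning is the field after the fourth '|' and 
simply counts how often each such field occurs in a list of headers:

```python
from collections import Counter

def trailing_id(defline):
    """Return the text after the 4th '|' up to the first space ('' if none)."""
    fields = defline.strip().lstrip(">").split("|", 4)
    if len(fields) < 5:
        return ""
    return fields[4].split(" ", 1)[0]

def duplicated_trailing_ids(deflines):
    """Map each non-empty trailing ID seen more than once to its count."""
    counts = Counter(trailing_id(d) for d in deflines)
    return {tid: n for tid, n in counts.items() if tid and n > 1}

# The two headers quoted above.
headers = [
    ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]",
    ">gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]",
]
print(duplicated_trailing_ids(headers))  # {'AC000348_16': 2}
```

This only inspects the header text; I don't know whether dbifasta picks 
its ID in exactly this way.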

However, sequences with the following format are indexed correctly:

 >gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]
MKMVSSSRLRCLLVLLLSLTASISCSFAGQRDSKLRLLLHRYPLQGSKQDMTRSALAELLLSDLLQGENE ...

and these sequences can be retrieved by both the first and second 
identifiers (64029 and CAA23986.1).
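
The only difference I can see between the two header styles is the field 
after the last '|'. The Python sketch below (again my own illustration, 
with made-up names, not EMBOSS code) splits a header into its fields, 
assuming the layout gi|number|database|accession|locus followed by a 
description:

```python
def split_defline(defline):
    """Split a 'gi|number|db|accession|locus description' header into fields."""
    head = defline.strip().lstrip(">")
    fields = head.split("|", 4)
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("unexpected header layout: " + defline)
    # The locus is whatever directly follows the 4th '|', up to the first space.
    locus, _, description = fields[4].partition(" ")
    return {
        "gi": fields[1],
        "db": fields[2],
        "accession": fields[3],
        "locus": locus,
        "description": description.strip(),
    }

problem = split_defline(
    ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]"
)
working = split_defline(
    ">gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]"
)
print(problem["accession"], repr(problem["locus"]))
print(working["accession"], repr(working["locus"]))
```

In the problem header the last field holds AC000348_16, while in the 
working one it is empty, so my guess is that a non-empty last field is 
used as the ID in place of the accession.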

Does anybody know how to solve these problems?
Thanks in advance,
Natalia