[EMBOSS] Problems with GenBank indexing
jison at ebi.ac.uk
Fri Apr 7 08:02:50 EDT 2006
By default, dbifasta will index the ID name and the accession number (if present).
To index the Sequence Version, GI number and words in the description, you must
run dbifasta with the '-fields' qualifier, e.g. "-fields acc", "-fields sv acc"
etc. If you don't, you will not be able to retrieve by those fields. Please
note that dbifasta only retrieves the first of any duplicate entries. As far as
I'm aware, dbxfasta can retrieve duplicate entries.
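To show why the "Duplicate ID skipped" warning appears, here is a rough Python sketch. It is not EMBOSS code and makes no claim about how dbifasta or dbxfasta work internally. It assumes the header layout gi|<gi>|<db>|<accession.version>|<locus>, and it assumes the locus name is used as the ID when present, with the accession as fallback. The function names are made up for illustration. The headers are the ones from your message.

```python
from collections import defaultdict

# Header lines copied from the original message (leading '>' removed).
HEADERS = [
    "gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]",
    "gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]",
    "gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]",
]


def parse_ncbi_header(header):
    """Split an NCBI-style FASTA header into illustrative fields.

    Assumed layout: gi|<gi>|<db>|<accession.version>|<locus> <description>
    """
    token, _, description = header.partition(" ")
    parts = token.split("|")
    gi = parts[1]
    sv = parts[3]                      # accession with version, e.g. AAG13419.1
    locus = parts[4] if len(parts) > 4 else ""
    acc = sv.split(".")[0]             # accession without the version suffix
    # Assumption: the locus name serves as the ID, else the accession.
    return {"id": locus or acc, "acc": acc, "sv": sv, "gi": gi,
            "des": description}


def first_wins_index(records):
    """Keep only the first record per ID and list the IDs that were skipped."""
    index, skipped = {}, []
    for rec in records:
        if rec["id"] in index:
            skipped.append(rec["id"])
        else:
            index[rec["id"]] = rec
    return index, skipped


def keep_all_index(records):
    """Keep every record per ID, so duplicates stay retrievable."""
    index = defaultdict(list)
    for rec in records:
        index[rec["id"]].append(rec)
    return index


if __name__ == "__main__":
    records = [parse_ncbi_header(h) for h in HEADERS]
    index, skipped = first_wins_index(records)
    print("skipped duplicate IDs:", skipped)
    print("entries kept for AC000348_16:",
          len(keep_all_index(records)["AC000348_16"]))
```

With a first-wins index, a lookup by AC000348_16 can only ever return the first entry, which matches what you are seeing. An index that stores a list per ID keeps both.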
Does that help? Feel free to get back in touch.
> Hi everybody,
> I was trying to retrieve fasta protein sequences from GenBank by id
> using seqret but it was not possible for every id. However, retrieval by
> GI is allowed.
> Additionally, during the indexing process (dbifasta) I've obtained some
> errors like this one:
> Warning: Duplicate ID skipped: 'AC000348_16' All hits will point to
> first ID found
> I was looking for an explanation to this behaviour and I've found that
> skipped IDs correspond to CDS from genomic sequences and have this format:
> >gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]
> >gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]
> In the previous entries, when I try to retrieve one of them by the first
> identifier (gi), I can get both of them. When I try to do retrievals
> using the last identifier (AC000348_16), I only get the first one. But
> it's impossible to do retrievals by second identifier (AAG13419.1 and
> However, sequences with the following format can be well indexed:
> >gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]
> MKMVSSSRLRCLLVLLLSLTASISCSFAGQRDSKLRLLLHRYPLQGSKQDMTRSALAELLLSDLLQGENE ...
> and these sequences can be well retrieved by first and second
> identifiers (64029 and CAA23986.1).
> Does anybody know how to solve these problems?
> Thanks in advance,