[EMBOSS] Seqret slowness.....
Richard.Rothery at ualberta.ca
Wed Oct 14 13:46:21 EDT 2009
I have been trying to update my sequence datasets using the seqret program.
Step 1 is that I blast my sequence against uniprot using the EXPASY server.
Unfortunately, because of the recent explosion of duplicate data
("environmental samples"), it is now necessary to download 1-2K sequences
and then filter out the random "environmental" sample-derived fragments etc.
Step 2 is assembling a list in gnumeric and exporting it as a multiline text
file with one entry per line in the format "uniprot:accession". Step 3 is
running the command "seqret @filename.txt". This is extraordinarily slow: it
takes >12 hours to download a fasta file containing 3K sequences. Is there a
way of speeding this up? I
used to be able to download directly from EXPASY, but the site now only
allows about 200-odd sequences to be selected and downloaded at a time.
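
For reference, the list file from step 2 could also be generated by a short script instead of a spreadsheet export. This is only a minimal sketch, assuming the accessions are already available as plain strings; the function names, the output file name and the example accessions are illustrative, and "uniprot" is simply the database name used in my list file.

```python
from pathlib import Path


def format_list_entries(accessions, database="uniprot"):
    """Return one 'database:accession' entry per accession.

    Blank entries and repeated accessions are skipped; input order is kept.
    """
    seen = set()
    entries = []
    for acc in accessions:
        acc = acc.strip()
        if not acc or acc in seen:
            continue
        seen.add(acc)
        entries.append(f"{database}:{acc}")
    return entries


def write_list_file(accessions, path, database="uniprot"):
    """Write the entries to a text file, one per line, for 'seqret @path'."""
    entries = format_list_entries(accessions, database)
    Path(path).write_text("".join(entry + "\n" for entry in entries))
    return entries


# Illustrative usage (accessions are examples only):
# write_list_file(["P09152", "P11349"], "filename.txt")
# and then on the command line: seqret @filename.txt
```

This only automates building the list file; it does not change how seqret retrieves each entry, so it would not by itself address the speed problem.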
Note that filtering sequence sets is very fast with the program cd-hit. This
takes about 10 seconds on an old P4 machine to remove sequences from the set
with >90% identity to any other, for example.
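
A minimal sketch of how such a cd-hit run could be driven from a script, assuming cd-hit is installed and on the PATH. The file names and function names are hypothetical; -i, -o and -c are cd-hit's input, output and sequence identity threshold options, with 0.9 corresponding to the 90% cutoff mentioned above.

```python
import subprocess


def cd_hit_command(fasta_in, fasta_out, identity=0.9):
    """Build the cd-hit argument list for clustering at the given identity."""
    return ["cd-hit", "-i", str(fasta_in), "-o", str(fasta_out), "-c", str(identity)]


def run_cd_hit(fasta_in, fasta_out, identity=0.9):
    """Run cd-hit and raise CalledProcessError if it exits with an error."""
    subprocess.run(cd_hit_command(fasta_in, fasta_out, identity), check=True)


# Illustrative usage (hypothetical file names):
# run_cd_hit("blast_hits.fasta", "blast_hits_nr90.fasta")
```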
I do not have the resources to install and index local databases.
Richard A. Rothery, Ph.D.
Membrane Protein Research Group,
Department of Biochemistry, University of Alberta,
Edmonton T6G 2H7
Ph. 780-492-2229 Fax. 780-492-0886