[EMBOSS] needle question!
jison at ebi.ac.uk
Mon Feb 12 17:48:00 EST 2007
If I understand you correctly, you want 'N' bases to be totally
"invisible" during the generation and scoring of the alignment.
Scoring the alignment in the way you describe would, I think, require some
(probably trivial) reprogramming of needle, via a new "advanced" option.
It could be done for the next release or sooner. How urgently do you need it?
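Until such an option exists, the N-free figure could be computed by
post-processing the aligned sequences. The following is only a minimal
sketch, not part of needle: the function name identity_excluding_n is
hypothetical, the two aligned rows are assumed to be available as
equal-length strings with "-" for gaps, and gap columns are assumed to
count as mismatches.

```python
def identity_excluding_n(aligned_a, aligned_b, skip="Nn"):
    """Return (matches, informative_columns) for two gapped, equal-length rows.

    Any column in which either row holds a character from `skip` is left
    out of both the numerator and the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    matches = 0
    informative = 0
    for a, b in zip(aligned_a, aligned_b):
        if a in skip or b in skip:
            continue  # column involves N: ignore it entirely
        informative += 1
        # Assumption: a gap never counts as a match.
        if a != "-" and a.upper() == b.upper():
            matches += 1
    return matches, informative


# The case from the question: 100 columns, 10 of them N, 5 real mismatches.
row_a = "A" * 85 + "C" * 5 + "N" * 10
row_b = "A" * 85 + "G" * 5 + "A" * 10
matches, informative = identity_excluding_n(row_a, row_b)
print(f"{matches}/{informative} ({100 * matches / informative:.1f}%)")
# prints: 85/90 (94.4%)
```

This only changes how the finished alignment is summarised. It does not
change which alignment needle chooses.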
Scoring in the way you describe is a reasonable thing to do, but if
N is not to contribute to the score it should not contribute to the
alignment either, so you'd need to adjust the scoring matrix so that
all substitutions involving N are neutral - I guess by specifying a
value of zero for them.
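As a rough illustration of that matrix change, the sketch below sets every
score involving N to zero. It assumes the matrix file uses the usual
whitespace-separated layout ("#" comment lines, one header line of column
symbols, then one row per symbol); the function name neutralise_symbol and
the toy matrix values are made up for the example and are not the real
EDNAFULL scores.

```python
def neutralise_symbol(matrix_text, symbol="N", value=0):
    """Return matrix text with every score involving `symbol` set to `value`."""
    out = []
    columns = None
    col = None
    for line in matrix_text.splitlines():
        fields = line.split()
        if not fields or line.startswith("#"):
            out.append(line)  # keep comments and blank lines unchanged
            continue
        if columns is None:
            # First data line is assumed to be the header of column symbols.
            columns = fields
            if symbol not in columns:
                raise ValueError(f"{symbol!r} is not a column in this matrix")
            col = columns.index(symbol)
            out.append(line)
            continue
        row_symbol, scores = fields[0], fields[1:]
        if len(scores) != len(columns):
            raise ValueError(f"row {row_symbol!r} has the wrong number of scores")
        if row_symbol == symbol:
            scores = [str(value)] * len(scores)  # the whole N row
        else:
            scores[col] = str(value)  # the N column in every other row
        out.append(" ".join([row_symbol] + scores))
    return "\n".join(out) + "\n"


# Toy matrix for illustration only.
toy = """# toy matrix, values are illustrative only
    A   C   N
A   5  -4  -2
C  -4   5  -2
N  -2  -2  -1
"""
result = neutralise_symbol(toy)
print(result)
```

The edited text would then be saved under a new file name in the working
directory, so that the default matrix is left untouched.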
Just my two penn'orth :)
> I have an additional question about needle, as I would like to
> actually remove noninformative bases from the final alignment score:
> ie. If the sequence follows
> With suggested matrix weight changes I would expect to see a 100%
> similarity of 10/10 bases
> However, it is more informative to me to see 100% similarity of 7/7
> bases (with N no longer aiding my alignment score). One could imagine
> an artificial similarity score inflation if the entire length is used
> to generate the score, i.e. if 100 bases were being aligned to a 100 bp
> sequence (containing 10 "Ns"), and 5 of those bases were
> informative mismatches:
> Needle would currently provide:
> 95/100 (or simply 95% similarity)
> But the answer needed would be:
> 85/90 (or 94.4% similarity).
> Does this make sense?
> Thank you in advance for any help you can offer!
> On 2/8/07, Karen Hayden <kehayden at gmail.com> wrote:
>> Hey Peter,
>> That was absolutely perfect. Thank you!
>> Best wishes,
>> On 2/8/07, pmr at ebi.ac.uk <pmr at ebi.ac.uk> wrote:
>> > Dear Karen,
>> > > I am currently using needle to generate an alignment between two
>> > > sequences which contain non-informative bases (i.e., bases identified
>> > > as low quality by their phred scores and changed to "N").
>> > > Presently, these bases are penalized as any other non-matching
>> > > character. Is there any way to change needle to "overlook" these
>> > > bases when generating the best scoring alignment (or, do I need to
>> > > write my own version of needle?)
>> > There are two matrix files for nucleotide comparisons. The default is
>> > EDNAFULL which counts N as an average of all possible scores (1 match
>> > against 3 possible mismatches).
>> > The alternative is EDNAMAT which only scores exact matches like blastn
>> > (use -data EDNAMAT on the command line to see the difference).
>> > But you can also copy EDNAFULL to your local directory with
>> > embossdata EDNAFULL -fetch
>> > mv EDNAFULL EDNAPHRED
>> > (best to do this rename or you will accidentally be using this file by
>> > default for other needle runs in the same directory)
>> > edit EDNAPHRED to have the scores you want for N (perhaps +1 for a small
>> > match to ACGTU, +2 for a match to a 2-base code RYSWKM, +3 for a match to
>> > a 3-base code BDHV and +4 for a match to another N).
>> > Then run with:
>> > needle -data EDNAPHRED
>> > If enough users think this is a meaningful scoring system we could add
>> > such a matrix to the distribution. Let us know if it really gives you more
>> > useful scores. My natural prejudice is to trust EDNAFULL. I guess you are
>> > expecting to often find the base in the other sequence is the one phred
>> > started with, which will indeed bias the scoring.
>> > Hope this helps,
>> > Peter
>> Karen E. Hayden
>> Starving Graduate Student
>> Duke University
>> Durham, NC 27708