I think that using the C code provided by Mark Rages (with the little
correction I added) is probably the easiest way to go. If that is too
slow, then you probably need to put some serious thought into reducing the
problem (perhaps look for common bits in the short DNA sequences to make
the search more efficient). This would be a serious programming project.
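One way to read "look for common bits" is to group the probes by a shared prefix, so that each position in a gene is compared only against probes that could start there. The following is a minimal Python sketch of that idea, not the C code discussed above. The function names, the prefix length `k`, and the toy sequences are assumptions made for illustration. It also assumes literal probes (no character classes such as `[AG]`) that are each at least `k` letters long.

```python
from collections import defaultdict

def group_probes_by_prefix(probes, k):
    """Bucket probes by their first k letters (probes: name -> sequence)."""
    groups = defaultdict(list)
    for name, seq in probes.items():
        groups[seq[:k]].append((name, seq))
    return groups

def scan_gene(gene_seq, groups, k):
    """Return (probe name, start, end) for every exact match, 1-based inclusive."""
    hits = []
    for i in range(len(gene_seq) - k + 1):
        # Only probes sharing this k-letter prefix can match at offset i.
        for name, seq in groups.get(gene_seq[i:i + k], ()):
            if gene_seq.startswith(seq, i):
                hits.append((name, i + 1, i + len(seq)))
    return hits

# Toy data, invented for the example.
probes = {"Probe1": "AAGGCC", "Probe2": "CCGACGT"}
groups = group_probes_by_prefix(probes, k=5)
for hit in scan_gene("TTAAGGCCTTAAGGCCA", groups, k=5):
    print(hit)
```

With 5-letter prefixes there are at most 1024 buckets, so for 600k probes each bucket would still hold several hundred candidates. A longer prefix narrows the buckets, but probes shorter than the prefix would then need separate handling.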
On Sun, 1 Jul 2007, Jack Smith wrote:
You hit the nail right on the head with what I need to do, Dr. Smith. My
project is doing gene sequence to DNA probe mapping. I have a file with
600k lines of 5-50 base-pair (letter) probes and I need to see if there
are sequences that are identical to the probes' sequences in the DNA
sequences. The chromosomal DNA sequence fragments are roughly 500-1500
bp long and there are about 29k of them. I need to see any and all
matches between the probes and the chromosomal DNA as well as where in
that DNA sequence the match occurs. In short, I want something like
this:
Probename  Sequence  Gene name  Match Start BP  Match End BP
Probe1     AAGGCC    Gene1      50              55
Probe1     AAGGCC    Gene1      95              100
Probe2     CCGACGT   Gene1
Probe3     [AG]CCT   Gene1      65              68
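A table of this shape can be sketched in Python by treating each probe as a regular expression, which also covers bracketed alternatives such as `[AG]CCT`. This is only an illustration under assumptions: the function names and the toy gene sequence are invented, positions are reported as 1-based and inclusive to match the table, and a probe with no match gets a row with empty position fields.

```python
import re

def find_all_matches(probe, gene_seq):
    """Return 1-based inclusive (start, end) for every match, overlaps included."""
    # The zero-width lookahead makes the scan advance one base at a time,
    # so overlapping occurrences are not skipped.
    pattern = re.compile("(?=(" + probe + "))")
    return [(m.start(1) + 1, m.end(1)) for m in pattern.finditer(gene_seq)]

def report_rows(probes, gene_name, gene_seq):
    """Build one row per match, or one row with blank positions if none."""
    rows = []
    for probe_name, probe in probes:
        matches = find_all_matches(probe, gene_seq)
        if not matches:
            rows.append((probe_name, probe, gene_name, "", ""))
        for start, end in matches:
            rows.append((probe_name, probe, gene_name, start, end))
    return rows

# Toy data, invented for the example.
probes = [("Probe1", "AAGGCC"), ("Probe2", "CCGACGT"), ("Probe3", "[AG]CCT")]
for row in report_rows(probes, "Gene1", "TTAAGGCCTTGCCTAAGGCC"):
    print("\t".join(str(field) for field in row))
```

Compiling one pattern per probe and scanning every gene with it is simple but does roughly probes times genes scans, so at 600k probes against 29k genes it would likely need to be combined with some way of narrowing the candidates first.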
My MATLAB code will output the multiple matches between a probe and a
gene like so:
match_start(<gene#>,<probe#>)
[3xDouble]
Which can be read by:
match_start{<gene#>,<probe#>}
137 267 802
And the length of the probe added to the first array yields the ending
BP:
match_end{<gene#>,<probe#>}
143 273 808
That's fine and dandy, but it's too slow to crunch the data.
The C code I have seems pretty fast: it can compare the same 8000 probes
against one gene in 0.4 seconds, versus 16 seconds in MATLAB. But it
does not report multiple matches.
This looks like a huge operation.
Yes, it sure is. The gene sequence files are in FASTA format, for which
there are some MATLAB import scripts. (FASTA is still plaintext, and the
files are easily grepped and read into other programs.) I would have
stuck with MATLAB, as it was pretty quick and easy to get running and it
works, but it runs far too slowly and uses too much RAM to be very
useful. That is why I was going to try to perform the operations using
something other than MATLAB if at all possible. But my knowledge of the
different languages and utilities is not complete, and as such I ran
into a little stumbling block and had to ask for help.
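Because FASTA is plain text (a header line starting with `>` followed by one or more sequence lines), it can be read one record at a time without loading a whole file into RAM. Below is a minimal Python sketch of such a reader. The function name is made up for the example, and it assumes well-formed input with no comment lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them and normalize case.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

# Toy input, invented for the example; an open file handle works the same way.
sample = [">Gene1 example fragment", "ttaaggcc", "TTAAGGCCA", "", ">Gene2", "ACGT"]
for name, seq in read_fasta(sample):
    print(name, len(seq))
```

Since the function is a generator, passing it an open file handle keeps only one gene in memory at a time.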
So I suppose using Perl would be what I want to do to get my output? I
can try to hack my way through some Perl if need be. I just need to get
my output in a reasonable amount of time and with a reasonable amount
of RAM: something less than weeks and something less than four gigs.
HDD space is no problem, as this machine has a half-terabyte RAID 5 in
it.
Jack
_______________________________________________
members mailing list
EMAIL:PROTECTED
http://mlug.missouri.edu/mailman/listinfo/members