MLUG: Re: [MLUG] String manipulation in C

I think that using the C code provided by Mark Rages (with the little correction I added) is probably the easiest way to go. If that is too slow, then you probably need to put some serious thought into reducing the problem (perhaps look for common bits in the short DNA sequences to make the search more efficient). This would be a serious programming project.
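One possible reading of "reducing the problem", sketched below under several assumptions: index every 5-base window of a gene once, then test each probe only at the positions where its first 5 bases occur. Note that this indexes the gene rather than looking for common bits among the probes, so it is a swapped-in variant of the idea, not the method proposed above. The names KmerIndex, kmer_index_build and kmer_index_find are hypothetical, K = 5 is taken from the shortest probe length mentioned in the thread, and only literal upper-case A/C/G/T probes are handled (no [AG]-style classes).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define K 5              /* assumed: shortest probe length in the thread */
#define NBUCKETS 1024    /* 4^K possible K-mers over A, C, G, T */

/* Map a base to 0..3, or -1 for anything else (including '\0'). */
static int base_code(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Encode the first K bases of s as a bucket number, or -1 if invalid. */
static int kmer_code(const char *s)
{
    int code = 0;
    for (int i = 0; i < K; i++) {
        int b = base_code(s[i]);
        if (b < 0)
            return -1;   /* stops before reading past a terminating '\0' */
        code = code * 4 + b;
    }
    return code;
}

typedef struct {
    size_t *pos;                  /* gene offsets, grouped by bucket */
    size_t first[NBUCKETS + 1];   /* bucket b is pos[first[b] .. first[b+1]) */
} KmerIndex;

/* Build the index with a counting sort. Returns 0 on success, -1 on
 * allocation failure. The caller frees ix->pos. */
int kmer_index_build(KmerIndex *ix, const char *gene)
{
    size_t n, fill[NBUCKETS];

    assert(ix != NULL && gene != NULL);
    n = strlen(gene);
    memset(ix->first, 0, sizeof ix->first);
    for (size_t i = 0; i + K <= n; i++) {
        int c = kmer_code(gene + i);
        if (c >= 0)
            ix->first[c + 1]++;
    }
    for (int b = 0; b < NBUCKETS; b++)
        ix->first[b + 1] += ix->first[b];

    ix->pos = malloc((ix->first[NBUCKETS] + 1) * sizeof *ix->pos);
    if (ix->pos == NULL)
        return -1;
    memcpy(fill, ix->first, sizeof fill);
    for (size_t i = 0; i + K <= n; i++) {
        int c = kmer_code(gene + i);
        if (c >= 0)
            ix->pos[fill[c]++] = i;
    }
    return 0;
}

/* Store up to max_hits 1-based match starts in starts[], in ascending
 * order, and return the total number of matches of probe in gene. */
size_t kmer_index_find(const KmerIndex *ix, const char *gene,
                       const char *probe, size_t *starts, size_t max_hits)
{
    size_t plen, count = 0;
    int c;

    assert(ix != NULL && gene != NULL && probe != NULL);
    plen = strlen(probe);
    c = kmer_code(probe);
    if (c < 0)
        return 0;        /* probe shorter than K or not plain ACGT */
    for (size_t j = ix->first[c]; j < ix->first[c + 1]; j++) {
        size_t i = ix->pos[j];
        /* strncmp stops at the gene's '\0', so this cannot overrun. */
        if (strncmp(gene + i, probe, plen) == 0) {
            if (count < max_hits)
                starts[count] = i + 1;
            count++;
        }
    }
    return count;
}
```

The index is built once per gene and reused for every probe, so the per-probe cost depends on how often the probe's first 5 bases occur rather than on the gene length. Whether that actually beats a plain scan on this data set would have to be measured.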


On Sun, 1 Jul 2007, Jack Smith wrote:

You hit the nail right on the head with what I need to do, Dr. Smith. My
project is doing gene sequence to DNA probe mapping. I have a file with
600k lines of 5-50 base-pair (letter) probes and I need to see if there
are sequences that are identical to the probes' sequences in the DNA
sequences. The chromosomal DNA sequence fragments are roughly 500-1500
bp long and there are about 29k of them. I need to see any and all
matches between the probes and the chromosomal DNA as well as where in
that DNA sequence the match occurs. In short, I want something like
this:

Probename  Sequence    Gene name   Match Start BP   Match End BP
Probe1     AAGGCC      Gene1       50               55
Probe1     AAGGCC      Gene1       95               100
Probe2     CCGACGT     Gene1
Probe3     [AG]CCT     Gene1       65               68
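A probe such as [AG]CCT is not a literal string, so a plain substring search will not find it. A minimal sketch of one way to get the Match Start BP and Match End BP columns for such probes, assuming a POSIX system with <regex.h> and assuming the bracket notation means an ordinary regular-expression character class; the function name find_all_regex is hypothetical, and positions are 1-based and inclusive as in the table above.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: find every match of a POSIX extended regex
 * (e.g. "[AG]CCT") in gene, overlapping matches included.
 * Stores up to max_hits 1-based inclusive start/end positions and
 * returns the total number of matches (0 if the pattern is invalid).
 * Assumes the pattern contains no ^ or $ anchors. */
size_t find_all_regex(const char *gene, const char *pattern,
                      size_t *starts, size_t *ends, size_t max_hits)
{
    regex_t re;
    regmatch_t m;
    size_t count = 0;
    size_t offset = 0;

    assert(gene != NULL && pattern != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;

    while (gene[offset] != '\0' &&
           regexec(&re, gene + offset, 1, &m,
                   offset ? REG_NOTBOL : 0) == 0) {
        if (m.rm_eo == m.rm_so)
            break;                      /* guard against empty matches */
        if (count < max_hits) {
            starts[count] = offset + (size_t)m.rm_so + 1;  /* 1-based */
            ends[count]   = offset + (size_t)m.rm_eo;      /* inclusive */
        }
        count++;
        offset += (size_t)m.rm_so + 1;  /* resume one base past the start */
    }
    regfree(&re);
    return count;
}
```

Compiling the pattern on every call is wasteful when one probe is run against many genes; in that case the regcomp step would be hoisted out and the compiled regex_t reused.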

My MATLAB code will output the multiple matches between a probe and a
gene like so:

match_start(<gene#>,<probe#>)

    [3xDouble]

Which can be read by:
match_start{<gene#>,<probe#>}

    137   267   802

And the length of the probe added to the first array yields the ending
BP:
match_end{<gene#>,<probe#>}
    143   273   808

That's fine and dandy, but it's too slow to crunch the data.

The C code I have seems pretty fast: it can compare the same 8000 probes
against one gene in 0.4 seconds, versus 16 seconds to do it in MATLAB.
But it does not do multiple matches.
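For literal probes, multiple matches can be collected by calling strstr in a loop and restarting one base past each hit. This is a minimal sketch, not the C code posted earlier in the thread; the function name find_all_matches and its interface are hypothetical. Positions are 1-based, and the inclusive end of a match is start + strlen(probe) - 1, which agrees with the AAGGCC 50..55 row of the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: record the 1-based start of every occurrence
 * of probe in gene, overlapping occurrences included. Stores at most
 * max_hits positions in starts[] and returns the total number found. */
size_t find_all_matches(const char *gene, const char *probe,
                        size_t *starts, size_t max_hits)
{
    size_t count = 0;
    const char *p = gene;

    assert(gene != NULL && probe != NULL);
    if (probe[0] == '\0')
        return 0;                /* an empty probe is treated as no match */

    while ((p = strstr(p, probe)) != NULL) {
        if (count < max_hits)
            starts[count] = (size_t)(p - gene) + 1;   /* 1-based BP */
        count++;
        p++;                     /* step one base so overlaps are found */
    }
    return count;
}
```

Because the return value is the total count even when it exceeds max_hits, a caller can detect a too-small starts[] buffer and retry with a larger one.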

This looks like a huge operation.

Yes, it sure is. The gene sequence files are in FASTA format, for which there are some MATLAB import scripts. (FASTA is still plaintext, and the files are easily grepped and read into other programs.) I would have stuck with MATLAB, as it was pretty quick and easy to get running and it works, but it runs far too slowly and uses too much RAM to be very useful. That is why I was going to try to perform the operations using something other than MATLAB if at all possible. But my knowledge of the different languages and utilities is not complete, and as such I ran into a little stumbling block and had to ask for help.

So I suppose using Perl would be what I want to do to get my output? I
can try to hack my way through some Perl if need be. I just need to get
my output and get it done in a reasonable amount of time and with a
reasonable amount of RAM: something less than weeks and something less
than four gigs. HDD space is no problem as this machine has a
half-terabyte RAID 5 in it.

Jack


_______________________________________________ members mailing list EMAIL:PROTECTED http://mlug.missouri.edu/mailman/listinfo/members

