MLUG: Re: [MLUG] String manipulation in C

You hit the nail right on the head with what I need to do, Dr. Smith. My
project is mapping DNA probes to gene sequences. I have a file with
600k lines of probes, each 5-50 base pairs (letters) long, and I need to
find every place where a probe's sequence appears exactly in the
chromosomal DNA sequences. There are about 29k chromosomal DNA sequence
fragments, each roughly 500-1500 bp long. I need to see any and all
matches between the probes and the chromosomal DNA, as well as where in
that DNA sequence each match occurs. In short, I want something like
this:

Probename  Sequence    Gene name   Match Start BP   Match End BP
Probe1     AAGGCC      Gene1       50               55
Probe1     AAGGCC      Gene1       95               100
Probe2     CCGACGT     Gene1 
Probe3     [AG]CCT     Gene1       65               68
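
A probe like Probe3 contains a bracketed class ([AG] meaning "A or G"), so
a plain substring search will not find it. The following is only a rough
sketch of one way to handle that in C, assuming the gene is a
NUL-terminated string in memory and that bracketed classes are the only
special syntax in a probe. The names match_here and find_class_matches
are hypothetical and not taken from any existing code. Positions are
1-based and the end is inclusive, as in the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Try to match pattern at the very start of text. The pattern may hold
 * literal bases and bracketed classes such as [AG]. Returns the number
 * of bases consumed on success, or 0 on failure, on a malformed class,
 * or on an empty pattern.
 */
static size_t match_here(const char *text, const char *pattern)
{
    size_t used = 0;

    while (*pattern != '\0') {
        if (text[used] == '\0')
            return 0;                   /* ran out of sequence */
        if (*pattern == '[') {
            const char *close = strchr(pattern, ']');
            if (close == NULL)
                return 0;               /* malformed class, no ']' */
            /* is the current base one of the letters inside [...]? */
            if (memchr(pattern + 1, text[used],
                       (size_t)(close - pattern - 1)) == NULL)
                return 0;
            pattern = close + 1;
        } else {
            if (*pattern != text[used])
                return 0;
            pattern++;
        }
        used++;                         /* one base matched */
    }
    return used;
}

/*
 * Record the 1-based start and inclusive end of every match of pattern
 * in gene. At most max_hits pairs are stored; the return value is the
 * total number of matches found.
 */
size_t find_class_matches(const char *gene, const char *pattern,
                          size_t *starts, size_t *ends, size_t max_hits)
{
    assert(gene != NULL && pattern != NULL);
    size_t count = 0;

    for (size_t i = 0; gene[i] != '\0'; i++) {
        size_t len = match_here(gene + i, pattern);
        if (len == 0)
            continue;
        if (count < max_hits) {
            starts[count] = i + 1;      /* 1-based start BP */
            ends[count] = i + len;      /* inclusive end BP */
        }
        count++;
    }
    return count;
}
```

Calling find_class_matches("TTACCTGCCT", "[AG]CCT", starts, ends, 8)
would report two matches, at 3-6 and 7-10. This tries every start
position in turn, which is simple but not the fastest approach for 600k
probes.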
          
My MATLAB code will output the multiple matches between a probe and a
gene like so:

match_start(<gene#>,<probe#>)

     [3xDouble]

Which can be read by:
match_start{<gene#>,<probe#>}

     137   267   802

And the length of the probe added to the first array yields the ending
BP:
match_end{<gene#>,<probe#>}
     143   273   808

That's fine and dandy, but it's too slow to crunch the data.

The C code I have seems pretty fast: it can compare the same 8000 probes
against one gene in 0.4 seconds, where MATLAB takes 16 seconds to do it.
But it does not report multiple matches.
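
A minimal sketch of how multiple matches might be collected in C,
assuming the gene and the probe are NUL-terminated strings already in
memory and the probe is a plain literal sequence. The function name
find_all_matches and its parameters are made up for illustration and are
not from the existing C code. The idea is to call strstr again starting
one base past each hit, so overlapping matches are found too.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Store the 1-based start position of every occurrence of probe in
 * gene, including overlapping ones. At most max_hits positions are
 * written to starts; the return value is the total number found.
 */
size_t find_all_matches(const char *gene, const char *probe,
                        size_t *starts, size_t max_hits)
{
    assert(gene != NULL && probe != NULL);
    size_t count = 0;

    if (probe[0] == '\0')
        return 0;                   /* an empty probe is not searched for */

    const char *hit = strstr(gene, probe);
    while (hit != NULL) {
        if (count < max_hits)
            starts[count] = (size_t)(hit - gene) + 1;   /* 1-based BP */
        count++;
        hit = strstr(hit + 1, probe);   /* resume one base past this hit */
    }
    return count;
}
```

For example, find_all_matches("TTAAGGCCTTAAGGCC", "AAGGCC", starts, 8)
would return 2 with starts holding 3 and 11. With an inclusive end, as
in the table earlier, each end position is start + strlen(probe) - 1.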

> This looks like a huge operation.

Yes, it sure is. The gene sequence files are in FASTA format, for which
there are some MATLAB import scripts. (FASTA is still plaintext, and
the files are easily grepped and read into other programs.) I would
have stuck with MATLAB, as it was pretty quick and easy to get running
and it works, but it runs far too slowly and uses too much RAM to be very
useful. That is why I was going to try to perform the operations using
something other than MATLAB if at all possible. But my knowledge of
the different languages and utilities is not complete, and as such I ran
into a little stumbling block and had to ask for help.

So I suppose using Perl would be what I want to do to get my output? I
can try to hack my way through some Perl if need be. I just need to get
my output and get it done in a reasonable amount of time and with a
reasonable amount of RAM: something less than weeks and something less
than four gigs. HDD space is no problem, as this machine has a
half-terabyte RAID 5 in it.

Jack

