MLUG: Re: [MLUG - DISCUSSION] statistical inference
On 6/5/06, Stephen Montgomery-Smith <EMAIL:PROTECTED> wrote:
Jonathan King wrote:
> On 6/5/06, Stephen Montgomery-Smith <EMAIL:PROTECTED> wrote:
>>
>> My sense is that statistics as a field doesn't have the strong
>> foundations that, say, math or physics has.  I think this field needs a
>> "Newton" to come along and sort it all out.  But I also think that the
>> time is ripe for this to happen, just like Einstein's theories were ripe
>> for their time.
>
> I'm not sure the analogy is exact.  In one sense, statistics did
> already have a Newton: Bayes Theorem is about as amazing and basic a
> thing as you're ever likely to get in almost any field.  But what I
> think the Reverend Bayes couldn't do (and arguably shouldn't have
> done) is tell us what our priors should be.  There is a lot of work
> being done on establishing UNinformative priors that everybody could
> agree on, but which allow statistical inferences to be made with some
> reasonable amount of efficiency.  I know why this is being done, but I
> don't see it as the big problem.

I don't see Bayes Theorem as the answer.  I have tried some of these
uninformative priors in my species counting problem, and they fail
dramatically (they always predict that the number of species you haven't
seen is zero).

Wow; that does seem to be a problem. Are you insisting that your priors be proper or something? You seriously might consider talking to Jeff Rouder about this; he and his statistics pals ran up against some viciously nasty weirdness trying to get some of his stuff to work.

Quite likely I am going to put a lot of thought into this problem next
year (now I have other projects), but one possibility I am considering
is that the Kolmogorov laws of probability don't always apply.

You frighten me. :-)

>> The problem that has driven my thinking is this one.  Suppose you go out
>> and capture 1000 birds.  You observe 16 different species, but 5 of them
>> you only observe once.  Try to give a lower estimate on how many species
>> you didn't observe.  Note that 0 is quite unlikely, because 5 of the
>> species are quite likely very sparse,
>
> ....or are very good at eluding capture.  And I mention this not just
> to be cute, but because this (the notion that species don't differ in
> their capturability) is the kind of convenience assumption that you
> pretty much have to make at first, but which is exactly what will mess
> with you later.

But I am considering an idealized "thought experiment" so I don't need
to consider all of the practical problems.  And even this idealized
thought experiment shows that Bayes Theorem just doesn't cut it.

Again, that is very surprising to me.

>> This problem is obviously important to ecology and is well studied:
>> http://viceroy.eeb.uconn.edu/EstimateS.  Chao's work on this is quite
>> brilliant, but is essentially ad-hoc.
>
> Wow, that seems pretty cool.  I haven't read Chao, but is the part you
> have a problem with the assumption that we should use the
> frequencies of rare, shared species to estimate the correction for
> shared unseen species?

Yes.  Her lower estimate of the number of unobserved species is
n1^2/(2 n2), where nk is the number of species observed exactly k times.
Note that if n2=0 her estimate is infinity, but if you think about it
that is not actually such a bad estimate (i.e. there are huge numbers of
rare species and you have no reasonable way of guessing how many there are).

OK.
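
For readers following along, here is a minimal sketch of that lower estimate in Python, using the form in which the Chao1 estimator is usually published, n1^2/(2 n2). The function name and the tally are hypothetical: the tally matches the bird example (1000 birds, 16 species, 5 seen once), but the count of species seen twice (2) and the counts for the common species are assumptions made up for illustration.

```python
import math
from collections import Counter


def chao1_unseen(counts):
    """Chao-style lower estimate of the number of unobserved species.

    counts: one entry per observed species, giving how many
    individuals of that species were captured.
    """
    freq = Counter(counts)           # freq[k] = number of species seen exactly k times
    n1, n2 = freq[1], freq[2]
    if n1 == 0:
        return 0.0                   # no singletons: no evidence of unseen species
    if n2 == 0:
        return math.inf              # the "infinity" case discussed in the thread
    return n1 * n1 / (2 * n2)


# Hypothetical tally: 5 singletons, 2 doubletons (assumed), 9 commoner species.
counts = [1] * 5 + [2] * 2 + [3, 8, 20, 40, 60, 100, 160, 250, 350]

print(len(counts), sum(counts))      # 16 species, 1000 birds
print(chao1_unseen(counts))          # 5^2 / (2 * 2) = 6.25
```

With these assumed numbers the estimate is about 6 unseen species; with no doubletons at all it would be infinite, as described above.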

>> I tried a Bayesian approach, and
>> its dependence upon priors is tremendous, and in any case it always seems
>> to estimate too high.
>
> You mean you ran it on artificial data sets and the bias high is
> unacceptably large?

Yes.

But there should be a method that works even for artificial data sets.

Of course.

Indeed I have a suggestion for a formula which my gut tells me is
correct.  But I cannot prove the formula by any argument.  Nor can I
conduct trials, because the computations are prohibitively
intensive - I plan to find good approximations to this formula next year.

Let me also add that Chao's formula works very well on these artificial
data sets.

Well, I guess being able to calculate an answer with existing hardware is something of an advantage. :-)
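
A rough sketch of what a trial on an artificial data set could look like, for anyone who wants to try it. Everything here is an assumption for illustration: the Zipf-like abundance model, the community size of 200 species, the function name, and the seed are not from the thread, and the estimator is the usual Chao1 form n1^2/(2 n2).

```python
import random
from collections import Counter


def simulate_survey(true_species=200, sample_size=1000, seed=0):
    """Draw one artificial sample and compare the estimate to the truth.

    Returns (observed, truly_unseen, estimated_unseen).
    """
    rng = random.Random(seed)
    # Assumed abundance model: species i has relative weight 1/(i+1).
    weights = [1.0 / (i + 1) for i in range(true_species)]
    draws = rng.choices(range(true_species), weights=weights, k=sample_size)

    per_species = Counter(draws)             # individuals captured per species
    freq = Counter(per_species.values())     # freq[k] = species seen exactly k times
    n1, n2 = freq[1], freq[2]

    observed = len(per_species)
    if n1 == 0:
        estimated_unseen = 0.0
    elif n2 == 0:
        estimated_unseen = float("inf")
    else:
        estimated_unseen = n1 * n1 / (2 * n2)
    return observed, true_species - observed, estimated_unseen


observed, truly_unseen, estimated_unseen = simulate_survey()
print(observed, truly_unseen, estimated_unseen)
```

Repeating this over many seeds and abundance models would show how often the estimate stays below the true number of unseen species, which is what a lower estimate is supposed to do.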

>> My thinking is that if you really study this model problem, then you
>> have some hope of getting closer to what the foundations of statistics
>> really should be.  It is more difficult than the simple, model problems,
>> but much easier than most real life problems (e.g. microarrays).
>
> So how simple a problem do you think we should go to?  I think a
> (slightly) simpler problem is to use catch/re-catch probabilities to
> estimate population sizes of a given species in a closed system like
> fish in a lake.

My guess is that this problem's complexity is about the same.

You would think so, but isn't the catch-recatch method to estimate population size a solved problem?
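
For reference, the textbook treatment of catch/re-catch is the Lincoln-Petersen estimator, with Chapman's variant as a common small-sample correction. The thread does not say which estimator is meant, so this choice is an assumption, and the function names and fish counts below are hypothetical.

```python
def lincoln_petersen(marked, caught, recaptured):
    """Estimate population size N from a two-visit mark-recapture study.

    marked:     animals caught, marked and released on the first visit
    caught:     animals caught on the second visit
    recaptured: animals in the second catch that carry a mark
    """
    if recaptured == 0:
        raise ValueError("no recaptures: the estimate is unbounded")
    return marked * caught / recaptured


def chapman(marked, caught, recaptured):
    """Chapman's variant, which stays finite when recaptured == 0."""
    return (marked + 1) * (caught + 1) / (recaptured + 1) - 1


# Hypothetical lake: mark 100 fish, later catch 80, of which 20 are marked.
print(lincoln_petersen(100, 80, 20))   # 100 * 80 / 20 = 400.0
print(chapman(100, 80, 20))
```

Both rest on the closed-system and equal-capturability assumptions raised earlier in the thread, so "solved" holds only as far as those assumptions do.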

>> >> I do admit that I am not an expert in statistics, and my guess is that
>> >> you and Jon know way more than I do.
>
> No way, Stephen. :-)  I know nothing.  I only know a lot of the stuff
> that can go wrong.

You and Mike have way more experience in the field.  I used to avoid
the stats classes at college level (in part because it seemed so hocus
pocus to me).

It is kind of.

I have to say that when I am arguing about stats with
someone I am trying to figure out why they do their test, using my
knowledge about probability.

Ooh; bad move. They used the test they used because everybody else uses it. :-)

I think this gives the other person the
sense that I am making it up on the spot, and in some sense this is true.

I think most people would find this awesomely intimidating, since they don't usually consider using anything else than what they've always used.

This was a discussion on talk.origins.  It is a very difficult place to
maintain a reasonable discussion, because if you don't toe their line,
you get a lot of heckling.  Quite likely if I had pursued the issue he
would have eventually got it, but I got to the point where it wasn't
worth the pain.  (And I got a bit of a sense that he wasn't a real expert.)

Real experts tend not to hang around Usenet for very long. It's just not likely to be worth their time.

[snip]

I thought about microarrays, at Mike's and your suggestion, because my
initial thinking was that this was just high-dimensional Gaussian processes,
which was the subject of my Ph.D. thesis.  But reading the literature I
got the sense that maybe it was more than that.

But also the microarray literature is very hard to understand by an
outsider.  I recently spoke to a friend who has developed a really nice
clustering algorithm, and I asked him if he had thought about
microarrays.  He told me that he had the same difficulties that I had.

Anyway, I generally felt that I didn't have any useful expertise that
those already in the field didn't already have.  So I don't plan to work
directly in this field right now.

Well, this is not unexpected. Schena invented these things like 10 years ago, and it takes about that long for any process like this to be well-enough understood even by a sufficient number of hands-on experts who could possibly communicate anything to somebody outside of biology and get something non-trivial back. To be completely honest, when I was messing around with (other people's) microarray data last semester, I was amazed at how, uh, crappy it was, quality-wise. (Yeah, I only reached that conclusion after I'd hyped it to you last fall...sorry.)

The good news is that the whole process is getting better and cheaper,
so I think we'll see better things coming out of the field due almost
solely to the fact that the expression levels will at least be accurate
to within 20%.  Then maybe five years from then (which could be now, I
guess), there will be something to do with it that won't just be error
and failure mode analysis.

jking
