Jonathan King wrote:
On 6/5/06, Stephen Montgomery-Smith <EMAIL:PROTECTED> wrote:
Mike Miller wrote:
>
> Sort of. It really depends on the situation. I'm not always sure what
> drives arguments about "Bayesian" and "frequentist" perspectives on
> inference, but I think a lot of it is due to the fact that it is a
> difficult topic and a statistician can get by professionally without
> ever really coming to grips with the core philosophical problems.
My sense is that statistics as a field doesn't have the strong
foundations that, say, math or physics has. I think this field needs a
"Newton" to come along and sort it all out. But I also think that the
time is ripe for this to happen, just like Einstein's theories were ripe
for their time.
I'm not sure the analogy is exact. In one sense, statistics did
already have a Newton: Bayes Theorem is about as amazing and basic a
thing as you're ever likely to get in almost any field. But what I
think the Reverend Bayes couldn't do (and arguably shouldn't have
done) is tell us what our priors should be. There is a lot of work
being done on establishing UNinformative priors that everybody could
agree on, but which allow statistical inferences to be made with some
reasonable amount of efficiency. I know why this is being done, but I
don't see it as the big problem.
I don't see Bayes Theorem as the answer. I have tried some of these
uninformative priors in my species counting problem, and they fail
dramatically (they always predict that the number of species you haven't
seen is zero).
Quite likely I am going to put a lot of thought into this problem next
year (now I have other projects), but one possibility I am considering
is that the Kolmogorov laws of probability don't always apply.
The problem that has driven my thinking is this one. Suppose you go out
and capture 1000 birds. You observe 16 different species, but 5 of them
you only observe once. Try to give a lower estimate on how many species
you didn't observe. Note that 0 is quite unlikely, because 5 of the
species are quite likely very sparse,
....or are very good at eluding capture. And I mention this not just
to be cute, but because this (the notion that species don't differ in
their capturability) is the kind of convenience assumption that you
pretty much have to make at first, but which is exactly what will mess
with you later.
But I am considering an idealized "thought experiment" so I don't need
to consider all of the practical problems. And even this idealized
thought experiment shows that Bayes Theorem just doesn't cut it.
and if there were only 5 of these
sparse species, quite likely you wouldn't have observed them all.
Yes.
This problem is obviously important to ecology and is well studied:
http://viceroy.eeb.uconn.edu/EstimateS. Chao's work on this is quite
brilliant, but is essentially ad-hoc.
Wow, that seems pretty cool. I haven't read Chao, but is the part you
have a problem with the assumption that we should use the
frequencies of rare, shared species to estimate the correction for
shared unseen species?
Yes. Her lower estimate of the number of unobserved species is
n1^2/(2 n2), where nk is the number of species observed k times. Note that if
n2=0 her estimate is infinity, but if you think about it that is not
actually such a bad estimate (i.e. there are huge numbers of rare
species and you have no reasonable way of guessing how many there are).
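For concreteness, here is a minimal sketch of that lower bound, assuming the form usually cited as Chao1: unseen >= n1^2 / (2 n2), where n1 and n2 are the numbers of species seen exactly once and exactly twice. The function name chao1_unseen is made up for illustration, and the n2 = 2 in the usage note is a hypothetical value, since the bird example only gives n1 = 5.

```python
import math


def chao1_unseen(n1: int, n2: int) -> float:
    """Lower-bound estimate of the number of unobserved species.

    n1: number of species observed exactly once (singletons)
    n2: number of species observed exactly twice (doubletons)
    """
    if n1 < 0 or n2 < 0:
        raise ValueError("counts must be non-negative")
    if n2 == 0:
        # No doubletons: the bound is unbounded whenever any singleton
        # exists, matching the "infinity" case discussed above.
        return math.inf if n1 > 0 else 0.0
    return n1 * n1 / (2 * n2)
```

With the 5 singletons from the bird example and a hypothetical 2 doubletons, chao1_unseen(5, 2) gives 6.25, so the estimated total would be at least 16 + 6.25 species.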
I tried a Bayesian approach, and
its dependence upon priors is tremendous, and in any case it always seems
to estimate too high.
You mean you ran it on artificial data sets and the bias high is
unacceptably large?
Yes.
But there should be a method that works even for artificial data sets.
Indeed I have a suggestion for a formula which my gut tells me is
correct. But I cannot prove the formula by any argument. Nor can I
conduct trials because the computations are prohibitively
intensive. I plan to find good approximations to this formula next year.
Let me also add that Chao's formula works very well on these artificial
data sets.
My thinking is that if you really study this model problem, then you
have some hope of getting closer to what the foundations of statistics
really should be. It is more difficult than the simple, model problems,
but much easier than most real life problems (e.g. microarrays).
So how simple a problem do you think we should go to? I think a
(slightly) simpler problem is to use catch/re-catch probabilities to
estimate population sizes of a given species in a closed system like
fish in a lake.
My guess is that this problem's complexity is about the same.
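As a point of comparison, the simplest textbook version of the catch/re-catch idea is the Lincoln-Petersen estimate. This is my choice of illustration, not something proposed in this thread. It assumes a closed population and that every animal is equally catchable, which is exactly the convenience assumption questioned earlier. The function and parameter names are hypothetical.

```python
def lincoln_petersen(marked: int, caught: int, recaptured: int) -> float:
    """Estimate population size from one mark-and-recapture round.

    marked:     animals caught, tagged and released on the first visit
    caught:     animals caught on the second visit
    recaptured: animals in the second catch that already carry a tag
    """
    if marked < 0 or caught < 0 or recaptured < 0:
        raise ValueError("counts must be non-negative")
    if recaptured > min(marked, caught):
        raise ValueError("recaptured cannot exceed either catch")
    if recaptured == 0:
        # No tagged animals seen again: the data give no upper limit.
        return float("inf")
    # The tagged fraction in the second catch estimates marked / N.
    return marked * caught / recaptured
```

For example, tagging 100 fish and then finding 20 tagged fish in a second catch of 80 gives lincoln_petersen(100, 80, 20) = 400.0. As with n2 = 0 in the species problem, zero recaptures leaves the estimate unbounded.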
>> I do admit that I am not an expert in statistics, and my guess is that
>> you and Jon know way more than I do.
No way, Stephen. :-) I know nothing. I only know a lot of the stuff
that can go wrong.
You and Mike have way more experience in the field. I used to avoid
stats classes at college level (in part because it seemed so hocus
pocus to me). I have to say that when I am arguing about stats with
someone I am trying to figure out why they do their test, using my
knowledge about probability. I think this gives the other person the
sense that I am making it up on the spot, and in some sense this is true.
I have a sense that this person was in denial about the problems I
brought up. He told me emphatically that there was no reasonable reason
to suppose that the changes in DNA were related to each other - but then
in another email told me that changes in DNA are found NOT to be a
Poisson process because the variance is about 2 times too big. These
two statements contradict each other.
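The "variance is about 2 times too big" remark can be made concrete with the variance-to-mean ratio (index of dispersion), which is close to 1 for counts from a Poisson process. The sketch below is only an illustration: the counts are made-up numbers, not DNA data, and dispersion_index is a hypothetical helper name.

```python
from statistics import mean, variance


def dispersion_index(counts: list[int]) -> float:
    """Sample variance divided by sample mean for a list of event counts.

    Values near 1 are consistent with a Poisson process; values well
    above 1 (overdispersion) suggest the events are not independent
    or that the rate varies between observations.
    """
    if len(counts) < 2:
        raise ValueError("need at least two counts")
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; ratio is undefined")
    # statistics.variance is the unbiased (n - 1) sample variance.
    return variance(counts) / m
```

For the made-up counts [1, 1, 3, 3, 7] the mean is 3 and the sample variance is 6, so the ratio is 2.0. A ratio like that points to clustered or rate-varying events rather than independent ones.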
Ouch. I may misunderstand what he was saying to you, and he may have
misunderstood what you were saying to him, but, yeah, that's not so
nice.
This was a discussion on talk.origins. It is a very difficult place to
maintain a reasonable discussion, because if you don't toe their line,
you get a lot of heckling. Quite likely if I had pursued the issue he
would have eventually got it, but I got to the point where it wasn't
worth the pain. (And I got a bit of a sense that he wasn't a real expert.)
Current models of statistics seem unable to deal with events with "large
tails in their probability distribution functions" - that is, an event
that is unlikely, but when it does happen, it makes a huge difference.
That's certainly a problem, although I'm not sure I see it as the key
problem.
Quite likely you are correct here.
(Incidentally, I think that large tailed events are one of the problems
with microarray analysis - an example of a large tailed event is that
one of the microarray chips got a scratch. And from my brief reading of
the microarray literature, this is a real consideration.)
Microarrays are incredibly fascinating things. But not to distract
you, I think you should also be really interested in protein
expression data. So there's a new technique called DIGE, short for
Difference in Gel Electrophoresis. You label one protein sample (the
control) with one fluorescent dye, the experimental condition using a
second dye, the pooled control with a third dye, and run them on the
same 2-D gel, which you scan 3 times for the different dyes.
Beautiful stuff, but it has many of the same interpretational issues
that microarrays do.
I thought about microarrays, at Mike's and your suggestion, because my
initial thinking was that this was just high dimensional Gaussian processes,
which was the subject of my Ph.D. thesis. But reading the literature I
got the sense that maybe it was more than that.
But also the microarray literature is very hard to understand by an
outsider. I recently spoke to a friend who has developed a really nice
clustering algorithm, and I asked him if he had thought about
microarrays. He told me that he had the same difficulties that I had.
Anyway, I generally felt that I didn't have any useful expertise that
those already in the field didn't already have. So I don't plan to work
directly in this field right now.
Stephen
_______________________________________________
discussion mailing list
EMAIL:PROTECTED
http://mlug.missouri.edu/mailman/listinfo/discussion