MLUG: Re: [MLUG - DISCUSSION] statistical inference

On 6/5/06, Stephen Montgomery-Smith <EMAIL:PROTECTED> wrote:
Mike Miller wrote:
>
> Sort of.  It really depends on the situation.  I'm not always sure what
> drives arguments about "Bayesian" and "frequentist" perspectives on
> inference, but I think a lot of it is due to the fact that it is a
> difficult topic and a statistician can get by professionally without
> ever really coming to grips with the core philosophical problems.

My sense is that statistics as a field doesn't have the strong
foundations that, say, math or physics has.  I think this field needs a
"Newton" to come along and sort it all out.  But I also think that the
time is ripe for this to happen, just like Einstein's theories were ripe
for their time.

I'm not sure the analogy is exact. In one sense, statistics did already have a Newton: Bayes' Theorem is about as amazing and basic a thing as you're ever likely to get in almost any field. But what I think the Reverend Bayes couldn't do (and arguably shouldn't have done) is tell us what our priors should be. There is a lot of work being done on establishing UNinformative priors that everybody could agree on, but which still allow statistical inferences to be made with some reasonable amount of efficiency. I know why this is being done, but I don't see it as the big problem.

Now, one issue is this one:

> Sort of, but not quite.  What Edwards points out is that the likelihood
> contains all the information about the parameters that can be found in
> the data.  In a Bayesian analysis, one uses the likelihood along with a
> "prior" which is a sort of weighting scheme based on, well, based on
> whatever the hell you want it to be based on -- and that's the problem
> with Bayesian analysis, but that doesn't mean it isn't a good thing.

I'll have to look at it.  But a problem with his scenario is that he is
deciding between two possibilities p(1/2) and p(1/4).  But really you
would be deciding between all possible p between 0 and 1.  Then the
implicit assumption that all the priors are the same plays a much bigger
role than you might think.
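
To make the point about priors concrete, here is a minimal sketch. The data (1 success in 4 trials) and all function names are made up for illustration; nothing here comes from Edwards. It compares the two-hypothesis case with letting p range over the whole interval under two different "uninformative" priors, a flat one and the Jeffreys Beta(1/2, 1/2) prior.

```python
from math import comb


def binom_lik(p, k, n):
    """Binomial likelihood of k successes in n trials at success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_point_posterior(k, n, p_a=0.5, p_b=0.25, prior_a=0.5):
    """Posterior probability of p_a when only p_a and p_b are allowed."""
    wa = prior_a * binom_lik(p_a, k, n)
    wb = (1 - prior_a) * binom_lik(p_b, k, n)
    return wa / (wa + wb)


def posterior_mean(k, n, prior, steps=20_000):
    """Posterior mean of p over (0, 1), approximated on a midpoint grid."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    weights = [prior(p) * binom_lik(p, k, n) for p in grid]
    return sum(p * w for p, w in zip(grid, weights)) / sum(weights)


def flat(p):
    """Uniform prior on (0, 1)."""
    return 1.0


def jeffreys(p):
    """Jeffreys prior for a binomial rate, proportional to Beta(1/2, 1/2)."""
    return (p * (1 - p)) ** -0.5


k, n = 1, 4  # hypothetical data, chosen small so the prior matters
print("P(p = 1/2 | data), two hypotheses only:", two_point_posterior(k, n))
print("posterior mean, flat prior:    ", posterior_mean(k, n, flat))
print("posterior mean, Jeffreys prior:", posterior_mean(k, n, jeffreys))
```

The two continuous answers have closed forms, (k+1)/(n+2) and (k+1/2)/(n+1), so with this little data the choice between two priors that both claim to be uninformative already moves the estimate from 1/3 to 3/10.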

The real problem with Bayes' Theorem is that it doesn't require you to have all of the possible hypotheses stated in advance, so you can generate posterior probabilities that don't mean much when the "true" hypothesis isn't in the analysis. This is hardly unique to the Bayesian approach, but it's a real problem for people who want to attach great significance to the actual posteriors.
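
A small sketch of what I mean, with invented numbers: suppose the data are 90 successes in 100 trials, but the only hypotheses on the table are p = 1/2 and p = 1/4. The function names are just for illustration.

```python
from math import comb


def binom_lik(p, k, n):
    """Binomial likelihood of k successes in n trials at success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_of_first(k, n, p_first, p_second, prior_first=0.5):
    """Posterior of p_first when p_first and p_second are the only options."""
    w1 = prior_first * binom_lik(p_first, k, n)
    w2 = (1 - prior_first) * binom_lik(p_second, k, n)
    return w1 / (w1 + w2)


k, n = 90, 100  # hypothetical data that neither hypothesis explains well

# Among the two stated hypotheses, p = 1/2 wins overwhelmingly...
post_half = posterior_of_first(k, n, 0.5, 0.25)

# ...yet it fits far worse than a rate nobody put in the analysis.
relative_fit = binom_lik(0.5, k, n) / binom_lik(k / n, k, n)

print("posterior for p = 1/2:", post_half)
print("likelihood of p = 1/2 relative to p = 0.9:", relative_fit)
```

The posterior for p = 1/2 comes out essentially 1, which says only that it beats p = 1/4, not that it is a good description of the data.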

The problem that has driven my thinking is this one.  Suppose you go out
and capture 1000 birds.  You observe 16 different species, but 5 of them
you only observe once.  Try to give a lower estimate on how many species
you didn't observe.  Note that 0 is quite unlikely, because 5 of the
species are quite likely very sparse,

....or are very good at eluding capture. And I mention this not just to be cute, but because this (the notion that species don't differ in their capturability) is the kind of convenience assumption that you pretty much have to make at first, but which is exactly what will mess with you later.

and if there were only 5 of these
sparse species, quite likely you wouldn't have observed them all.

Yes.

This problem is obviously important to ecology and is well studied:
http://viceroy.eeb.uconn.edu/EstimateS.  Chao's work on this is quite
brilliant, but is essentially ad-hoc.

Wow, that seems pretty cool. I haven't read Chao, but is the part you have a problem with the assumption that we should use the frequencies of rare, shared species to estimate the correction for shared unseen species?
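
For anyone following along, here is a rough sketch of one common form of Chao's lower-bound estimator (usually called Chao1), which uses the number of species seen exactly once (f1) and exactly twice (f2). I am assuming this is the estimator in question; Stephen's post gives f1 = 5 but not f2, so the f2 = 2 below is purely hypothetical.

```python
def chao1(s_obs, f1, f2):
    """Chao1 lower-bound estimate of total species richness.

    s_obs: number of species observed
    f1:    number of species observed exactly once (singletons)
    f2:    number of species observed exactly twice (doubletons)
    """
    if f2 > 0:
        # Classic form: S_obs + f1^2 / (2 * f2)
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used when there are no doubletons
    return s_obs + f1 * (f1 - 1) / 2


# 16 species and 5 singletons are from the bird example; f2 = 2 is assumed.
estimate = chao1(16, 5, 2)
print("estimated total species:", estimate)
print("estimated unseen species:", estimate - 16)
```

The estimate depends heavily on the doubleton count, which is part of why the singleton/doubleton ratio carries so much weight in this approach.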

I tried a Bayesian approach, and
its dependence upon priors is tremendous, and in any case it always seems
to estimate too high.

You mean you ran it on artificial data sets and the upward bias is unacceptably large?

My thinking is that if you really study this model problem, then you
have some hope of getting closer to what the foundations of statistics
really should be.  It is more difficult than the simple, model problems,
but much easier than most real life problems (e.g. microarrays).

So how simple a problem do you think we should go to? I think a (slightly) simpler problem is to use catch/re-catch probabilities to estimate population sizes of a given species in a closed system like fish in a lake.
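
The textbook version of the fish-in-a-lake problem can be sketched like this. The estimators are the standard Lincoln-Petersen one and Chapman's small-sample correction; the counts are invented, and both assume a closed population where every fish is equally catchable, which is exactly the sort of convenience assumption I was complaining about above.

```python
def lincoln_petersen(marked, caught, recaptured):
    """Lincoln-Petersen estimate: N = M * C / R."""
    if recaptured == 0:
        raise ValueError("no recaptures, so the estimate is undefined")
    return marked * caught / recaptured


def chapman(marked, caught, recaptured):
    """Chapman's variant, defined even when there are no recaptures."""
    return (marked + 1) * (caught + 1) / (recaptured + 1) - 1


# Hypothetical survey: mark 100 fish, later catch 80, of which 20 are marked.
M, C, R = 100, 80, 20
print("Lincoln-Petersen:", lincoln_petersen(M, C, R))
print("Chapman:         ", chapman(M, C, R))
```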

>> I do admit that I am not an expert in statistics, and my guess is that
>> you and Jon know way more than I do.

No way, Stephen. :-) I know nothing. I only know a lot of the stuff that can go wrong.

>>  On the other hand I do think I
>> know a great deal about probability.  I recently saw an account of how
>> mitochondrial DNA could be used as evidence that all the different
>> types of ape (including the human being) must have a non-trivial tree
>> of ancestry.  Not being that familiar with statistics, I thought about
>> why the test he chose (a chi-squared test) was appropriate,

Heh. Chi-squared is often the test you use when you don't think you have anything better to do. Your concerns are quite valid, but the best that can be said is that people have simulated all kinds of situations where the test assumptions are not met in various ways, and chi-squared (and also F) turns out to be decently robust to many (but not all) of them.

>> could see that it had many underlying assumptions, not all of which
>> were reasonable - (in his case that evolutionary pressures might not
>> cause a change in the DNA in one place to speed up changes in DNA in
>> other positions).

Yup, independence is probably the single most abused assumption out there. And the reason for that is clear: dependence is so much more difficult to deal with.

>> He computed an absurdly small p value, which meant
>> that he could reject his null hypothesis.  But it made me question the
>> value of these tests in being able to produce absurdly small p values,
>> because the assumptions he made, which were reasonable, nevertheless
>> could be violated with a probability which, while small, was way
>> bigger than the absurdly small p value he obtained.

Absolutely. And this is one reason why I don't put much stock in raw p values per se. Rejecting a null hypothesis is nice, but I'm usually interested in which model is a better explanation for the data, not just establishing that some truly impoverished model (which is essentially a straw man in many cases) does not fit the facts.
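
Stephen's argument can be written as a one-line bound. This is a sketch under a worst-case assumption of my own: if the test's modelling assumptions fail (with probability eps), an extreme result is treated as certain. The function name and the numbers are hypothetical.

```python
def extreme_result_bound(p_value, eps):
    """Upper bound on the chance of a result this extreme under the null.

    By total probability:
      P(extreme) = P(extreme | assumptions hold) * (1 - eps)
                 + P(extreme | assumptions fail) * eps
                <= p_value * (1 - eps) + 1 * eps
    """
    return p_value * (1 - eps) + eps


# An "absurdly small" p value next to a modest doubt about the assumptions.
bound = extreme_result_bound(p_value=1e-30, eps=1e-3)
print("bound:", bound)
```

Once eps is much larger than the p value, the bound is essentially eps, so the extra leading zeros in the p value buy nothing.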

So Mike says:

> An awful lot of work in genetics has been done to show that violations
> of certain assumptions are not going to ruin a statistical test.  Most
> of statistics is about approximation and extracting meaning and
> direction from data -- random data that includes all sorts of errors.

Yes, but there are data errors and model specification errors. You're right, in many cases it turns out not to be absolutely killer that measurement errors aren't (say) normally distributed, but it is (to me) very different to assume something like independence when it isn't true.

I have a sense that this person was in denial about the problems I
brought up.  He told me emphatically that there was no reasonable reason
to suppose that the changes in DNA were related to each other - but then
in another email told me that changes in DNA are found NOT to be a
Poisson process because the variance is about 2 times too big.  These
two statements contradict each other.
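
A quick way to see why a doubled variance is trouble for the independence claim is the variance-to-mean ratio, which is 1 for a Poisson process. The simulation below is only a sketch: it assumes one particular kind of dependence (every change drags a second one along with it), which is my invention and not anything claimed about real DNA, and other mechanisms such as rates that vary between sites can also inflate the variance.

```python
import math
import random
from statistics import mean, variance


def poisson_draw(lam, rng):
    """One Poisson(lam) draw using Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 for Poisson counts."""
    return variance(counts) / mean(counts)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Independent changes: counts per region are Poisson with mean 4.
independent = [poisson_draw(4.0, rng) for _ in range(20_000)]

# Linked changes (assumed mechanism): changes arrive in pairs, so the count
# is twice a Poisson with mean 2. Same mean of 4, but twice the variance.
paired = [2 * poisson_draw(2.0, rng) for _ in range(20_000)]

print("independent:", dispersion_index(independent))
print("paired:     ", dispersion_index(paired))
```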

Ouch. I may misunderstand what he was saying to you, and he may have misunderstood what you were saying to him, but, yeah, that's not so nice. Another example that would upset Stephen is the so-called E-values you see appended to BLAST results. Speaking of your really tiny but meaningless probability values...but in this case, people are fairly savvy about the problem, and it usually doesn't do too much harm since you can back up a claim that genes X and Y are similar, at least in function, by doing the biochemistry.

> But, of course, he wasn't testing that.  I'm sure he was assuming that
> "evolution is right."  My guess is that he didn't even mention
> creationism, probably because he wouldn't have any reason at this point,
> with all of his knowledge, even to consider the possibility that he
> should evoke a supernatural cause for anything he was studying.

This discussion took place in the context of evolution versus creationism.

Of course, the one thing about this particular argument is that you (Stephen) can say "fine, that's down to 1% based on the mDNA evidence" and then people can show you dozens of other probability calculations that have similar issues but are pretty much independent of this one, and at some point, you step back and say, "Wow, I'm not sure how many leading zeros I have in that probability, but it's certainly strong enough to teach in biology classes as the best working model we have, and a pretty strong model in any case."

Current models of statistics seem unable to deal with events with "large
tails in their probability distribution functions" - that is, an event
that is unlikely, but when it does happen, it makes a huge difference.

That's certainly a problem, although I'm not sure I see it as the key problem.

(Incidentally, I think that large tailed events are one of the problems
with microarray analysis - an example of a large tailed event is that
one of the microarray chips got a scratch.  And from my brief reading of
the microarray literature, this is a real consideration.)

Microarrays are incredibly fascinating things. But not to distract you, I think you should also be really interested in protein expression data. So there's a new technique called DIGE, short for Difference in Gel Electrophoresis. You label one protein sample (the control) with one fluorescent dye, the experimental condition using a second dye, the pooled control with a third dye, and run them on the same 2-D gel, which you scan 3 times for the different dyes. Beautiful stuff, but it has many of the same interpretational issues that microarrays do.

jking
