MLUG: Re: [MLUG - DISCUSSION] statistical inference

Jonathan King wrote:
> On 6/5/06, Stephen Montgomery-Smith <EMAIL:PROTECTED> wrote:
>>
>> My sense is that statistics as a field doesn't have the strong
>> foundations that, say, math or physics has. I think this field needs a
>> "Newton" to come along and sort it all out. But I also think that the
>> time is ripe for this to happen, just like Einstein's theories were ripe
>> for their time.
>
> I'm not sure the analogy is exact. In one sense, statistics did
> already have a Newton: Bayes Theorem is about as amazing and basic a
> thing as you're ever likely to get in almost any field. But what I
> think the Reverend Bayes couldn't do (and arguably shouldn't have
> done) is tell us what our priors should be. There is a lot of work
> being done on establishing UNinformative priors that everybody could
> agree on, but which allow statistical inferences to be made with some
> reasonable amount of efficiency. I know why this is being done, but I
> don't see it as the big problem.


>> I don't see Bayes Theorem as the answer.  I have tried some of these
>> uninformative priors in my species counting problem, and they fail
>> dramatically (they always predict that the number of species you haven't
>> seen is zero).
>
> Wow; that does seem to be a problem.  Are you insisting that your
> priors be proper or something?  You seriously might consider talking
> to Jeff Rouder about this; he and his statistics pals ran up against
> some viciously nasty weirdness trying to get some of his stuff to
> work.

I am allowing non-proper distributions, and I think that this is precisely the problem.


So consider this problem. You have a machine that delivers heads or tails. (I won't say "coin" because that gives you too much info.) The uninformed prior distribution for the parameter p, the probability that the machine will give you a head, is proportional to 1/(p(1-p)), which is definitely not proper: its integral over [0,1] diverges at both ends.

So do the following experiment - do precisely one trial. (I.e. toss the coin one time.)

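If you get a head, the posterior is proportional to p * 1/(p(1-p)) = 1/(1-p), which cannot be normalized on [0,1]. If you normalize it anyway, as a limit, all of the mass piles up at p=1, so you conclude that p=1 with certainty. Similarly, if you get a tail, you conclude that p=0 with certainty. Clearly this is not the right answer.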
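A quick numerical check of the one-trial argument, as a minimal sketch. It treats the improper prior 1/(p(1-p)) as the limit of a proper Beta(eps, eps) prior as eps goes to 0. That limiting device and the function name `beta_posterior` are assumptions made for illustration, not something from the thread. After one head the posterior is Beta(1 + eps, eps), whose mean and variance have closed forms.

```python
def beta_posterior(eps, heads, tails):
    """Posterior mean and variance of p under a Beta(eps, eps) prior.

    Assumption: Beta(eps, eps) with small eps stands in for the improper
    prior 1/(p(1-p)), which is its eps -> 0 limit.
    """
    a = eps + heads  # posterior Beta parameters after the observed tosses
    b = eps + tails
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


if __name__ == "__main__":
    # One trial, one head: watch the posterior collapse onto p = 1.
    for eps in (1.0, 0.1, 0.001, 1e-6):
        mean, var = beta_posterior(eps, heads=1, tails=0)
        print(f"eps={eps:g}  mean={mean:.6f}  variance={var:.2e}")
```

With eps = 1 (the uniform prior) the posterior mean after one head is 2/3. As eps shrinks, the mean tends to 1 and the variance to 0, which is the "p=1 with certainty" conclusion.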

Now with my species count problem, if you suppose that the number of species is n, then an individual belongs to species k with probability pk, where pk>=0 and sum pk = 1. What prior distribution should you put on these p's? The uninformed distribution is 1/(p1 p2 ... pn), which, as in the previous example, means that after normalization any species you haven't seen is certain not to exist.
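The same limiting device carries over to the species problem, as a rough sketch: replace 1/(p1 p2 ... pn) by a symmetric Dirichlet(eps, ..., eps) prior and let eps shrink. The helper name `unseen_mass` and the example counts are hypothetical. Under that prior the posterior is Dirichlet with parameters (count + eps), so the expected total probability of the species with count 0 is (number unseen * eps) / (N + n * eps), where N is the sample size.

```python
def unseen_mass(counts, n_species, eps):
    """Expected posterior probability that the next individual is an unseen species.

    counts    -- positive counts of the species actually observed
    n_species -- assumed total number of species n (seen plus unseen)
    eps       -- Dirichlet(eps, ..., eps) prior parameter; eps -> 0 is
                 the assumed stand-in for the prior 1/(p1 p2 ... pn)
    """
    n_unseen = n_species - len(counts)
    if n_unseen < 0:
        raise ValueError("n_species is smaller than the number of observed species")
    total = sum(counts) + n_species * eps  # sum of posterior Dirichlet parameters
    return n_unseen * eps / total


if __name__ == "__main__":
    counts = [5, 3, 2]  # made-up sample: 3 species seen among 10 individuals
    for eps in (1.0, 0.1, 0.001, 1e-9):
        print(f"eps={eps:g}  unseen mass={unseen_mass(counts, 10, eps):.3e}")
```

For any fixed n the unseen mass goes to 0 with eps, which matches the complaint that unseen species are predicted not to exist.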

So you are looking for potential prior distributions on this (n-1)-dimensional "tetrahedron" (mathematicians call it a simplex). If you take the obvious uniform distribution and do the calculations (which are not so easy, by the way), you get an answer that doesn't depend on how many of each species you observed, only on how many species you observed.
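One way to see this numerically, as a sketch that may not match the calculation described above. The assumptions: species labels are exchangeable, the prior on the simplex is uniform (Dirichlet(1, ..., 1)), the prior on n is flat, and n is truncated at a hypothetical `n_max`. The function names are also hypothetical. With k observed species and N individuals, the likelihood of the counts given n is (n!/(n-k)!) * ((n-1)!/(n+N-1)!) * prod(c_i!), and the last factor does not involve n.

```python
import math


def log_likelihood(counts, n):
    """Log-likelihood of the observed counts given n species, uniform prior on the simplex."""
    k = len(counts)
    total = sum(counts)
    if n < k:
        return float("-inf")  # cannot have seen more species than exist
    # Ways to assign the k observed species to n exchangeable labels: n!/(n-k)!
    log_labelings = math.lgamma(n + 1) - math.lgamma(n - k + 1)
    # Dirichlet-multinomial term: Gamma(n)/Gamma(n+N) * prod(c_i!)
    log_dm = (math.lgamma(n) - math.lgamma(n + total)
              + sum(math.lgamma(c + 1) for c in counts))
    return log_labelings + log_dm


def posterior_over_n(counts, n_max=200):
    """Normalized posterior over n = 1..n_max under a flat prior on n (assumed)."""
    logs = [log_likelihood(counts, n) for n in range(1, n_max + 1)]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]  # exp(-inf) gives 0.0
    norm = sum(weights)
    return [w / norm for w in weights]


if __name__ == "__main__":
    # Two made-up samples with the same k = 3 and N = 10 but different splits.
    skewed = posterior_over_n([8, 1, 1])
    even = posterior_over_n([4, 3, 3])
    print(max(abs(a - b) for a, b in zip(skewed, even)))  # differs only by rounding
```

The two posteriors agree because the per-species counts enter only through the factor prod(c_i!), which cancels on normalization. Only k and the sample size N remain.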

I did try other distributions on this (n-1)-simplex, but none of them really worked well, although they all did better than the uniform distribution.

(I assumed a uniform distribution on the prior of n itself, but my sense is that the prior on n will not play such a big role. It is the "curse of high dimensionality" that really shows up the inadequate nature of uninformed prior distributions.)

>> Quite likely I am going to put a lot of thought into this problem next
>> year (now I have other projects), but one possibility I am considering
>> is that the Kolmogorov laws of probability don't always apply.
>
> You frighten me. :-)

Well, I overstate it a bit. An underlying assumption is that there is a numerical value of "believability" that you can place on any real-life event. But the correct uninformed prior distribution is 0/0, or NaN (not a number), that is, undefined: you cannot place a numerical value on it. (Indeed, the very fact that the uninformed priors are not proper should immediately tell you that something is amiss.)

But then after doing some experiments, maybe you still cannot place a precise numerical value on the distribution, but it is somehow midway between NaN and a genuine numerical distribution (e.g. "probably bigger than 5 but definitely less than 6" does not tell you what the distribution is, yet is not a totally uninformative statement).

I have totally no idea how to make any of this work. But I think it is worth thinking about.

> So how simple a problem do you think we should go to?  I think a
> (slightly) simpler problem is to use catch/re-catch probabilities to
> estimate population sizes of a given species in a closed system like
> fish in a lake.

>> My guess is that this problem's complexity is about the same.
>
> You would think so, but isn't the catch-recatch method to estimate
> population size a solved problem?

Real life statisticians might be intimidated by people who question their assumptions. But similarly I am intimidated by statisticians who have real world experience, and who know the literature, as well as the proper lingo.


So I thought I vaguely knew what the capture-recapture method is (don't you tag the birds you catch and see how long it is until you catch them again?). But I didn't know any more than that (for instance, that it was a solved problem).
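For reference, the textbook version of capture-recapture does not time the recaptures. You mark M animals and release them, later catch C animals, and count how many of those, R, carry a mark. Assuming a closed population and equal catchability, the Lincoln-Petersen estimate of the population size is M*C/R. Chapman's variant, (M+1)(C+1)/(R+1) - 1, stays finite when R = 0. A minimal sketch with made-up numbers; the function names are hypothetical:

```python
def lincoln_petersen(marked, caught, recaptured):
    """Lincoln-Petersen estimate of population size: M * C / R."""
    if recaptured == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    return marked * caught / recaptured


def chapman(marked, caught, recaptured):
    """Chapman's variant: (M + 1)(C + 1)/(R + 1) - 1, defined even when R = 0."""
    return (marked + 1) * (caught + 1) / (recaptured + 1) - 1


if __name__ == "__main__":
    # Made-up example: 100 fish marked, 60 caught later, 12 of them marked.
    print(lincoln_petersen(100, 60, 12))  # 100 * 60 / 12 = 500.0
    print(round(chapman(100, 60, 12), 1))
```

Both estimators rest on the idea that the marked fraction in the second catch, R/C, approximates the marked fraction in the whole population, M/N.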

Stephen

_______________________________________________
discussion mailing list
EMAIL:PROTECTED
http://mlug.missouri.edu/mailman/listinfo/discussion