MLUG: Re: [MLUG - DISCUSSION] statistical inference
On Mon, 5 Jun 2006, Stephen Montgomery-Smith wrote:

Now with my species count problem, if you suppose that the number of species is n, then a randomly caught animal belongs to species k with probability pk, where pk>=0 and sum pk = 1. What prior distribution should you put on these p's? The uninformed distribution is 1/(p1p2...pn), which, like the previous example, means that any species you haven't seen will, after normalization, with certainty not exist.

If you are interested in discovering what n is, how can you have a model that doesn't include a prior distribution on unknown n? Do you have probabilities p1,p2,...,pn conditional on n for every possible n and also a prior distribution for n?


You have a finite sample, so the correct model really involves n1,n2,n3,...,nn where the ni are the numbers of members of the species that could be captured. You could treat this as multivariate hypergeometric, or a weighted version of that, but you want to avoid assuming that an animal that is down to a population of 10, say, could account for 100 of the ascertained animals. Maybe this isn't a problem if you assume a very large population for every animal so that the multinomial can apply, but in reality, some animals exist in small numbers and the sampling with/without replacement issue will be important.
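To make the with/without replacement point concrete, here is a minimal sketch using only the Python standard library. The species names and population sizes are made up for illustration. Sampling without replacement (the multivariate hypergeometric case) can never report more animals of a species than exist, while sampling with replacement (the multinomial case) can.

```python
import random
from collections import Counter

def sample_counts(populations, sample_size, replace, seed=0):
    """Draw animals from a finite pool and count how many of each species appear.

    populations: dict mapping species name -> number of animals (hypothetical data).
    replace=False mimics multivariate hypergeometric sampling,
    replace=True mimics multinomial sampling.
    """
    rng = random.Random(seed)
    pool = [name for name, size in populations.items() for _ in range(size)]
    if replace:
        drawn = rng.choices(pool, k=sample_size)
    else:
        drawn = rng.sample(pool, sample_size)
    return Counter(drawn)

populations = {"rare": 10, "common": 40}
without_replacement = sample_counts(populations, 45, replace=False)
with_replacement = sample_counts(populations, 45, replace=True)

# Without replacement the rare count is capped at its true population of 10.
print(without_replacement["rare"], with_replacement["rare"])
```

With a very large population for every species the two schemes give nearly the same counts, which is the situation where the multinomial approximation is harmless.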


So you are looking for potential prior distributions on this (n-1) dimensional "tetrahedron" (mathematicians call it a simplex). If you do the obvious uniform distribution, and do the calculations (which are not so easy by the way) you get an answer that doesn't depend on how many of each species you observed, only on how many species you observed.
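The claim that the answer depends only on the number of species observed can be checked numerically. The sketch below rests on my own assumptions, not on anything stated in the thread: the uniform prior on the simplex is treated as a Dirichlet(1,...,1), species labels are exchangeable, and n gets a uniform prior truncated at a hypothetical cutoff n_max. Under those assumptions the marginal likelihood of the data given n is proportional to n!/(n-k)! * (n-1)!/(n+N-1)! times a factor that does not involve n, where k is the number of species seen and N is the sample size.

```python
from math import exp, lgamma

def log_marginal(counts, n):
    """Log marginal likelihood of the observed counts given n species,
    assuming a Dirichlet(1,...,1) prior on the species probabilities."""
    k = len(counts)           # number of distinct species observed
    total = sum(counts)       # sample size N
    if n < k:
        return float("-inf")  # cannot have seen more species than exist
    return (lgamma(n + 1) - lgamma(n - k + 1)        # ways to label the seen species
            + lgamma(n) - lgamma(n + total)          # Dirichlet normalising terms
            + sum(lgamma(x + 1) for x in counts))    # does not depend on n

def posterior_n(counts, n_max):
    """Posterior over n = 1..n_max under a uniform prior truncated at n_max."""
    logs = [log_marginal(counts, n) for n in range(1, n_max + 1)]
    top = max(logs)
    weights = [exp(value - top) for value in logs]
    norm = sum(weights)
    return [w / norm for w in weights]

# Two samples with the same k = 4 and N = 10 but very different counts.
lopsided = posterior_n([7, 1, 1, 1], 200)
balanced = posterior_n([3, 3, 2, 2], 200)
print(max(abs(a - b) for a, b in zip(lopsided, balanced)))
```

The two posteriors agree up to rounding, because the only term that involves the individual counts cancels when the posterior is normalised.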

That makes sense to me because the prior on the ps, weird as it is, has a symmetry that gives all ps equal prior weight and it has a bizarre infinite ballooning effect on density of ps for the observed events. What happens if you use 1/sqrt(p1p2...pn) as the prior? I think that's the Jeffreys' prior. That would kill the weird infinite ballooning phenomenon.
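A small sketch of the difference between these priors, for a fixed number of species. All three are symmetric Dirichlet priors with a common parameter a: the 1/(p1p2...pn) prior is the limit a -> 0, 1/sqrt(p1p2...pn) is a = 1/2, and the uniform prior is a = 1. The posterior mean of each pk is then (xk + a) / (N + n*a), where xk is the observed count. The counts below are made up, and the a = 0 line is the limiting value only, since the posterior itself is improper when some count is zero.

```python
def posterior_means(counts, a):
    """Posterior mean of each species probability under a symmetric
    Dirichlet(a, ..., a) prior and multinomial sampling."""
    total = sum(counts) + a * len(counts)
    return [(x + a) / total for x in counts]

counts = [5, 3, 0]  # hypothetical data: the third species was never observed

haldane_limit = posterior_means(counts, 0.0)   # 1/(p1 p2 ... pn), limiting case
jeffreys = posterior_means(counts, 0.5)        # 1/sqrt(p1 p2 ... pn)
uniform = posterior_means(counts, 1.0)         # flat on the simplex

for name, means in [("a=0", haldane_limit), ("a=1/2", jeffreys), ("a=1", uniform)]:
    print(name, [round(m, 4) for m in means])
```

With a = 0 the unseen species gets posterior mean exactly 0, which is the "with certainty not exist" behaviour; any a > 0 leaves it some positive weight.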



I did try other distributions on this (n-1)-simplex, but none of them really worked well, although they all did better than the uniform distribution.

I can see another way that this is tricky. You know some species are more common than others, but how can you assign priors to the species when you might not even know what the species are?! So it makes sense that the prior would be symmetrical or uniform unless you know something about the species you will be studying.



(I assumed a uniform distribution on the prior of n itself, but my sense is that the prior on n will not play such a big role. It is the "curse of high dimensionality" that really shows up the inadequate nature of uninformed prior distributions.)

A uniform distribution (improper, I assume) means that it is as likely that there are 1,000,000 species as that there are 2 species. That seems unreasonable. I don't know how much your results would change if you made the uniform distribution end at some large value, or made it triangular, say, so that it dropped off on the high end or on both ends.
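One way to probe that question is a quick sensitivity check. The sketch below is built on my own assumptions: a uniform (Dirichlet(1,...,1)) prior on the species probabilities, so that the likelihood of n depends only on the number of species seen (k) and the sample size (N), and two hypothetical priors on n, one uniform up to a cutoff and one triangular that falls linearly to zero at the cutoff. The values of k, N and the cutoffs are made up.

```python
from math import exp, lgamma

def log_likelihood(n, k, N):
    """Log likelihood of n (for n >= k) given k species seen among N animals,
    assuming a uniform prior on the simplex; terms constant in n are dropped."""
    return lgamma(n + 1) - lgamma(n - k + 1) + lgamma(n) - lgamma(n + N)

def posterior_mean_n(k, N, cutoff, prior="uniform"):
    """Posterior mean of n under a prior supported on k..cutoff."""
    ns = range(k, cutoff + 1)
    logs = [log_likelihood(n, k, N) for n in ns]
    top = max(logs)  # subtract the maximum to avoid underflow
    weights = []
    for n, value in zip(ns, logs):
        prior_weight = 1.0 if prior == "uniform" else float(cutoff + 1 - n)
        weights.append(prior_weight * exp(value - top))
    return sum(n * w for n, w in zip(ns, weights)) / sum(weights)

# Many repeats of each species (k=5, N=20) versus almost none (k=9, N=10).
well_sampled = [posterior_mean_n(5, 20, c) for c in (1_000, 50_000)]
poorly_sampled = [posterior_mean_n(9, 10, c) for c in (1_000, 50_000)]
triangular = posterior_mean_n(9, 10, 1_000, prior="triangular")

print(well_sampled, poorly_sampled, triangular)
```

In this sketch the cutoff barely matters when most species have been seen several times, but when nearly every animal caught is a new species the posterior mean of n is driven almost entirely by where the prior is cut off and how it tapers.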



Quite likely I am going to put a lot of thought into this problem next year (now I have other projects), but one possibility I am considering is that the Kolmogorov laws of probability don't always apply.

You frighten me. :-)

Well, I overstate it a bit. An underlying assumption is that there is a numerical value of "believability" that you can place on any real-life event. But the correct uninformed prior distribution is 0/0, or NaN (not a number), that is, undefined. That is, you cannot place a numerical value on it. (Indeed the very fact that the uninformed priors are not proper should immediately tell you that something is amiss.)

There is more than one choice. I don't think you should say that your prior is "the correct uninformed prior." See my earlier messages about Jaynes' prior and Jeffreys' prior.



But then after doing some experiments, maybe you still cannot place a precise numerical value on the distribution, but it is somehow midway between NaN and a genuine numerical distribution (e.g. "probably bigger than 5 but definitely less than 6" does not tell you what the distribution is, yet is not a totally uninformative statement).

I have totally no idea how to make any of this work. But I think it is worth thinking about.

I think the Jaynes' prior is really messing things up. It must only be useful when there is at least one observation in every category. Thus, I don't think it is useful in your work.



> So how simple a problem do you think we should go to? I think a (slightly) simpler problem is to use catch/re-catch probabilities to estimate population sizes of a given species in a closed system like fish in a lake.

My guess is that this problem's complexity is about the same.

You would think so, but isn't the catch-recatch method to estimate population size a solved problem?

Real life statisticians might be intimidated by people who question their assumptions. But similarly I am intimidated by statisticians who have real world experience, and who know the literature, as well as the proper lingo.


So I thought I vaguely knew what the capture-recapture method is (don't you tag the birds you catch and see how long it is until you catch them again?). But I didn't know any more than that (for example, that it was a solved problem).

The terminology for Google searching is really "capture recapture." Here's a simple page:


http://www.figurethis.org/challenges/c52/challenge.htm

The assumptions are unstated there but pretty obvious. I'm sure there are much more sophisticated approaches.
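For reference, the textbook two-sample version can be sketched in a few lines. This is the standard Lincoln-Petersen estimate plus Chapman's bias-corrected variant, not necessarily what the page above uses, and the numbers are made up. The usual assumptions are a closed population, tags that are not lost, and every animal being equally likely to be caught on both occasions.

```python
def lincoln_petersen(tagged_first, caught_second, recaptured):
    """Estimate population size as (tagged * caught) / recaptured.

    tagged_first:  animals caught, tagged and released on the first visit
    caught_second: animals caught on the second visit
    recaptured:    animals in the second catch that already carry a tag
    """
    if recaptured == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    return tagged_first * caught_second / recaptured

def chapman(tagged_first, caught_second, recaptured):
    """Chapman's variant, which stays finite when there are no recaptures."""
    return (tagged_first + 1) * (caught_second + 1) / (recaptured + 1) - 1

# Hypothetical lake: 50 fish tagged, then 40 caught later, 8 of them tagged.
estimate = lincoln_petersen(50, 40, 8)
corrected = chapman(50, 40, 8)
print(estimate, round(corrected, 2))
```

The idea is that the tagged fraction in the second catch (8/40) should match the tagged fraction in the whole lake (50/N), which gives N of about 250 here.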

Mike
