On Mon, 5 Jun 2006, Stephen Montgomery-Smith wrote:
Now with my species count problem, if you suppose that the number of
species is n, then the species occur with probabilities p1,p2,...,pn,
where pk>=0 and sum pk = 1. What prior distribution should you put on
these p's? The uninformed distribution is 1/(p1p2...pn), which, like
the previous example, means that any species you haven't seen will,
after normalization, with certainty not exist.
If you are interested in discovering what n is, how can you have a model
that doesn't include a prior distribution on unknown n? Do you have
probabilities p1,p2,...,pn conditional on n for every possible n and also
a prior distribution for n?
You have a finite sample, so the correct model really involves
n1,n2,n3,...,nn where the ni are the numbers of members of the species
that could be captured. You could treat this as multivariate
hypergeometric, or a weighted version of that, but you want to avoid
assuming that an animal that is down to a population of 10, say, could
account for 100 of the ascertained animals. Maybe this isn't a problem if
you assume a very large population for every animal so that the
multinomial can apply, but in reality, some animals exist in small numbers
and the sampling with/without replacement issue will be important.
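To make the with/without replacement point concrete, here is a minimal sketch in Python. All the numbers are made up for illustration: a total population of 1000 animals, a rare species with only 10 members, and a sample of 500. It compares the chance of "catching" more than 10 members of the rare species under a binomial (with replacement) model and under a hypergeometric (without replacement) model.

```python
from math import comb

def binomial_tail(threshold, m, p):
    """P(X > threshold) when m animals are drawn with replacement."""
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(threshold + 1, m + 1))

def hypergeometric_tail(threshold, N, K, m):
    """P(X > threshold) when m animals are drawn without replacement."""
    # comb(K, k) is 0 whenever k > K, so impossible counts contribute nothing.
    ways = sum(comb(K, k) * comb(N - K, m - k)
               for k in range(threshold + 1, m + 1))
    return ways / comb(N, m)

# Hypothetical numbers, chosen only for illustration.
N, K, m = 1000, 10, 500

print("with replacement:   ", binomial_tail(K, m, K / N))
print("without replacement:", hypergeometric_tail(K, N, K, m))
```

The with-replacement model gives a small but nonzero probability to an impossible outcome, while the without-replacement model gives it exactly zero. When every species is abundant relative to the sample, the two models should nearly agree.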
So you are looking for potential prior distributions on this (n-1)
dimensional "tetrahedron" (mathematicians call it a simplex). If you do
the obvious uniform distribution, and do the calculations (which are not
so easy by the way) you get an answer that doesn't depend on how many of
each species you observed, only on how many species you observed.
That makes sense to me because the prior on the ps, weird as it is, has a
symmetry that gives all ps equal prior weight and it has a bizarre
infinite ballooning effect on density of ps for the observed events.
What happens if you use 1/sqrt(p1p2...pn) as the prior? I think that's
Jeffreys' prior. That would kill the weird infinite ballooning
phenomenon.
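For a fixed n, the three priors mentioned so far are all symmetric Dirichlet densities proportional to prod pk^(alpha-1): alpha = 0 gives 1/(p1p2...pn), alpha = 1/2 gives 1/sqrt(p1p2...pn), and alpha = 1 gives the uniform prior. The sketch below uses the standard Dirichlet-multinomial result that the posterior mean of pk is (ck + alpha) / (N + n*alpha), where ck is the count for species k and N is the total sample size. The counts are made up, and alpha = 0 should be read as the limit alpha -> 0, since the posterior is not proper when a count is zero.

```python
from fractions import Fraction

def posterior_mean(counts, alpha):
    """Posterior mean of each p_k under a symmetric Dirichlet(alpha) prior."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# Hypothetical counts: three species, the third one never observed.
counts = [7, 3, 0]

priors = [
    ("1/(p1p2...pn)", Fraction(0)),         # limit alpha -> 0
    ("1/sqrt(p1p2...pn)", Fraction(1, 2)),  # Jeffreys' prior
    ("uniform", Fraction(1)),
]

for name, alpha in priors:
    means = posterior_mean(counts, alpha)
    print(name, [str(m) for m in means])
```

With alpha = 0 the unobserved species gets posterior mean exactly 0, which is the "with certainty not exist" behaviour. Any alpha > 0 leaves it some mass.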
I did try other distributions on this (n-1)-simplex, but none of them
really worked well, although they all did better than the uniform
distribution.
I can see another way that this is tricky. You know some species are more
common than others, but how can you assign priors to the species when you
might not even know what the species are?! So it makes sense that the
prior would be symmetrical or uniform unless you know something about the
species you will be studying.
(I assumed a uniform distribution on the prior of n itself, but my sense
is that the prior on n will not play such a big role. It is the "curse
of high dimensionality" that really shows up the inadequate nature of
uninformed prior distributions.)
A uniform distribution (improper, I assume) means that it is as likely
that there are 1,000,000 species as that there are 2 species. That seems
unreasonable. I don't know how much your results would change if you made
the uniform distribution end at some large value, or made it triangular,
say, so that it dropped off on the high end or on both ends.
Quite likely I am going to put a lot of thought into this problem next
year (now I have other projects), but one possibility I am considering
is that the Kolmogorov laws of probability don't always apply.
You frighten me. :-)
Well, I overstate it a bit. An underlying assumption is that there is a
numerical value of "believability" that you can place on any real life
event. But the correct uninformed prior distribution is 0/0, or NaN (not
a number), that is, undefined. That is, you cannot place a numerical
value on it. (Indeed the very fact that the uninformed priors are not
proper should immediately tell you that something is amiss.)
There is more than one choice. I don't think you should say that your
prior is "the correct uninformed prior." See my earlier messages about
Jaynes' prior and Jeffreys' prior.
But then after doing some experiments, maybe you still cannot place a
precise numerical value on the distribution, but it is somehow midway
between NaN and a genuine numerical distribution (e.g. "probably bigger
than 5 but definitely less than 6" does not tell you what the
distribution is, yet is not a totally uninformative statement).
I have absolutely no idea how to make any of this work. But I think it is
worth thinking about.
I think the Jaynes' prior is really messing things up. It must only be
useful when there is at least one observation in every category. Thus, I
don't think it is useful in your work.
> So how simple a problem do you think we should go to? I think a
> (slightly) simpler problem is to use catch/re-catch probabilities to
> estimate population sizes of a given species in a closed system like
> fish in a lake.
My guess is that this problem's complexity is about the same.
You would think so, but isn't the catch-recatch method to estimate
population size a solved problem?
Real life statisticians might be intimidated by people who question
their assumptions. But similarly I am intimidated by statisticians who
have real world experience, and who know the literature, as well as the
proper lingo.
So I thought I vaguely knew what the capture-recapture method is (don't
you tag the birds you catch and see how long it is until you catch them
again?). But I didn't know any more than that (like that it was a solved
problem).
The terminology for Google searching is really "capture recapture."
Here's a simple page:
http://www.figurethis.org/challenges/c52/challenge.htm
The assumptions are unstated there but pretty obvious. I'm sure there are
much more sophisticated approaches.
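For reference, the usual simple version of capture-recapture is the Lincoln-Petersen estimate: tag M animals, release them, later catch C animals and count how many (R) carry tags, then estimate the population as M*C/R. The sketch below shows that estimate alongside Chapman's variant, which stays finite when R is 0. The function names and the numbers are mine, not taken from the linked page, and both estimates assume a closed population with every animal equally likely to be caught.

```python
def lincoln_petersen(marked, caught, recaptured):
    """Estimate population size as marked * caught / recaptured."""
    if recaptured == 0:
        raise ValueError("no tagged animals recaptured; estimate is undefined")
    return marked * caught / recaptured

def chapman(marked, caught, recaptured):
    """Chapman's variant, which stays finite when nothing is recaptured."""
    return (marked + 1) * (caught + 1) / (recaptured + 1) - 1

# Hypothetical lake: tag 50 fish, later catch 40, of which 10 carry tags.
print(lincoln_petersen(50, 40, 10))
print(chapman(50, 40, 10))
```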
Mike