A new demonstration
Consider the cell $a$, and call $b$ its complement; given any evidence, $p(a \mid \cdot) + p(b \mid \cdot) = 1$. In $p(a \mid m,s)$ the evidence is described by the number of successes $m$ and the number of trials $s$. Hence, on the same evidence, $n_a = m$ and $n_b = s - m$, and the normalization condition can be written as $p(a \mid m,s) + p(b \mid s-m,s) = 1$; $p(a) := p(a \mid 0,0)$ is the initial probability, and $p(a) + p(b) = 1$. To calculate $p(a \mid m,s)$, we consider the pattern $a_1, a_2, \ldots, a_m, b_{m+1}, \ldots, b_s$. Along $a_1, a_2, \ldots, a_m$ we have only failures for $b$, and there are $m$ of them: then $p(b \mid 0,m) = Q(0) \cdots Q(m-1)\,p(b)$, and $p(a \mid m,m) = 1 - p(b \mid 0,m)$ is a function of $Q(0), \ldots, Q(m-1)$ and $p(a) = 1 - p(b)$. Along the second part of the path, that is, $b_{m+1}, \ldots, b_s$, we have $s - m$ failures for $a$, and then $p(a \mid m,s) = Q(m) \cdots Q(s-1)\,p(a \mid m,m)$. Thus $p(a \mid m,s)$ is a function of $Q(0), \ldots, Q(s-1)$ and $p(a)$. The same can be done for $p(b \mid s-m,s)$ by following the pattern $b_1, \ldots, b_{s-m}, a_{s-m+1}, \ldots, a_s$, which yields a function of $Q(0), \ldots, Q(s-1)$ and $p(b)$. The condition $p(a \mid m,s) + p(b \mid s-m,s) = 1$ provides a recurrence equation for $Q(i)$, which can be solved in terms of $Q(0)$.
Setting $\lambda = \dfrac{Q(0)}{1 - Q(0)} = \dfrac{\eta}{1 - \eta}$, its solution is

\[ Q(i) = \frac{\lambda + i}{\lambda + i + 1}. \tag{14} \]
If we substitute (14) in the previous formulas, it follows that
\[ P(X_{n+1} = j \mid D) = \frac{\lambda p_j + n_j}{\lambda + n} \]

(Q.E.D.).
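The path argument above can be checked numerically. The sketch below is only an illustration, not part of the original proof: it assumes that the relevance quotient has the closed form $Q(i) = (\lambda + i)/(\lambda + i + 1)$, builds $p(a \mid m,s)$ step by step along the pattern $a_1, \ldots, a_m, b_{m+1}, \ldots, b_s$, and compares the result with $(\lambda p(a) + m)/(\lambda + s)$. The function names `Q`, `p_via_path` and `p_closed` are hypothetical, and exact rational arithmetic is used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction


def Q(i, lam):
    """Assumed closed form of the relevance quotient after i observations."""
    return (lam + i) / (lam + i + 1)


def p_via_path(p_init, m, s, lam):
    """p(cell | m successes, s trials), built along m successes then s - m failures."""
    p_other = 1 - p_init
    # The first m observations are failures for the complementary cell.
    for i in range(m):
        p_other *= Q(i, lam)
    p = 1 - p_other  # probability of the cell after m successes in m trials
    # The remaining s - m observations are failures for the cell itself.
    for i in range(m, s):
        p *= Q(i, lam)
    return p


def p_closed(p_init, m, s, lam):
    """Closed form (lam * p + m) / (lam + s)."""
    return (lam * p_init + m) / (lam + s)


if __name__ == "__main__":
    lam = Fraction(3, 2)   # illustrative value of lambda
    p_a = Fraction(1, 3)   # illustrative initial probability of a
    for s in range(6):
        for m in range(s + 1):
            pa = p_via_path(p_a, m, s, lam)
            pb = p_via_path(1 - p_a, s - m, s, lam)
            assert pa == p_closed(p_a, m, s, lam)
            assert pa + pb == 1  # normalization condition
    print("path products agree with the closed form")
```

For every pair $(m, s)$ tried, the two path products sum to one, which is the normalization condition the proof relies on.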
It is worth noting that Karl Pearson referred to the problem of reaching the values of (2) as the fundamental problem of practical statistics (Pearson, 1920). With this in mind, we shall call the solution of the problem posed by Pearson the main theorem. The main theorem was first proved by Johnson (1932), who presumably did not know that some years earlier Pearson had faced the same problem. However, Johnson's proof was incomplete because it implicitly assumes the equality of all initial probabilities. A satisfactory proof of the main theorem was given by Kemeny (1963). The proof given by Carnap (see Carnap, 1980) is essentially that of Kemeny. All these proofs are based on the $\lambda$-principle and, with the exception of Carnap's, assume that $d > 2$ and do not address dichotomies.
The main theorem states that, if a probability is regular, exchangeable and invariant, then

\[ P(X_{n+1} = j \mid D) = \frac{\lambda p_j + n_j}{\lambda + n} \tag{15} \]

holds, where $p_j = P(X_{n+1} = j)$ is the initial probability of the cell $j$ and

\[ \lambda = \frac{\eta}{1 - \eta}. \tag{16} \]

$p_j$ and $\lambda$ are free parameters of (15), and $\lambda > 0$ because of regularity. Thus, in order to arrive at specific predictive probabilities, that is, numerical values of (15), we must choose the initial probability of each cell and the value $\eta$ of the relevance quotient. As a consequence, different values of these parameters yield different values of probability. We call $\lambda p_j$ the initial weight of the cell $j$ and $\lambda = \sum_j \lambda p_j$ the total weight. As $\lambda + n = \sum_j (\lambda p_j + n_j)$, (15) shows that the probability of $j$ is the normalized final weight of $j$, this being the sum of its initial weight and its occupation number. If we write (15) as

\[ P(X_{n+1} = j \mid D) = \frac{\lambda}{\lambda + n}\, p_j + \frac{n}{\lambda + n}\, \frac{n_j}{n}, \]

it becomes apparent that the predictive probability is a weighted mean of a prior factor, the initial probability, which is determined before any observation has been performed, and an empirical factor, the relative frequency, which can be determined only after observations have been performed.
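As a small worked illustration of the two readings of the predictive probability, the sketch below computes it once as a normalized final weight, $(\lambda p_j + n_j)/(\lambda + n)$, and once as a weighted mean of the initial probability and the relative frequency, and checks that the two agree. The function names and the numerical values of $\lambda$, the initial probabilities and the occupation numbers are made up for the example.

```python
from fractions import Fraction


def predictive(lam, p, counts):
    """Normalized final weights (lam * p_j + n_j) / (lam + n), one per cell."""
    n = sum(counts)
    return [(lam * pj + nj) / (lam + n) for pj, nj in zip(p, counts)]


def weighted_mean(lam, p, counts):
    """Same probabilities as a mean of prior p_j and relative frequency n_j / n (needs n > 0)."""
    n = sum(counts)
    w_prior = lam / (lam + n)   # weight of the initial probability
    w_emp = n / (lam + n)       # weight of the relative frequency
    return [w_prior * pj + w_emp * Fraction(nj, n) for pj, nj in zip(p, counts)]


if __name__ == "__main__":
    lam = Fraction(3)                                      # illustrative total weight
    p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # illustrative initial probabilities
    counts = [4, 1, 0]                                     # illustrative occupation numbers, n = 5

    probs = predictive(lam, p, counts)
    assert probs == weighted_mean(lam, p, counts)
    assert sum(probs) == 1
    print([str(q) for q in probs])  # ['11/16', '1/4', '1/16']
```

With these values the prior weight is $3/8$ and the empirical weight is $5/8$, so a cell never observed (the third one) still keeps a positive probability, $1/16$.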
Referring to the definition of $\lambda$ given in (16), we see that in the case in which $\eta = 1$, $\lambda$ is meaningless. This means that in (15) $\lambda$ can grow without limit but cannot be infinite. Thus, stochastic independence appears in (15) as a limiting case. In this limiting case, the final probability is always equal to the initial probability $p_j$, whatever the evidence is. It is easy to see that positive values of $\lambda$ introduce a positive correlation, while negative values of $\lambda$ introduce a negative correlation. All proofs of the main theorem in the line of thought suggested by Johnson consider $0 < \lambda < \infty$. Thus, on the basis of these studies, we are not entitled to work with negative values of $\lambda$. However, the main theorem has been proved for negative values of $\lambda$, too (Costantini and Garibaldi, 1991). This extension is not very important for inductive applications, because in statistical inference only positive values of $\lambda$ have been considered. In statistical inference, negative correlation can be dealt with by assuming only C2 (see Carnap, 1950, p. 207), that is, by considering direct inferences. Carnap called "direct" a statistical inference aiming at determining the probabilities of all the possible statistical distributions of a sample drawn from a population whose statistical distribution is known. In other words, when the statistical distribution of the population is known, by using exchangeability and nothing else it is possible to prove that the hypergeometric distribution specifies the probability of all samples that can be drawn from the population. In this case invariance is automatically satisfied.
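The role of the sign and size of the total weight can be illustrated numerically. The sketch below is an illustration under stated assumptions, not a result taken from the works cited: it evaluates the predictive probability $(\lambda p_j + n_j)/(\lambda + n)$ for growing positive $\lambda$, where it approaches $p_j$, and for a negative value $\lambda = -N$ with $p_j = N_j/N$, where it coincides with drawing without replacement from a hypothetical urn of $N$ balls, $N_j$ of which are of type $j$. The urn, the function names and all numbers are assumptions introduced for the example.

```python
from fractions import Fraction


def predictive(lam, p_j, n_j, n):
    """Predictive probability (lam * p_j + n_j) / (lam + n) for a single cell."""
    return (lam * p_j + n_j) / (lam + n)


def without_replacement(N, N_j, n_j, n):
    """Chance that the next draw is of type j after n draws (n < N), n_j of them of type j."""
    return Fraction(N_j - n_j, N - n)


if __name__ == "__main__":
    # Growing positive lambda: the evidence (9 successes in 10 trials) matters less and less.
    p_j = Fraction(1, 4)
    for lam in (1, 10, 1000, 10**6):
        print(lam, float(predictive(lam, p_j, 9, 10)))

    # Positive lambda: one observed success raises the probability above p_j.
    assert predictive(Fraction(2), p_j, 1, 1) > p_j

    # Negative lambda = -N with p_j = N_j / N: one observed success lowers it,
    # and the value equals that of sampling without replacement.
    N, N_j = 10, 4
    lam_neg, p_urn = Fraction(-N), Fraction(N_j, N)
    assert predictive(lam_neg, p_urn, 1, 1) < p_urn
    assert predictive(lam_neg, p_urn, 2, 5) == without_replacement(N, N_j, 2, 5)
```

The negative case only makes sense while $n < N$ and $n_j \le N_j$, which matches the finite population behind a direct inference.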