equals StartFraction p left-parenthesis y Subscript j Baseline bar x Subscript i Baseline right-parenthesis p left-parenthesis x Subscript i Baseline right-parenthesis Over sigma-summation Underscript i equals 1 Overscript upper L Endscripts p left-parenthesis y Subscript j Baseline bar x Subscript i Baseline right-parenthesis p left-parenthesis x Subscript i Baseline right-parenthesis EndFraction"/>
Bayes' Rule provides an update rule for probability distributions in response to observed information. Terminology:
$p(x_i)$ is referred to as the “prior distribution on X” in this context.
$p(x_i \mid y_j)$ is referred to as the “posterior distribution on X given Y.”
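As a small illustration of the update from prior to posterior, the following Python sketch applies Bayes' Rule for a single observed outcome y_j. The function name `posterior` and the numerical values are hypothetical, chosen only for illustration; they do not come from the text.

```python
def posterior(prior, likelihood):
    """Return p(x_i | y_j) for every i.

    prior[i]      = p(x_i)
    likelihood[i] = p(y_j | x_i) for one fixed observed outcome y_j
    """
    # Numerator of Bayes' Rule for each i: p(y_j | x_i) p(x_i)
    joint = [l * p for l, p in zip(likelihood, prior)]
    # Denominator: sum over i of p(y_j | x_i) p(x_i)
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("observed outcome has zero probability under the prior")
    return [j / evidence for j in joint]


# Hypothetical two-outcome example (values are assumptions for illustration).
prior = [0.5, 0.5]        # p(x_1), p(x_2)
likelihood = [0.9, 0.3]   # p(y_j | x_1), p(y_j | x_2)
print(posterior(prior, likelihood))
```

In this example the observation favors x_1, so the posterior shifts weight toward x_1 relative to the uniform prior.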
2.5.3 Estimation Based on Maximal Conditional Probabilities
There are two ways to form an estimate from a conditional probability. The first, maximum a posteriori (MAP), chooses the outcome $x_i$ that maximizes the posterior probability $p(x_i \mid y_j)$. The second, maximum likelihood (ML), chooses the $x_i$ that maximizes $p(y_j \mid x_i)$, which is referred to as a “likelihood” in this context because it is viewed as a function of the conditioning variable.
MAP Estimate:
Provides an estimate of the r.v. X, given that Y = $y_j$, as the outcome with the largest posterior probability:
$$\hat{x}_{\mathrm{MAP}} = \arg\max_{x_i} \, p(x_i \mid y_j)$$
ML Estimate:
Provides an estimate of the r.v. X, given that Y = $y_j$, as the outcome with the largest likelihood:
$$\hat{x}_{\mathrm{ML}} = \arg\max_{x_i} \, p(y_j \mid x_i)$$
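The difference between the two estimates can be seen in a short Python sketch. The function names `map_estimate` and `ml_estimate`, the outcome labels, and the probabilities below are all hypothetical and serve only as an illustration. Since the denominator of Bayes' Rule does not depend on i, the MAP estimate can be found by maximizing the product p(y_j | x_i) p(x_i) directly.

```python
def map_estimate(outcomes, prior, likelihood):
    """Outcome x_i maximizing the posterior p(x_i | y_j).

    The normalizing denominator is the same for every i, so it is
    enough to maximize p(y_j | x_i) * p(x_i).
    """
    scores = [l * p for l, p in zip(likelihood, prior)]
    best = max(range(len(outcomes)), key=scores.__getitem__)
    return outcomes[best]


def ml_estimate(outcomes, likelihood):
    """Outcome x_i maximizing the likelihood p(y_j | x_i); the prior is ignored."""
    best = max(range(len(outcomes)), key=likelihood.__getitem__)
    return outcomes[best]


# Hypothetical values: a strong prior on "a", but the data favor "b".
outcomes = ["a", "b"]
prior = [0.9, 0.1]        # p(x_i)
likelihood = [0.3, 0.8]   # p(y_j | x_i) for the observed y_j

print("MAP:", map_estimate(outcomes, prior, likelihood))
print("ML: ", ml_estimate(outcomes, likelihood))
```

With these numbers the two estimates disagree: the prior is strong enough to outweigh the likelihood in the MAP estimate. With a uniform prior the two estimates coincide.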
2.6 Emergent Distributions and Series
In this section we consider a r.v., X, with specific examples where the outcomes of X are fully enumerated (such as the 0 or 1 outcomes corresponding to a coin flip). We review a series of observations of the r.v., X, to arrive at the Law of Large Numbers (LLN). The emergent structure that describes a r.v. from a series of observations is often expressed in terms of probability distributions, the most famous being the Gaussian Distribution (a.k.a. the Normal, or Bell curve).
2.6.1 The Law of Large Numbers (LLN)
The LLN will now be derived in the classic “weak” form. The “strong” form is derived in the modern mathematical context of Martingales in Appendix C.1.
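Let $X_k$, for k = 1, 2, …, N, be independent identically distributed (iid) copies of X, and let the “alphabet” of X be real numbers. Let μ = E(X), σ² = Var(X), and denote the sample mean by
$$\bar{X}_N = \frac{1}{N}\sum_{k=1}^{N} X_k$$
so that $E(\bar{X}_N) = \mu$ and, by independence, $\mathrm{Var}(\bar{X}_N) = \sigma^2/N$.
From Chebyshev: $P(|\bar{X}_N - \mu| \ge \epsilon) \le \sigma^2/(N\epsilon^2)$ for any ε > 0.
As N ➔ ∞ we get the LLN (weak):
If $X_k$ are iid copies of X, for k = 1, 2, …, and X has a real and finite alphabet, and μ = E(X), σ² = Var(X), then: $P(|\bar{X}_N - \mu| \ge \epsilon) \to 0$ as N ➔ ∞, for any ε > 0.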
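The weak LLN can be checked numerically with a minimal simulation sketch. It assumes a coin-flip r.v. (outcomes 0 or 1 with probability p of a 1), so that μ = p and σ² = p(1 − p). The function names, the tolerance `eps`, the number of trials, and the seed are all assumptions made for illustration.

```python
import random


def sample_mean(n, p, rng):
    """Mean of n iid coin flips, each equal to 1 with probability p."""
    return sum(1 if rng.random() < p else 0 for _ in range(n)) / n


def deviation_frequency(n, p, eps, trials, rng):
    """Fraction of trials in which the sample mean misses mu = p by at least eps."""
    hits = sum(1 for _ in range(trials)
               if abs(sample_mean(n, p, rng) - p) >= eps)
    return hits / trials


def chebyshev_bound(n, p, eps):
    """Chebyshev upper bound sigma^2 / (n * eps^2), with sigma^2 = p(1 - p)."""
    return p * (1 - p) / (n * eps ** 2)


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 1000):
    freq = deviation_frequency(n, 0.5, 0.1, 200, rng)
    print(n, freq, chebyshev_bound(n, 0.5, 0.1))
```

The observed frequency of large deviations should shrink as n grows, and the Chebyshev bound (which exceeds 1, and is therefore uninformative, at small n) shrinks like 1/n.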
2.6.2 Distributions
2.6.2.1 The Geometric Distribution (Emergent Via Maxent)
Here we consider the probability of first seeing an event on the kth try, when the probability of seeing that event at each try is “p.” If we see the event for the first time on try k, then the first (k − 1) tries were nonevents (each with probability (1 − p)), and the final observation occurs with probability p, giving rise to the classic formula for the geometric distribution:
$$P(X = k) = (1 - p)^{k-1}\, p, \quad k = 1, 2, 3, \ldots$$
Figure 2.3 The Geometric distribution, $P(X = k) = (1 - p)^{k-1}\, p$, with p = 0.8.
To check normalization, i.e., that the probabilities of all outcomes sum to one, we use the geometric series (valid for 0 < p ≤ 1):
$$\text{Total Probability} = \sum_{k=1}^{\infty} (1-p)^{k-1}\, p = p\left[1 + (1-p) + (1-p)^2 + (1-p)^3 + \cdots\right] = p\left[\frac{1}{1-(1-p)}\right] = 1$$
So the total probability already sums to one, with no further normalization needed. Figure 2.3 shows a geometric distribution for the case where p = 0.8.
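The normalization can also be checked numerically. The sketch below evaluates the geometric probabilities for p = 0.8 and accumulates partial sums, which equal 1 − (1 − p)^K after K terms and so approach one quickly. The function names `geometric_pmf` and `partial_sum` are hypothetical, introduced only for this illustration.

```python
def geometric_pmf(k, p):
    """P(X = k) = (1 - p)**(k - 1) * p: first event occurs on try k (k >= 1)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return (1 - p) ** (k - 1) * p


def partial_sum(p, k_max):
    """Sum of P(X = k) for k = 1..k_max; equals 1 - (1 - p)**k_max."""
    return sum(geometric_pmf(k, p) for k in range(1, k_max + 1))


p = 0.8
for k in range(1, 6):
    print(k, geometric_pmf(k, p))

# The partial sums approach 1 as more terms are included.
for k_max in (1, 5, 50):
    print(k_max, partial_sum(p, k_max))
```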
2.6.2.2 The Gaussian (aka Normal) Distribution (Emergent Via LLN Relation and Maxent)