overall, raises sampling questions. The Norwalk virus has a genome that is only 7654 nucleotides long. As we sweep the six-base window over that string to extract all of the hexamer counts, we obtain only 7654 − 5 = 7649 hexamer samples! Even with a uniform distribution we would get barely two counts (7649/4096 ≈ 1.9) for each of the 4096 different hexamer types! Limitations due to sample size play a critical role in these types of analyses.
The Norwalk virus genome is actually smaller than the typical viral genome, which ranges between 10 000 and 100 000 bases in length. Prokaryotic genomes typically range between 1 and 10 million bases, while the human genome is approximately three billion bases in length (3.23 Gb per haploid genome, 6.46 Gb total diploid). To go forward with a “strong” statistical analysis, the key, as with any statistical analysis, is sample size, which in the present analysis is dictated by genome size. So, to have “good statistics” – meaning sufficient samples that the observed frequencies of outcomes provide good estimates of the underlying probabilities of those outcomes – we will apply the methods developed thus far to a bacterial genome in Chapter 3 (the classic model organism, E. coli). In that instance the genome is approximately four and a half million bases in length, so much better counts should result than with the 7654-base Norwalk virus genome.
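As a concrete illustration of these sample-size considerations, the following is a minimal Python sketch of the six-base sliding-window count described above. The stand-in sequence and the function name hexamer_counts are illustrative here, not taken from the text's own code; a real run would read the Norwalk virus or E. coli genome from a FASTA file instead.

    from collections import Counter

    def hexamer_counts(genome):
        """Slide a six-base window along the genome and tally each hexamer;
        a genome of length L yields L - 5 window samples."""
        k = 6
        return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

    # Stand-in sequence for illustration only (not real genome data).
    genome = "ACGTACGTGGCCAATT" * 500
    counts = hexamer_counts(genome)
    num_samples = len(genome) - 5
    print(num_samples, "hexamer samples;",
          len(counts), "of the 4096 possible hexamers observed;",
          "mean count per observed hexamer:", round(num_samples / len(counts), 1))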
2.2 Counting, the Enumeration Problem, and Statistics
In the example in the previous section we left off with counts on all 4096 hexamers seen in a given genome. If we go from counts on substrings of length 6 to substrings of length 30 we run into a problem – there are now 4³⁰ ≈ 10¹⁸ (a million million million) possible substrings to get counts on. No genome is even remotely this large, so most substring counts will necessarily be zero. Because of the huge number of possible substrings, this is often referred to as “the enumeration problem,” but since only the nonzero counts need be maintained, storage is bounded by genome size, and in practice there is no enumeration problem. The main mechanism for capturing count information on substrings without dedicated (array) memory is the use of associative memory constructs, such as the hash variable, and this technique is employed in the code examples.
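A minimal sketch of that hash-based bookkeeping, assuming a Python dictionary as the associative container (the function name kmer_counts and the stand-in sequence are illustrative placeholders):

    from collections import defaultdict

    def kmer_counts(genome, k=30):
        """Count length-k substrings in an associative (hash) container, so only
        the substrings that actually occur consume memory -- at most
        len(genome) - k + 1 entries, regardless of the 4**k possible strings."""
        counts = defaultdict(int)
        for i in range(len(genome) - k + 1):
            counts[genome[i:i + k]] += 1
        return counts

    genome = "ACGTACGTGGCCAATT" * 500              # stand-in sequence for illustration
    counts = kmer_counts(genome, k=30)
    print("possible 30-mers:", 4 ** 30)            # ~1.15e18: the 'enumeration problem'
    print("distinct 30-mers stored:", len(counts))  # bounded by genome length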
2.3 From Counts to Frequencies to Probabilities
The conventional relations on probabilities say nothing as to their interpretation. According to the Frequentist (frequency-based) interpretation, probabilities are defined in terms of fractions of a set of observations, as the number of observations tends to infinity (where the LLN works to advantage). In practice, infinitely many observations are never made, and often only one observation is made (predicting the winner of a marathon, for example). Even in the case of a single race, however, it seems intuitive that prior information would still be beneficial in predicting the winner. With the formal introduction of prior probabilities we arrive at the Bayesian interpretation. From the Bayesian perspective, prior probabilities can be encoded as “pseudocounts” in the frequentist framework (i.e. observation counts do not necessarily initialize from zero). In the computer implementations used here there are typically tuned/selected pseudocounts and minimum/maximum probability cutoffs; thus the implementations can be formally described on a Bayesian footing [1, 3].
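A minimal sketch of the count-to-probability step with pseudocounts and probability cutoffs, assuming a simple add-pseudocount estimator with clamping (the function name, pseudocount value, and cutoff values below are illustrative placeholders, not the tuned settings referred to above):

    def counts_to_probabilities(counts, outcomes, pseudocount=0.5,
                                min_prob=1e-6, max_prob=0.999):
        """Turn raw counts into probability estimates on a Bayesian footing:
        every outcome starts from a pseudocount rather than zero, and the
        resulting probabilities are clamped to [min_prob, max_prob]."""
        total = sum(counts.get(o, 0) + pseudocount for o in outcomes)
        return {o: min(max((counts.get(o, 0) + pseudocount) / total, min_prob), max_prob)
                for o in outcomes}

    # Unseen outcomes still receive a small nonzero probability via the pseudocount.
    observed = {"ACGTAC": 3, "CGTACG": 1}
    outcomes = ["ACGTAC", "CGTACG", "GTACGT"]   # illustrative subset of the 4096 hexamers
    print(counts_to_probabilities(observed, outcomes))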
Whenever you can list all the outcomes for some situation (like rolls of a six-sided die), it is natural to think of the “probabilities” of those outcomes, where it is also natural for the outcome probabilities to sum to one. So, with probability we assume there are “rules” (the probability assignments), and using those rules we make predictions about future outcomes. The rules form a mathematical framework; thus probability is a mathematical encapsulation of outcomes.
How did we get the “rules,” the probability assignments on outcomes? This is the realm of statistics, where we have a collection of data and we want to distill whatever rules we can, such as a complete set of (observed) outcomes and their assigned (estimated) probabilities. If the analysis to go from raw data to a probability model could somehow be done in one step, then it could be said that statistics is whatever takes you from raw data to a probability model, hopefully without dependency on a probability model. In practice, however, the statistical determination of a probability model suitable for a collection of data is like the identification of a physical law in mathematical form from raw data – it is math and a lot more, including an iterative creative/inventive process in which models are attempted, discarded, and built from existing models.
2.4 Identifying Emergent/Convergent Statistics and Anomalous Statistics
Expectation, E(X), of random variable (r.v.) X:
E(X) = Σi xi p(xi)
Let X be the total from rolling two six-sided dice: X = 2 can occur in only one way (rolling “snake eyes”), while X = 7 can occur in six ways, etc.; carrying out the sum gives E(X) = 7. For a single die, E(X) = 3.5. Notice that the value of the expectation need not be one of the possible outcomes (it is really hard to roll a 3.5).
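A quick check of both expectations, enumerating the equally likely die faces directly (a Python sketch; exact fractions are used only to keep the arithmetic transparent):

    from fractions import Fraction

    # E(X) = sum_i x_i p(x_i) for the total X of two fair six-sided dice,
    # enumerating all 36 equally likely (d1, d2) outcomes.
    two_dice = sum(Fraction(d1 + d2, 36) for d1 in range(1, 7) for d2 in range(1, 7))

    # Expectation of a single fair die.
    one_die = sum(Fraction(face, 6) for face in range(1, 7))

    print(two_dice)   # 7
    print(one_die)    # 7/2, i.e. 3.5 -- not itself a possible roll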
The expectation, E(g(X)), of a function g of r.v. X:
E(g(X)) = Σi g(xi) p(xi)
Consider the special case g(X) where g(xi) = −log(p(xi)):
E(−log p(X)) = −Σi p(xi) log(p(xi)),
which is Shannon Entropy for the discrete distribution p(xi). For Mutual Information, similarly, use g(X,Y) = log( p(xi, yj) / [p(xi) p(yj)] ):
MI(X;Y) = Σi,j p(xi, yj) log( p(xi, yj) / [p(xi) p(yj)] ),
provided p(xi), p(yj), and p(xi, yj) are all ∈ ℜ+. This is the Relative Entropy between the joint distribution and the product of marginals that would hold if the r.v.'s were independent: D( p(xi, yj) ‖ p(xi) p(yj) ).
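The following sketch computes these quantities numerically for small discrete distributions (log base 2 is an illustrative choice here, giving bits; the function names are placeholders):

    import math

    def shannon_entropy(p):
        """H(X) = -sum_i p(x_i) log2 p(x_i), skipping zero-probability outcomes."""
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def mutual_information(joint):
        """MI(X;Y) = sum_{i,j} p(x_i,y_j) log2[ p(x_i,y_j) / (p(x_i) p(y_j)) ],
        i.e. the Relative Entropy D( p(x,y) || p(x)p(y) )."""
        px = [sum(row) for row in joint]
        py = [sum(col) for col in zip(*joint)]
        return sum(pxy * math.log2(pxy / (px[i] * py[j]))
                   for i, row in enumerate(joint)
                   for j, pxy in enumerate(row) if pxy > 0)

    print(shannon_entropy([0.5, 0.5]))                       # 1.0 bit for a fair coin
    print(mutual_information([[0.4, 0.1], [0.1, 0.4]]))      # > 0: X and Y are dependent
    print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0: independent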
Jensen's Inequality:
Let φ(⋅) be a convex function on a convex subset χ of the real line, φ: χ → ℜ. Convexity by definition: φ(λ1x1 + … + λnxn) ≤ λ1