for each of the three virus genomes? Using this as a distance measure, which two viruses are most closely related phylogenetically?
2.6 Repeat Exercise 2.5, but now use the symmetrized relative entropy between the trinucleotide probability distributions as the distance measure instead (reevaluated pairwise between the three viruses). Using this as a distance measure, which two viruses are most closely related phylogenetically?
2.7 Prove that relative entropy is always nonnegative (hint: use Jensen's Inequality from Section 2.4).
2.8 What is the Expectation for the two‐dice roll with the pair outcome probabilities listed in Exercise 2.3?
2.9 What is the Expectation for the two‐dice roll with fair dice? Is this expectation an actual outcome possibility? What does it mean if it is not?
2.10 Survey the literature and write a report on common occurrences of distributions of the type: uniform, geometric, exponential, Gaussian, log‐normal, heavy‐tail.
2.11 Survey the literature and write a report on common occurrences of series of the type: Martingale.
2.12 Consider the loaded die example, where the probability of rolling a 1, 2, 3, 4, or 5 is 0.1, and the probability of rolling a 6 is 0.5. What is the expectation for the loaded die? What is its variance? What is its mode? What is its median? The LLN for the loaded die above indicates that a sequence of rolls could be done, and if its average tends toward 4.5 you know it is loaded, while if it tends toward 3.5 you know it is fair. So it comes down to how soon you can resolve that it is converging on one of these two possible expectations, which differ by 1.0. Suppose someone is rolling a die that is either fair or loaded as described above; how many rolls do you think you will need to see before it is obvious how the average is trending? Is there a better way to spot the anomaly, e.g. is the frequency of seeing three sixes in a row notably skewed?
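The convergence question can be explored empirically before being treated analytically. Below is a minimal simulation sketch (illustrative code, not the text's sample code; the function name and roll counts are chosen here for convenience) that compares the average of simulated rolls for a fair die versus the loaded die above over increasing numbers of rolls.

import random

def average_of_rolls(face_probs, num_rolls, seed=None):
    """Roll a six-sided die with the given face probabilities; return the average."""
    rng = random.Random(seed)
    rolls = rng.choices([1, 2, 3, 4, 5, 6], weights=face_probs, k=num_rolls)
    return sum(rolls) / num_rolls

fair = [1/6] * 6
loaded = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

# Averages for both dice at increasing sample sizes (fixed seed for repeatability).
for n in (10, 100, 1000, 10000):
    print(n, round(average_of_rolls(fair, n, seed=1), 3),
          round(average_of_rolls(loaded, n, seed=1), 3))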
2.13 You have a genomic sequence of length L. (For DNA genomes, L is approximately 10**4 for viruses, 10**6 for bacteria, and 10**9 for mammals.) A typical analysis is to get counts on subsequences of length N within the full sequence of length L, where there are L − N + 1 such subsequences (obtained by sliding a window of width N across the length-L sequence and taking the window samples accordingly). The number of possible subsequences of length N grows exponentially with that length. For DNA subsequences of length six bases, the 6mers, with four base possibilities, {a, c, g, t}, there are thus 4**6 = 4096 possible 6mers. If the 6mers are equally probable, then in the approximately 10 000-base length of a virus each 6mer might be seen a couple of times (10 000/4096, to be precise), while a particular 6mer can be seen millions of times in mammalian genomes. Sounds fine so far, but now consider an analysis of 25mers. The possible 25mers number 4**25 = 2**50 = (2**10)**5 = 1024**5 ≈ 10**15, so about a million billion possibilities. It turns out that DNA information does not have subsequences with approximately equal statistical counts (equal probabilities) but, instead, is highly structured with a variety of overlapping encoding schemes, so it has subsequences with very unequal statistics. The vast majority of the 25mer subsequences, in fact, will have zero counts, such that enumeration of the possibilities ahead of time in an array data structure is not useful, or even possible in some cases, which then leads to associative arrays in this context, as shown in the sample code. Do a 25mer analysis on a bacterial genome (obtained from GenBank, e.g. E. coli). What is the highest-count 25mer subsequence?
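A minimal sketch of the associative-array (dictionary) approach is given below. This is illustrative code, not the text's sample code; the file path and function name are hypothetical, and it assumes the genome has been downloaded from GenBank as a plain FASTA file.

def count_kmers(fasta_path, k=25):
    """Count k-mer occurrences in a FASTA file using a dictionary (associative array)."""
    seq_parts = []
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):      # skip FASTA header lines
                seq_parts.append(line.strip().lower())
    seq = "".join(seq_parts)

    counts = {}                               # associative array: kmer -> count
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

# Example usage (the filename is hypothetical):
# counts = count_kmers("e_coli_genome.fasta", k=25)
# best = max(counts, key=counts.get)
# print(best, counts[best])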
2.14 A rare genetic disease has probability P(disease) = 0.0000001. Suppose you have a test for this condition with sensitivity that is perfect (SN = 1.0) and with specificity that is 99.99% correct (SP = 0.9999), i.e. false positives occur with probability 0.0001. Is this test useful in practice?
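To check a hand calculation, the posterior P(disease | positive test) can be computed directly from Bayes' rule using the numbers given above. The following is a minimal sketch (illustrative code, not the text's sample code).

def posterior_given_positive(prior, sensitivity, specificity):
    """P(disease | positive) via Bayes' rule from prior, sensitivity, and specificity."""
    p_positive = sensitivity * prior + (1.0 - specificity) * (1.0 - prior)
    return (sensitivity * prior) / p_positive

print(posterior_given_positive(prior=1e-7, sensitivity=1.0, specificity=0.9999))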
2.15 Prove P(X,Y|Z) = P(X|Z) P(Y|X,Z).
2.16 Prove the Bonferroni inequality: P(X,Y) ≥ P(X) + P(Y) − 1.
2.17 Suppose you are on a game show (with Monty Hall), and you are given the choice of three doors: behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what is behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to change your pick to door No. 2?” Is it to your advantage to switch your choice? Prove by tabulation of the possibilities, and then prove using Bayes' Rule with an appropriate choice of variables.
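The result can also be checked empirically. Below is a minimal Monte Carlo sketch (illustrative code, not the tabulation or Bayes' Rule proof asked for, and not the text's sample code) that simulates many games and compares the win rates for staying versus switching.

import random

def monty_hall_trials(num_trials=100000, seed=0):
    rng = random.Random(seed)
    stay_wins = 0
    switch_wins = 0
    for _ in range(num_trials):
        car = rng.randrange(3)        # door hiding the car
        pick = rng.randrange(3)       # contestant's initial pick
        # Host opens a door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        # The switch option is the remaining unopened, unpicked door.
        switched = next(d for d in range(3) if d != pick and d != opened)
        stay_wins += (pick == car)
        switch_wins += (switched == car)
    return stay_wins / num_trials, switch_wins / num_trials

print(monty_hall_trials())            # (stay win rate, switch win rate)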
2.18 Prove E(Y/X) ≥ E(Y)/E(X).
3 Information Entropy and Statistical Measures
In this chapter, we start with a description of information entropy and statistical measures (Section 3.1). Using these measures, we then examine “raw” genomic data. No biology or biochemistry knowledge is needed for this analysis, and yet we almost trivially rediscover a three‐element encoding scheme that is famous in biology, known as the codon. Analysis of information encoding in the four‐element {a, c, g, t} genomic sequence alphabet is about as simple as you can get (without working with binary data), so it provides some of the introductory examples that are implemented. A few (simple) statistical queries to get the details of the codon encoding scheme are then straightforward (Section 3.2). Once the encoding scheme is known to exist, further structure is revealed via the anomalous placement of “stop” codons, e.g. anomalously large open reading frames (ORFs) are discovered. A few more (simple) statistical queries from there, and the relation of ORFs to gene structure is revealed (Section 3.3). Once you have a clear structure in the sequential data that can be referenced positionally, it is then possible to gather statistical information for a Markov model. One example of this is to look at the positional base statistics at various positions “upstream” from the start codon. We thereby identify binding sites for critical molecular interactions in both transcription and translation. Since Markov models are needed for the analysis of sequential processes in general, and for what is discussed in later chapters (Chapters 6 and 7 in particular), a review of Markov models and some of their specializations is given in Section 3.4 (Chapters 6 and 7 cover hidden Markov models, or HMMs).
Numerous prior book, journal, and patent publications by the author are drawn upon throughout the text [1–68]. Almost all of the journal publications are open access. These publications can typically be found online at either the author’s personal website (www.meta‐logos.com) or with one of the following online publishers: www.m‐hikari.com or bmcbioinformatics.biomedcentral.com.
3.1 Shannon Entropy, Relative Entropy, Maxent, Mutual Information
If you have a discrete probability distribution P, with individual components p_k, then the rules for probabilities require that the sum of the probabilities of the individual outcomes must be 1 (as mentioned in Chapter 2). This is written in math shorthand as:

∑_k p_k = 1
Furthermore, the individual outcome probabilities must always be nonnegative and, by some conventions, nonzero. In the case of hexamers, there are 4096 types, thus the index variable “k” ranges from 1 to 4**6 = 4096. If we introduce a second discrete probability distribution Q, with individual components q_k, those components sum to 1 as well:

∑_k q_k = 1
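As a concrete illustration of this normalization constraint, the following minimal sketch (with hypothetical variable names and toy counts, not data or code from the text) converts a dictionary of hexamer counts into a probability distribution and verifies that the components sum to 1.

def normalize(counts):
    """Convert raw counts to a probability distribution (values summing to 1)."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

hexamer_counts = {"acgtac": 12, "ttagca": 7, "gggccc": 1}    # toy example
P = normalize(hexamer_counts)
print(sum(P.values()))    # should print 1.0 (up to floating-point rounding)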