Stephen Winters-Hilt

Informatics and Machine Learning


Скачать книгу

left-bracket upper P left-parenthesis x 0 semicolon x Subscript negative 1 Baseline semicolon x Subscript negative 2 Baseline semicolon x Subscript negative 5 Baseline right-parenthesis double-vertical-bar upper P left-parenthesis x 0 semicolon x Subscript negative 1 Baseline semicolon x Subscript negative 2 Baseline right-parenthesis upper P left-parenthesis x Subscript negative 5 Baseline right-parenthesis right-bracket greater-than upper D left-bracket upper P left-parenthesis x 0 semicolon x Subscript negative 1 Baseline semicolon x Subscript negative 2 Baseline semicolon x Subscript negative 3 Baseline right-parenthesis double-vertical-bar upper P left-parenthesis x 0 semicolon x Subscript negative 1 Baseline semicolon x Subscript negative 2 Baseline right-parenthesis upper P left-parenthesis x Subscript negative 3 Baseline right-parenthesis right-bracket period"/>

      1 3.1 In Section 3.1, the Maximum Entropy Principle is introduced. Using the Lagrangian formalism, find a solution that maximizes on Shannon entropy subject to the constraint of the “probabilities” sum to one.

      2 3.2 Repeat the Lagrangian optimization of (Exercise 3.1) subject to the added constraint that there is a mean value, E(X) = μ.

      3 3.3 Repeat the Lagrangian optimization of (Exercise 3.2) subject to the added constraint that there is a variance value, Var(X) = E(X2)(E(X))2 = σ2.

      4 3.4 Using the two‐die roll probabilities from (Exercise 2.3) compute the mutual information between the two die using the relative entropy form of the definition. Compare to the pure Shannon definition: MI(X, Y) = H(X) + H(Y) – H(X, Y).

      5 3.5 Go to genbank (https://www.ncbi.nlm.nih.gov/genbank) and select the genomes of three medius‐sized bacteria (~1 Mb), where two bacteria are closely related. Using the Python code shown in Section 2.1, determine their hexamer frequencies (as in Exercise 2.5 with virus genomes). What is the Shannon entropy of the hexamer frequencies for each of the three bacterial genomes? Consider the following three ways to evaluate distances between the genome hexamer‐frequency profiles (denoted Freq(genome1), etc.), try each, and evaluate their performance at revealing the “known” (that two of the bacteria are closely related):distance = Shannon difference = | H(Freq(genome1))−H(Freq(genome2))|.distance = Euclidean distance = d(Freq(genome1),Freq(genome2)).distance = Symmetrized Relative Entropy= [D(Freq(genome1)||Freq(genome2))+D(Freq(genome2)||Freq(genome1))]/2Which distance measure provides the clearest identification of phylogenetic relationship? Typically it should be (iii).

      6 3.6 Exercise 3.5, if done repeatedly, will eventually reveal that the best distance measure (between distributions) is the symmetrized relative entropy (case (iii)). Notice that this means that when comparing two distributions we quantify their difference not by a difference on Shannon entropies, case (i). In other words, we choose:Difference(X, Y) = MI(X, Y) = H(X) + H(Y) – H(X, Y),Not Difference = ∣ H(X) – H(Y)∣.The latter case satisfies the metric properties, including triangle inequality, in order to be a “distance” measure, is this true for the mutual information difference as well?

      7 3.7 Go to genbank (https://www.ncbi.nlm.nih.gov/genbank) and select the genome of the K‐12 strain of E. coli. (The K‐12 strain was obtained from the stool sample of a diphtheria patient in Palo Alto, CA, in 1922, so that seems like a good one.) Reproduce the MI codon discovery described in Figure 3.1.

      8 3.8 Using the E. coli genome (the one described above) and using the codon counter code, get the frequency of occurrence of the 64 different codons genome‐wide (without even restricting to coding regions or to a particular “framing,” these are still unknowns, initially, in an ab initio analysis). This should reveal oddly low counts for what will turn out to be the “stop” codons.

      9 3.9 Using the code examples with stops used to mark boundaries, identify long ORF regions in the E. coli genome (the one described in (Exercise 3.7)). Produce an ORF length histogram like that shown in Figure 3.2.

      10 3.10 Create an overlap‐encoding topology scoring method (like that described for Figure 3.3), and use it to obtain topology histograms like those shown in Figure 3.3a and Figure 3.4. Do this for the following genomes: E. coli (K‐12 strain); V. cholera; and Deinococcus Radiodurans.

      11 3.11 In a highly trusted coding region, such as the top 10% of the longest ORFs in the E. coli genome analysis, perform a gap IMM analysis, which should approximately reproduce the result shown in Figure 3.5.

      12 3.12 Prove that relative entropy is always positive.

      Конец ознакомительного фрагмента.

      Текст предоставлен ООО «ЛитРес».

      Прочитайте эту книгу целиком, купив полную легальную версию на ЛитРес.

      Безопасно оплатить книгу можно банковской картой Visa, MasterCard, Maestro, со счета мобильного телефона, с платежного терминала, в салоне МТС или Связной, через PayPal, WebMoney, Яндекс.Деньги, QIWI Кошелек, бонусными картами или другим удобным Вам способом.

/9j/4AAQSkZJRgABAQEBLAEsAAD/7SNMUGhvdG9zaG9wIDMuMAA4QklNBAQAAAAAAA8cAVoAAxsl RxwCAAACAAAAOEJJTQQlAAAAAAAQzc/6fajHvgkFcHaurwXDTjhCSU0EOgAAAAAA9wAAABAAAAAB AAAAAAALcHJpbnRPdXRwdXQAAAAFAAAAAFBzdFNib29sAQAAAABJbnRlZW51bQAAAABJbnRlAAAA AENscm0AAAAPcHJpbnRTaXh0ZWVuQml0Ym9vbAAAAAALcHJpbnRlck5hbWVURVhUAAAACgBBAGQA bwBiAGUAIABQAEQARgAAAAAAD3ByaW50UHJvb2ZTZXR1cE9iamMAAAAMAFAAcgBvAG8AZgAgAFMA ZQB0AHUAcAAAAAAACnByb29mU2V0dXAAAAABAAAAAEJsdG5lbnVtAAAADGJ1aWx0aW5Qcm9vZgAA AAlwcm9vZkNNWUsAOEJJTQQ7AAAAAAItAAAAEAAAAAEAAAAAABJwcmludE91dHB1dE9wdGlvbnMA AAAXAAAAAENwdG5ib29sAAAAAABDbGJyYm9vbAAAAAAAUmdzTWJvb2wAAAAAAENybkNib29sAAAA AABDbnRDYm9vbAAAAAAATGJsc2Jvb2wAAAAAAE5ndHZib29sAAAAAABFbWxEYm9vbAAAAAAASW50 cmJvb2wAAAAAAEJja2dPYmpjAAAAAQAAAAAAAFJHQkMAAAADAAAAAFJkICBkb3ViQG/gAAAAAAAA AAAAR3JuIGRvdWJAb+AAAAAAAAAAAABCbCAgZG91YkBv4AAAAAAAAAAAAEJyZFRVbnRGI1JsdAAA AAAAAAAAAAAAAEJsZCBVbnRGI1JsdAAAAAAAAAAAAAAAAFJzbHRVbnRGI1B4bEBywAAAAAAAAAAA CnZlY3RvckRhdGFib29sAQAAAABQZ1BzZW51bQAAAABQZ1BzAAAAAFBnUEMAAAAATGVmdFVudEYj Umx0AAAAAAAAAAAAAAAAVG9wIFVudEYjUmx0AAAAAAAAAAAAAAAAU2NsIFVudEYjUHJjQFkAAAAA AAAAAAAQY3JvcFdoZW5QcmludGluZ2Jvb2wAAAAADmNyb3BSZWN0Qm90dG9tbG9uZwAAAAAAAAAM Y3JvcFJlY3RMZWZ0bG9uZwAAAAAAAAANY3JvcFJlY3RSaWdodGxvbmcAAAAAAAAAC2Nyb3BSZWN0 VG9wbG9uZwAAAAAAOEJJTQPtAAAAAAAQASwAAAABAAIBLAAAAAEAAjhCSU0EJgAAAAAADgAAAAAA AAAAAAA/gAAAOEJJTQQNAAAAAAAEAAAAWjhCSU0EGQAAAAAABAAAAB44QklNA/MAAAAAAAkAAAAA AAAAAAEAOEJJTScQAAAAAAAKAAEAAAAAAAAAAjhCSU0D9QAAAAAASAAvZmYAAQBsZmYABgAAAAAA AQAvZmYAAQChmZoABgAAAAAAAQAyAAAAAQBaAAAABgAAAAAAAQA1AAAAAQAtAAAABgAAAAAAAThC SU0D+AAAAAAAcAAA/////////////////////////////wPoAAAAAP////////////////////// //////8D6AAAAAD/////////////////////////////A+gAAAAA//////////////////////// /////wPoAAA4QklNBAgAAAAAABAAAAABAAACQAAAAkAAAAAAOEJJTQQeAAAAAAAEAAAAADhCSU0E GgAAAAADTwAAAAYAAAAAAAAAAAAACtcAAAdnAAAADQA5ADcAOAAxADEAMQA5ADcAMQA2ADcANAA3 AAAAAQAAAAAAAAAAAAAAAAAAAAAAAAABAAAAAAAAAAAAAAdnAAAK1wAAAAAAAAAAAAAAAAAAAAAB AAAAAAAAAAAAAAAAAAAAAAAAABAAAAABAAAAAAAAbnVsbAAAAAIAAAAGYm91bmRzT2JqYwAAAAEA AAAAAABSY3QxAAAABAAAAABUb3AgbG9uZwAAAAAAAAAATGVmdGxvbmcAAAAAAAAAAEJ0b21sb25n AAAK1wAAAABSZ2h0bG9uZwAAB2cAAAAGc2xpY2VzVmxMcwAAAAFPYmpjAAAAAQAAAAAABXNsaWNl AAAAEgAAAAdzbGljZUlEb