Stephen Winters-Hilt

Informatics and Machine Learning



further away. After a certain point, however, the mutual information no longer falls off, instead cycling back to a certain level of mutual information with a cycle period of three bases. This suggests that a long‐range three‐element encoding scheme might exist (among other things), which can easily be tested. In doing so we ask Nature “the right question” and the answer is the rediscovery of the codon encoding scheme, as will be shown in what follows.
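The dependence of mutual information on base separation described above can be probed with a short calculation. The following is a minimal sketch, not the book's own code: the function name `mutual_information` and its arguments are hypothetical, the sequence is assumed to be a plain string of bases, and `gap` is assumed to be a positive integer.

```python
from collections import Counter
from math import log2

def mutual_information(seq, gap):
    """Mutual information (in bits) between the bases at positions i and i + gap."""
    # Joint counts of (base at i, base at i + gap) over the whole sequence.
    pairs = Counter(zip(seq, seq[gap:]))
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    # Marginal counts for the left and right members of each pair.
    left, right = Counter(), Counter()
    for (a, b), n in pairs.items():
        left[a] += n
        right[b] += n
    # Sum p(a,b) * log2( p(a,b) / (p(a) p(b)) ) over observed pairs.
    mi = 0.0
    for (a, b), n in pairs.items():
        mi += (n / total) * log2(n * total / (left[a] * right[b]))
    return mi
```

Evaluating this for gap = 1, 2, 3, ... on a genome string and tabulating the values is one way to look for the period-three pattern the text describes.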

      So, to clarify before proceeding, suppose we want information on a three‐element encoding scheme for the Escherichia coli genome (Chromosome 1), stored, say, in the file EC_Chr1.fasta.txt. We therefore want order = 3 oligo counts, but taken on 3‐element windows that “step” across the genome, i.e. a stepping window for sampling rather than a sliding window. This gives three choices of stepping, or framing, according to where the first step is taken:

       case 0: agttagcgcgt ‐‐> (agt)(tag)(cgc)gt

       case 1: agttagcgcgt ‐‐> a(gtt)(agc)(gcg)t

       case 2: agttagcgcgt ‐‐> ag(tta)(gcg)(cgt)

      In the code that follows we get codon counts for a particular frame‐pass (prog2.py addendum 4):

       frame 0 have tag with 8970 and cta with 8916

       frame 1 have tag with 9407 and cta with 8821

       frame 2 have tag with 8877 and cta with 9033
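The frame‐stepped counting behind these numbers can be sketched as follows. This is an illustrative sketch only, not the prog2.py addendum itself: the function name `codon_counts` is hypothetical, and the genome is assumed to have already been read into a single lower‐case string.

```python
from collections import Counter

def codon_counts(seq, frame):
    """Count non-overlapping 3-mers, stepping by 3 from offset `frame` (0, 1, or 2)."""
    counts = Counter()
    # Stop at len(seq) - 2 so that every window holds a full three bases.
    for i in range(frame, len(seq) - 2, 3):
        counts[seq[i:i + 3]] += 1
    return counts

# The three framings of the short example sequence used in the text.
example = "agttagcgcgt"
for frame in range(3):
    print(frame, sorted(codon_counts(example, frame)))
```

Applied to the full genome string, `codon_counts(genome, frame)["tag"]` would give the per‐frame count of tag, under the assumptions above.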

      The tag and cta trinucleotides happen to be related – they are reverse complements of each other (the first hint of information encoding via duplex deoxyribonucleic acid (DNA) with Watson–Crick base‐pairing). There are two other notably rare codons: taa and tga (and, in this all‐frame genome‐wide study, their reverse complements as well).
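The relationship between tag and cta can be checked directly. The helper below is a small sketch with a hypothetical name, `reverse_complement`, and assumes sequences contain only the letters a, c, g, t in either case.

```python
# Translation table for Watson-Crick pairing: a<->t, c<->g (both cases).
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]

print(reverse_complement("tag"))  # cta
```

The same call gives tta for taa and tca for tga.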

      Now that we have identified an interesting feature, such as “tag,” it is reasonable to ask about this feature’s placement across the genome. Having done that, the follow‐up is to identify any anomalously recurring feature proximate to the feature of interest. Such an analysis needs a generic subroutine for getting counts on substrings of an indicated order relative to an indicated reference in the genome sequence data, and that is provided next as addendum #5 to prog2.py.
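To make the idea concrete, one possible shape for such a subroutine is sketched here. It is not the book's addendum #5: the names `find_positions` and `count_substrings_near`, and the `offset` and `width` parameters, are assumptions introduced for illustration.

```python
from collections import Counter

def find_positions(seq, feature):
    """Start indices of every (possibly overlapping) occurrence of `feature`."""
    positions = []
    i = seq.find(feature)
    while i != -1:
        positions.append(i)
        i = seq.find(feature, i + 1)
    return positions

def count_substrings_near(seq, positions, order, offset, width):
    """Count substrings of length `order` inside the window that starts
    `offset` bases from each reference position and spans `width` bases."""
    counts = Counter()
    for p in positions:
        start = p + offset
        end = start + width
        if start < 0 or end > len(seq):
            continue  # skip windows that run off either end of the sequence
        for i in range(start, end - order + 1):
            counts[seq[i:i + order]] += 1
    return counts
```

For example, `count_substrings_near(genome, find_positions(genome, "tag"), order=3, offset=3, width=30)` would tally the 3‐mers in the 30 bases immediately downstream of each tag; substrings that recur far more often than elsewhere are candidates for the anomalous features mentioned above.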

Gap bin   Count
0         2115
1         1428
2         1066
3         829
4         696
5         484
6         399
7         293
8         241
9         222

Gap bin   Count
0         21256
1         7843
2         3375
3         1665
4         827
5         480
6         287
7         163
8         86
9         70