further away. After a certain point, however, the mutual information no longer falls off, instead cycling back to a certain level of mutual information with a cycle period of three bases. This suggests that a long‐range three‐element encoding scheme might exist (among other things), which can easily be tested. In doing so we ask Nature “the right question” and the answer is the rediscovery of the codon encoding scheme, as will be shown in what follows.
So, to clarify before proceeding, suppose we want information on a three‐element encoding scheme for the Escherichia coli genome (Chromosome 1), stored, say, in the file EC_Chr1.fasta.txt. We therefore want an order = 3 oligo count, but taken on 3‐element windows that “step” across the genome, i.e. a stepping window for sampling rather than a sliding window. This gives three choices of stepping, or framing, according to where the first step is taken:
case 0: agttagcgcgt ‐‐> (agt)(tag)(cgc)gt
case 1: agttagcgcgt ‐‐> a(gtt)(agc)(gcg)t
case 2: agttagcgcgt ‐‐> ag(tta)(gcg)(cgt)
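The three framings above can be reproduced with a short sketch. The helper name `codons_in_frame` is hypothetical and not part of prog2.py; it assumes the sequence is a plain string of bases, steps through it three bases at a time from the chosen offset, and drops any incomplete trailing window.

```python
def codons_in_frame(seq, frame):
    """Return the complete 3-base windows of seq, stepping by 3 from offset `frame`.

    Hypothetical helper for illustration only; assumes seq holds only base letters.
    """
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


example = "agttagcgcgt"
for frame in range(3):
    # prints e.g. "case 0: agt tag cgc"
    print("case %d: %s" % (frame, " ".join(codons_in_frame(example, frame))))
```

The leftover bases at either end (for example the trailing "gt" in case 0) are simply not counted.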
In the code that follows we get codon counts for a particular frame‐pass (prog2.py addendum 4):
------------------- prog2.py addendum 4 ---------------------
# We suspect the existence of a three-element coding scheme, the codon,
# so we need (anomalous) statistics on codons.
# 'frame' specifies the frame pass as case 0, 1, or 2 in the text.
# Getting codon counts for a specified framing and specified sequence
# is shown in two ways: one built from re-use of code blocks,
# one from re-use of an entire subroutine.
# Assumes re, numpy (as np), count_to_freq, and EC_sequence are already
# available from earlier in prog2.py.

def codon_counter(seq, frame):
    codon_counts = {}
    pattern = '[acgtACGT]'
    result = re.findall(pattern, seq)
    seqlen = len(result)  # length after filtering to valid bases
    # step in-frame, three bases at a time, starting at offset 'frame'
    for index in range(frame, seqlen - 2, 3):
        codon = result[index] + result[index + 1] + result[index + 2]
        if codon in codon_counts:
            codon_counts[codon] += 1
        else:
            codon_counts[codon] = 1
    counts = np.empty((0))
    for i in sorted(codon_counts):
        counts = np.append(counts, codon_counts[i] + 0.0)
        print("codon %s count = %d" % (i, codon_counts[i]))
    probs = count_to_freq(counts)
    return probs

codon_counter(EC_sequence, 0)

# Could also get codon counts from shannon_order with a modification to
# the step (and with order=3 for codons):
def shannon_codon(seq, frame):
    order = 3
    stats = {}
    pattern = '[acgtACGT]'
    result = re.findall(pattern, seq)
    seqlen = len(result)  # length after filtering to valid bases
    for index in range(order - 1 + frame, seqlen):
        # 'index' is the last base of the window; keep in-frame windows only
        if index % 3 != (order - 1 + frame) % 3:
            continue
        xmer = ""
        for xmeri in range(0, order):
            xmer += result[index - (order - 1) + xmeri]
        if xmer in stats:
            stats[xmer] += 1
        else:
            stats[xmer] = 1
    counts = np.empty((0))
    for i in sorted(stats):
        print("%d %s" % (stats[i], i))
        counts = np.append(counts, stats[i] + 0.0)
    probs = count_to_freq(counts)
    return probs

shannon_codon(EC_sequence, 0)
---------------- prog2.py addendum 4 end -------------------
On running prog2.py addendum 4 we find that the codon “tag” has much lower counts than most other codons, and similarly for the codon “cta”:
frame 0 has tag with 8970 and cta with 8916
frame 1 has tag with 9407 and cta with 8821
frame 2 has tag with 8877 and cta with 9033
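One way to surface such low‐count codons without reading the full listing is to rank the counts for each frame. The following is a minimal sketch, not the prog2.py code: the names `frame_codon_counts` and `rarest_codons` are hypothetical, the input is assumed to contain only base letters, and (unlike the addendum) the sequence is case‐folded to lowercase before counting. The toy string is made up for illustration and is not genome data.

```python
from collections import Counter


def frame_codon_counts(seq, frame):
    """Count complete in-frame codons of seq (case-folded to lowercase)."""
    seq = seq.lower()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


def rarest_codons(seq, frame, n=5):
    """Return the n lowest-count codons for one frame as (codon, count) pairs.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    counts = frame_codon_counts(seq, frame)
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]


toy = "atgaaatagaaaatgaaa"  # toy sequence for illustration, not genome data
for frame in range(3):
    print("frame %d: %s" % (frame, rarest_codons(toy, frame, n=3)))
```

Applied to the full genome string, the same ranking for each of the three frames would be expected to place the rare codons noted in the text near the top of the list.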
The tag and cta trinucleotides happen to be related: they are reverse complements of each other (the first hint of information encoding via duplex deoxyribonucleic acid (DNA) with Watson–Crick base‐pairing). There are two other notably rare codons, taa and tga (and, in this all‐frame genome‐wide study, their reverse complements as well).
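The reverse‐complement relation can be checked directly. This is a small sketch under the usual Watson–Crick pairing (a with t, c with g); the function name `reverse_complement` is chosen here for illustration and is not taken from prog2.py.

```python
# Watson-Crick base pairing for lowercase DNA letters
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_complement(seq):
    """Complement each base and reverse the order, as read off the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.lower()))


for codon in ("tag", "taa", "tga"):
    print("%s <-> %s" % (codon, reverse_complement(codon)))
# prints:
# tag <-> cta
# taa <-> tta
# tga <-> tca
```

A codon read in one frame on one strand therefore appears as its reverse complement on the other strand, which is why the two counts track each other in a genome‐wide tally.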
Now that we have identified an interesting feature, such as “tag,” it is reasonable to ask about this feature’s placement across the genome. Having done that, the follow‐up is to identify any anomalously recurring feature proximate to the feature of interest. Such an analysis needs a generic subroutine for getting counts, on substrings of an indicated order, relative to an indicated reference in the genome sequence data, and that is provided next as addendum 5 to prog2.py.
-------------------- prog2.py addendum 5 --------------------
# We see that 'tag' is anomalous, and want to get a sense of the
# distribution of gaps between 'tag' occurrences (still stepping 'in-frame').
# Assumes re, numpy (as np), count_to_freq, and EC_sequence are already
# available from earlier in prog2.py.

def codon_gap_counter(seq, frame, delimiter):
    counts = {}
    pattern = '[acgtACGT]'
    result = re.findall(pattern, seq)
    seqlen = len(result)  # length after filtering to valid bases
    quant = 100           # bin size for gap lengths
    oldindex = None       # position of the previous in-frame match, if any
    for index in range(frame, seqlen - 2, 3):
        codon = result[index] + result[index + 1] + result[index + 2]
        if codon != delimiter:
            continue
        if oldindex is not None:
            gap = index - oldindex
            gap_bin = gap // quant  # integer division: bin 0 is 0-99, etc.
            if gap_bin in counts:
                counts[gap_bin] += 1
            else:
                counts[gap_bin] = 1
        oldindex = index
    npcounts = np.empty((0))
    for i in sorted(counts):
        npcounts = np.append(npcounts, counts[i] + 0.0)
        print("gapbin %d count = %d" % (i, counts[i]))
    probs = count_to_freq(npcounts)
    return probs

# usage:
# (the delimiter is compared case-sensitively, so it must match the case
# used in the sequence)
delimiters = ("AAA","AAC","AAG","AAT","ACA","ACC","ACG","ACT",
              "AGA","AGC","AGG","AGT","ATA","ATC","ATG","ATT",
              "CAA","CAC","CAG","CAT","CCA","CCC","CCG","CCT",
              "CGA","CGC","CGG","CGT","CTA","CTC","CTG","CTT",
              "GAA","GAC","GAG","GAT","GCA","GCC","GCG","GCT",
              "GGA","GGC","GGG","GGT","GTA","GTC","GTG","GTT",
              "TAA","TAC","TAG","TAT","TCA","TCC","TCG","TCT",
              "TGA","TGC","TGG","TGT","TTA","TTC","TTG","TTT")
for delimiter in delimiters:
    print("\n\ndelimiter is %s" % delimiter)
    codon_gap_counter(EC_sequence, 0, delimiter)
----------------- prog2.py addendum 5 end ------------------
Upon running the above code with the codon delimiter set to “tag,” we arrive at Table 3.1, which shows the distribution of (tag) gap sizes. The bin size is 100, so gap bin 0 holds the count of all gaps of size 1 to 99, bin 1 holds the count of gaps in the range 100–199, and so on. (Because matches are taken in‐frame, every gap is a multiple of 3.)
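The binning rule can be illustrated on a handful of made‐up match positions. This is a sketch only: `bin_gaps` is a hypothetical helper, and the positions below are invented for the example rather than taken from the genome.

```python
def bin_gaps(positions, bin_size=100):
    """Histogram the gaps between consecutive positions into bins of width bin_size.

    Returns a dict mapping bin number (gap // bin_size) to the number of gaps in it.
    """
    counts = {}
    for prev, cur in zip(positions, positions[1:]):
        gap_bin = (cur - prev) // bin_size
        counts[gap_bin] = counts.get(gap_bin, 0) + 1
    return counts


# made-up in-frame match positions (all multiples of 3), for illustration only
positions = [0, 30, 129, 330, 333, 600]
print(sorted(bin_gaps(positions).items()))
# gaps are 30, 99, 201, 3, 267, so this prints [(0, 3), (2, 2)]
```

Three of the five gaps are shorter than 100 and land in bin 0; the other two fall between 200 and 299 and land in bin 2, while bin 1 stays empty and is simply absent from the output.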
Table 3.1 (tag) Gap sizes, with bin size 100.
Gap bin | Count |
---|---|
0 | 2115 |
1 | 1428 |
2 | 1066 |
3 | 829 |
4 | 696 |
5 | 484 |
6 | 399 |
7 | 293 |
8 | 241 |
9 | 222 |
Table 3.2 (aaa) Gap sizes, with bin size 100.
Gap bin | Count |
---|---|
0 | 21256 |
1 | 7843 |
2 | 3375 |
3 | 1665 |
4 | 827 |
5 | 480 |
6 | 287 |
7 | 163 |
8 | 86 |
9 | 70 |