Stephen Winters-Hilt

Informatics and Machine Learning



is skewed, we consider some other codon to evaluate, such as for the “aaa” gap, where aaa is most common. The aaa gaps, shown in Table 3.2, tend to be much smaller, with a standard exponential distribution fall‐off indicative of no long‐range encoding linkages:

Schematic illustration of ORF encoding structure, revealed in the V. cholerae genome by gaps between stop codons in the genomic sequence.

      Once codon grouping is revealed, a frequency analysis of the codons shows that the stop codons (TAA, TAG, TGA) are rare. Focusing on the stop codons, it is easily found that the gaps between stop codons can be quite anomalous compared to the gaps between other codons (see prog2.py addendum 6).
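The gap comparison described here can be sketched in a few lines of Python. This is a minimal illustration, not the book's prog2.py: the function names `codon_gaps` and `gap_histogram` are hypothetical, and it assumes that gaps are measured in codons between successive occurrences within a single fixed reading frame.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def codon_gaps(seq, targets, frame=0):
    """Gaps (in codons) between successive in-frame occurrences of any
    codon in `targets`. Assumption: one fixed frame, forward strand only."""
    seq = seq.upper()
    gaps = []
    last = None  # position of the previous target codon, if any
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in targets:
            if last is not None:
                gaps.append((i - last) // 3)
            last = i
    return gaps

def gap_histogram(seq, targets, frame=0):
    """Histogram mapping gap length to number of occurrences."""
    return Counter(codon_gaps(seq, targets, frame))

if __name__ == "__main__":
    toy = "TAAGGGTAGCCCAAATGA"  # toy sequence, not real genomic data
    print(codon_gaps(toy, STOP_CODONS))
    print(gap_histogram(toy, {"AAA"}))
```

On a real genome, one would compare `gap_histogram(genome, STOP_CODONS)` against `gap_histogram(genome, {"AAA"})`: a long tail in the stop‐codon histogram, set against an exponential fall‐off for the other codon, is the anomaly the text refers to.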

      ORFs are “open reading frames,” where “open” refers to not encountering a stop codon when traversing the genome with a particular codon framing; i.e. ORFs are regions devoid of stop codons when traversed with the codon framing choice of the ORF. In most of the analysis, “ORF” refers to ORFs of length 300 bases or greater. The restriction to larger ORFs is due to their highly anomalous occurrence and likely biological encoding origin (see Figure 3.2); i.e. long ORFs give a strong indication of containing the coding region of a gene. Restricting to transcripts with ORFs >= 300 bases in length yields a pool of transcripts that are mostly true coding transcripts.
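The ORF definition above translates directly into a scan for stop‐free stretches in each codon framing. The following is a minimal sketch under stated assumptions: the function name `find_orfs` is hypothetical, only the three forward‐strand frames are scanned (the reverse complement would be handled the same way), and ORF length is measured in bases between stop codons, excluding the stops themselves.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=300):
    """Return (start, end, frame) tuples for stop-free stretches of at
    least `min_len` bases; `end` is exclusive. Forward strand only."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = frame  # first base of the current stop-free stretch
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if i - start >= min_len:
                    orfs.append((start, i, frame))
                start = i + 3  # resume just after the stop codon
        # Stretch running to the last complete codon of this frame
        end = frame + 3 * ((len(seq) - frame) // 3)
        if end - start >= min_len:
            orfs.append((start, end, frame))
    return orfs

if __name__ == "__main__":
    toy = "TAA" + "GCA" * 100 + "TGA" + "GCA" * 10  # toy sequence
    for orf in find_orfs(toy):
        print(orf)
```

The default `min_len=300` mirrors the length cutoff used in the text; lowering it admits many more short, likely spurious, stop‐free stretches.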

      Not surprisingly, longer genes stand out clearly in this process: their anomalous, clearly nonrandom DNA sequence is being maintained rather than randomized by mutation, since such randomization would be selected against (the survival of the organism depends on the gene revealed).

      The preceding basic analysis can provide a gene‐finder for prokaryotic genomes in the form of a one‐page Python script that performs with 90–99% accuracy, depending on the prokaryotic genome. A second page of Python code that introduces a “filter,” along the lines of the bootstrap learning process mentioned above, leads to an ab initio prokaryotic gene‐predictor with 98.0–99.9% accuracy. Python code to accomplish this is shown in what follows (Chapter 4). In this process, all that is used is the raw genomic data (with its highly structured intrinsic statistics) and methods for identifying statistical anomalies and informatics structural anomalies: (i) anomalously high mutual information is identified (revealing codon structure); (ii) anomalously high (or low) statistics on an attribute or event are then identified (low stop codon counts, lengthy stop codon voids); (iii) anomalously frequent sub‐sequences (binding site motifs) are then found in the neighborhood of the identified ORFs (used in the filter).
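As an illustration of the last step, one simple way to look for anomalously frequent sub‐sequences near ORFs is to compare k‐mer frequencies just upstream of the ORF starts with their genome‐wide frequencies. This is only a sketch, not the filter developed in Chapter 4: the function names, the choice of k = 6, the 30‐base upstream window, and the simple frequency‐ratio score are all assumptions made for the example.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all overlapping k-mers in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def upstream_enrichment(genome, orf_starts, k=6, window=30):
    """Rank k-mers by (frequency upstream of ORF starts) divided by
    (genome-wide frequency). k, window, and the score are assumptions."""
    genome = genome.upper()
    background = kmer_counts(genome, k)
    bg_total = sum(background.values())
    upstream = Counter()
    for start in orf_starts:
        region = genome[max(0, start - window):start]
        upstream.update(kmer_counts(region, k))
    up_total = sum(upstream.values())
    if up_total == 0 or bg_total == 0:
        return []
    # Every upstream k-mer also occurs in the genome, so background > 0
    scores = {
        kmer: (n / up_total) / (background[kmer] / bg_total)
        for kmer, n in upstream.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Toy genome: the same 6-mer sits just upstream of two start sites
    toy = "C" * 10 + "AGGAGG" + "ATG" + "C" * 9 + "AGGAGG" + "ATG" + "C" * 12
    print(upstream_enrichment(toy, orf_starts=[16, 34], k=6, window=6)[:3])
```

K‐mers with a ratio well above 1 are candidates for binding site motifs; ORFs lacking any such motif nearby could then be down‐weighted or rejected by a filter.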

      3.3.1 Ab initio Learning with smORF’s, Holistic Modeling, and Bootstrap Learning