Stephen Winters-Hilt

Informatics and Machine Learning



3.4). When applied to the identified coding regions (most of the ORFs of length >500), six gIMMs were used: one for each of the three codon frames, in each of the two read senses (forward and backward). Rejecting the coding regions with poor gIMM scores improved performance, giving results slightly better than those of the early Glimmer gene‐prediction software [125], where an interpolated Markov model was used (but not generalized to permit gaps). More recent versions of Glimmer incorporate start‐codon modeling in order to strengthen predictions. One of the benefits of the gap‐interpolating generalization is that it permits regulatory motifs to be identified, particularly those sharing a common positional alignment with the start‐of‐coding region. Using the bootstrap‐identified genes from the smORF‐based gene prediction (including mis‐calls) as a training set permitted an unsupervised search for upstream regulatory structure. The classic Shine‐Dalgarno sequence (the ribosome binding site) was found to be the strongest signal in the 30‐base window upstream from the start codon. Similar results will be found with the full gene‐finder example in Chapter 4.
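The idea of looking for signals that share a positional alignment with the start of coding can be illustrated with a much simpler stand-in than the gIMM: a plain tally of k-mers by their offset from the start codon. The sketch below is only that, a positional k-mer count under stated assumptions; the function name `upstream_positional_kmers`, its parameters, and the toy sequences are hypothetical and are not taken from the text.

```python
from collections import Counter

def upstream_positional_kmers(genome, starts, window=30, k=6):
    """Tally (offset, k-mer) pairs in the window upstream of each start codon.

    Hypothetical helper for illustration only; this is simple counting,
    not the gap-interpolated Markov model described in the text.
    `starts` holds 0-based indices of the first base of each start codon.
    Offsets are negative: -10 means the k-mer begins 10 bases before the start.
    """
    counts = Counter()
    for start in starts:
        if start < window:
            continue  # not enough upstream sequence for a full window
        upstream = genome[start - window:start]
        for i in range(window - k + 1):
            counts[(i - window, upstream[i:i + k])] += 1
    return counts

# Toy data (made up): two "genes" whose upstream regions share only AGGAGG,
# placed 10 bases before each start codon.
up1 = "ACGT" * 5 + "AGGAGG" + "TTTT"
up2 = "TTGCA" * 4 + "AGGAGG" + "CACA"
genome = up1 + "ATGAAATAA" + up2 + "ATGCCCTAA"
counts = upstream_positional_kmers(genome, starts=[30, 69])
print(counts.most_common(1))
```

Keying the tally on the offset as well as the k-mer is the design choice that matters here: a motif that recurs at a fixed distance from the start codon accumulates counts in one bin, whereas a motif scattered at random positions does not.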

Figure caption: Topology‐index histograms for the Chlamydia trachomatis genome (a) and the Deinococcus radiodurans genome (b). C. trachomatis, like V. cholerae, shows very little overlapping gene structure. D. radiodurans, on the other hand, is dominated by genes that overlap other genes.

      Ab initio gene‐finding can identify the stop codons and, thus, the (standard) ORFs. A generalization to codon‐void regions, with all six frame passes, also leads to recognition of different, overlapping, potential gene regions (and then doubled given the two orientations). Genome‐topology scoring, as shown in Figure 3.3, can clearly show differences between bacteria (Figure 3.4), and is thus a possible “fingerprinting” tool.
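A minimal sketch of the stop-codon step may help make this concrete. It scans all three frames on both strands and reports each stop-free run of codons that is terminated by a stop codon. It assumes the standard stop codons (TAA, TAG, TGA) and an A/C/G/T alphabet; the function names, the `min_len` parameter, and the coordinate convention (positions on the strand being scanned) are illustrative assumptions, and the codon-void generalization and topology scoring are not attempted here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]

def six_frame_orfs(seq, min_len=0):
    """Yield (strand, frame, start, end) for stop-terminated, stop-free codon runs.

    Illustrative helper (not from the text). `start` and `end` are 0-based,
    half-open coordinates on the scanned strand; for strand "-" they refer
    to the reverse complement. The terminating stop codon sits at end:end+3.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    # Report the run only if it is non-empty and long enough.
                    if i > run_start and i - run_start >= min_len:
                        yield (strand, frame, run_start, i)
                    run_start = i + 3

orfs = list(six_frame_orfs("ATGAAATAGCCC"))
print(orfs)
```

In practice `min_len` would be set to whatever length cutoff the analysis uses; runs that reach the end of the sequence without meeting a stop codon are deliberately not reported in this sketch.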

      Prokaryotic genome analysis is similar to both prokaryotic and eukaryotic transcriptome analysis (the eukaryotic case is similar because the introns have been removed). The analysis tools for prokaryotic genomes described thus far are therefore primarily what is needed for either prokaryotic or eukaryotic transcriptome analysis. Surprisingly, the same overlapping void topologies, with reverse overlap orientation (“duals”), are seen at the transcriptome level in eukaryotes as in prokaryotes. For eukaryotic transcripts, however, “dual” overlaps have special significance. Recall that, due to intron splicing in eukaryotes, a transcript that encodes overlapping read‐direction “duality” (with regulatory regions intact and lengthy ORF size, so highly likely to be functional) comes from only a single genome‐level pre‐messenger ribonucleic acid (pre‐mRNA). This is a very odd arrangement (artifact) for eukaryotes unless they evolved from an ancient prokaryote, as hypothesized in a number of theories, where such an overlap topology would already be in place to “imprint through.” The specific nature of this transcriptome artifact, however, is best explained via the viral eukaryogenesis hypothesis (see [1, 3]).

      Just as Chapter 2 finished with a math review, we do the same again here in the context of sequential processes. The core mathematical tool for describing a sequential process (where limited memory suffices) is the Markov chain, so that will be defined first. In the context of genome analysis, however, the standard Markov chain based feature extraction is no longer optimal (especially given the nature of the computational resources). Thus, a novel mathematical generalization of the Markov chain description, the interpolated Markov model, will be given as well. The gap/hash interpolated Markov model, in particular, can be used to “vacuum‐up” all motif information in specified regions. This could be used and directly integrated into an HMM‐based gene finder (Chapters 7 and 8), or, alternatively, provide identification of a typical motif