Stephen Winters-Hilt

Informatics and Machine Learning



7 allow for a generalized Viterbi Algorithm (see Figure 1.2) and a generalized Baum–Welch Algorithm. The generalized algorithms retain path probabilities as a sequence of likelihood ratios, which satisfy Martingale statistics under appropriate circumstances [102] and therefore have Martingale convergence properties, where convergence is associated with "learning" in this context. Thus, HMM learning proceeds via convergence to a limit state that provably exists, in a sense similar to that established by the Hoeffding inequality [59] via its proven extension to Martingales [108]. The Hoeffding inequality is a key part of the VC Theorem in ML, whereby the Perceptron learning process is proven to converge to a solution in an infinite solution space in a finite number of learning steps [109]. Further details on these Fundamental Theorems [102, 103, 108, 109] are summarized in Appendix C.
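The baseline recursion that the generalized Viterbi Algorithm extends is the standard Viterbi dynamic program over path log-probabilities. A minimal log-space sketch is given below; the two-state parameters are illustrative toy values (state names and probabilities are assumptions for the example, not taken from the text):

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most-likely state path for an observation sequence (log-space)."""
    # Initialization: start in each state and emit obs[0].
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    # Recursion: extend the best path into each state at each step.
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            best_prev, best_lp = max(
                ((p, V[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda x: x[1])
            V[t][s] = best_lp + log_emit[s][obs[t]]
            back[t][s] = best_prev
    # Termination and backtrace.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

lp = math.log
states = ("exon", "intron")  # hypothetical two-state toy model
log_start = {"exon": lp(0.5), "intron": lp(0.5)}
log_trans = {"exon": {"exon": lp(0.9), "intron": lp(0.1)},
             "intron": {"exon": lp(0.1), "intron": lp(0.9)}}
log_emit = {"exon": {"A": lp(0.2), "C": lp(0.3), "G": lp(0.3), "T": lp(0.2)},
            "intron": {"A": lp(0.3), "C": lp(0.2), "G": lp(0.2), "T": lp(0.3)}}

path, score = viterbi("GGCATT", states, log_start, log_trans, log_emit)
```

Working in log space avoids the numerical underflow that products of many small probabilities would otherwise cause on long sequences.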

      HMM tools have recently been developed with a number of computationally efficient improvements (described in detail in Chapter 7), where application of the HMM methods will be described for gene‐finding, alt‐splice gene‐finding, and nanopore‐detector signal analysis.

Schematic illustration of chunking on a dynamic table for a HMM using a simple join recovery.

      1.5.1 HMMs for Analysis of Information Encoding Molecules

      The main application areas for HMMs covered in this book are power signal analysis generally, and bioinformatics and cheminformatics specifically (the main reviews and applications discussed are from [128–134]). In bioinformatics the information encoding molecules are polymers, giving rise to a sequential data format, so HMMs are well suited to their analysis. To begin to understand bioinformatics, however, we need to know not only the biological encoding rules, largely rediscovered on the basis of their statistical anomalies in Chapter 14, but also the idiosyncratic structures seen (genomes and transcriptomes), which are full of evolutionary artifacts and similarities to evolutionary cousins. To know the nature of the statistical imprinting on the polymeric encodings also requires an understanding of the biochemical constraints that give rise to the statistical biases seen. Taken altogether, bioinformatics offers a lot of clarity on why Nature has settled on the particular genomic "mess," albeit with optimizations, that it has selectively arrived at. See [1, 3] for further discussion of bioinformatics.

      1.5.2 HMMs for Cheminformatics and Generic Signal Analysis

      HMM is a common intrinsic statistical sequence modeling method (implementations and applications in what follows are mainly drawn from [135–158]), so the question naturally arises: how can extrinsic "side‐information" be optimally incorporated into a HMM? One answer is to treat duration distribution information itself as side‐information, and a process is shown for incorporating such side‐information into a HMM. It is thereby demonstrated how to bootstrap from a HMM to a HMMD (more generally, a hidden semi‐Markov model, or HSMM, as it will be described in Chapter 7).
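One common way a duration distribution enters the decoding recursion is via an explicit-duration Viterbi variant, where each state occupies a whole segment scored by a duration term. The sketch below is a generic illustration under assumed toy parameters (the state names, duration table, and emission values are all hypothetical), not the book's specific HMMD construction from Chapter 7:

```python
import math

def hsmm_viterbi(obs, states, log_start, log_trans, log_dur, log_emit, max_dur):
    # Explicit-duration Viterbi: V[t][s] is the best log-score over
    # segmentations of obs[0..t] whose final segment is state s ending at t.
    T = len(obs)
    NEG = float("-inf")
    V = [{s: NEG for s in states} for _ in range(T)]
    back = [{s: None for s in states} for _ in range(T)]
    for t in range(T):
        for s in states:
            for d in range(1, min(max_dur, t + 1) + 1):
                # Emission score of the segment obs[t-d+1 .. t] under state s.
                seg = sum(log_emit[s][obs[u]] for u in range(t - d + 1, t + 1))
                if t - d < 0:
                    cand, prev = log_start[s] + log_dur[s][d] + seg, None
                else:
                    prev, best = None, NEG
                    for p in states:
                        if p == s:
                            continue  # consecutive segments change state
                        w = V[t - d][p] + log_trans[p][s]
                        if w > best:
                            prev, best = p, w
                    cand = best + log_dur[s][d] + seg
                if cand > V[t][s]:
                    V[t][s], back[t][s] = cand, (prev, d)
    # Backtrace the segment list as (state, duration) pairs.
    s = max(states, key=lambda x: V[T - 1][x])
    score, t, segs = V[T - 1][s], T - 1, []
    while t >= 0:
        prev, d = back[t][s]
        segs.append((s, d))
        t -= d
        s = prev
    segs.reverse()
    return segs, score

lp = math.log
states = ("exon", "intron")  # hypothetical two-state toy model
log_start = {s: lp(0.5) for s in states}
log_trans = {"exon": {"intron": lp(1.0)}, "intron": {"exon": lp(1.0)}}
log_dur = {s: {1: lp(0.2), 2: lp(0.3), 3: lp(0.5)} for s in states}
log_emit = {"exon": {"A": lp(0.2), "C": lp(0.3), "G": lp(0.3), "T": lp(0.2)},
            "intron": {"A": lp(0.3), "C": lp(0.2), "G": lp(0.2), "T": lp(0.3)}}

segs, score = hsmm_viterbi("GGCATT", states, log_start, log_trans,
                           log_dur, log_emit, max_dur=3)
```

In contrast to a standard HMM, whose self-transitions impose a geometric duration distribution, the `log_dur` table here can encode any empirical duration statistics, at the cost of an extra factor of `max_dur` in the recursion.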

Schematic illustration of edge feature enhancement via HMM/EM EVA filter.