Stephen Winters-Hilt

Informatics and Machine Learning


functionalization the transducer molecule is drawn into the channel by an applied potential but is too big to translocate, instead becoming stuck in a bistable capture such that it modulates the channel’s ion‐flow with stationary statistics in a distinctive way. If the channel modulator is bifunctional in that one end is meant to be captured and modulated while the other end is linked to an aptamer or antibody for specific binding, then we have the basis for a remarkably sensitive and specific biosensing capability.

      In the NTD Nanoscope experiments [2], the molecular dynamics of a (single) captured non-translocating transducer molecule provide a unique stochastic reference signal with stable statistics on the observed, single-molecule blockaded channel current, somewhat analogous to a carrier signal in standard electrical engineering signal analysis. Discernible changes in blockade statistics, coupled with SSA signal processing protocols, provide the means for a highly detailed characterization of the interactions of the transducer molecule with binding targets (cognates) in the surrounding (extra-channel) environment.

      Thus, in Nanoscope applications of the SSA Protocol, due to the molecular dynamics of the captured transducer molecule, a unique reference signal with strongly stationary (or weakly, or approximately stationary) signal statistics is engineered to be generated during transducer blockade, analogous to a carrier signal in standard electrical engineering signal analysis. In these applications a signal is deemed “strongly” stationary if the EM/EVA projection (HMM method from Chapter 6) on the entire dataset of interest produces a discrete set of separable (non‐fuzzy domain) states. A signal is deemed “weakly” stationary if the EM/EVA projection can only produce a discrete set of states on subsegments (windowed sections) of the data sequence, but where state‐tracking is possible across windows (i.e. the non‐stationarity is sufficiently slow to track states – similar to the adiabatic criterion in statistical mechanics). A signal is approximately stationary, in a general sense, if it is sufficiently stationary to still benefit, to some extent, from the HMM‐based signal processing tools (that assume stationarity).
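      As a rough illustration of the windowed state-tracking idea (a stand-in sketch only, not the EM/EVA projection developed in Chapter 6), the following Python fragment fits a small Gaussian HMM to successive windows of a blockade trace and checks whether the recovered state levels are separable and drift slowly from window to window, in the spirit of the adiabatic criterion above. The hmmlearn package, the window size, and the drift threshold are illustrative assumptions.

```python
# Sketch of windowed state-tracking as a weak-stationarity check
# (illustrative stand-in for the EM/EVA projection of Chapter 6).
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed available; any HMM/EM fit would do

def window_state_levels(signal, window=5000, n_states=2):
    """Fit a small HMM per window; return sorted state means for each window."""
    levels = []
    for start in range(0, len(signal) - window + 1, window):
        seg = signal[start:start + window].reshape(-1, 1)
        hmm = GaussianHMM(n_components=n_states, covariance_type="diag",
                          n_iter=50, random_state=0)
        hmm.fit(seg)
        levels.append(np.sort(hmm.means_.ravel()))
    return np.array(levels)          # shape: (n_windows, n_states)

def trackable(levels, max_drift=0.05):
    """Adiabatic criterion: state levels change slowly from window to window."""
    drift = np.abs(np.diff(levels, axis=0))
    return bool(np.all(drift < max_drift))

# Toy blockade-like trace: two current levels plus noise.
rng = np.random.default_rng(0)
states = rng.integers(0, 2, 50000)
toy = np.where(states == 0, 0.3, 0.6) + 0.02 * rng.standard_normal(50000)
lv = window_state_levels(toy)
print("per-window state levels:\n", lv[:3])
print("weakly stationary (trackable)?", trackable(lv))
```

      If the per-window fits give well-separated levels over the entire trace, the signal would be treated as strongly stationary in the sense above; if the levels are only separable window-by-window but can still be tracked across windows, it would be treated as weakly stationary.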

      The adaptive SSA ML algorithms, for real-time analysis of the stochastic signal generated by the transducer molecule, can readily offer a “lock and key” level of signal discrimination. The heart of the signal processing algorithm is a generalized Hidden Markov Model (gHMM)-based feature extraction method, implemented on a distributed processing platform for real-time operation. For real-time processing, the gHMM is used for feature extraction on stochastic sequential data, while classification and clustering analysis are implemented using an SVM. In addition, the ML-based algorithms are designed to scale to large datasets via real-time distributed processing, and are adaptable to analysis of any stochastic sequential dataset. The ML software has also been integrated into the NTD Nanoscope [2] for “real-time” pattern-recognition informed (PRI) feedback [1–3] (see Chapter 14 for results). The methods used to implement the PRI feedback include distributed HMM and SVM implementations, which provide the processing speedup that is needed.
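      The following sketch illustrates the HMM-feature-extraction-followed-by-SVM-classification pattern just described. The simple per-segment features used here (state occupancy, switching rate, state levels) are an illustrative stand-in for the gHMM/SSA features, not the NTD Nanoscope implementation; hmmlearn and scikit-learn are assumed available.

```python
# Sketch of the "HMM features -> SVM classification" pattern.
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.svm import SVC

def hmm_features(segment, n_states=2):
    """Fit a small HMM to one blockade segment; emit a fixed-length feature vector."""
    X = segment.reshape(-1, 1)
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag",
                      n_iter=25, random_state=0).fit(X)
    states = hmm.predict(X)
    occupancy = np.bincount(states, minlength=n_states) / len(states)
    switches = np.count_nonzero(np.diff(states)) / len(states)
    levels = np.sort(hmm.means_.ravel())
    return np.concatenate([occupancy, [switches], levels])

# Toy two-class data: blockade segments with different level separations.
rng = np.random.default_rng(1)
def toy_segment(sep):
    s = rng.integers(0, 2, 2000)
    return np.where(s == 0, 0.4, 0.4 + sep) + 0.02 * rng.standard_normal(2000)

segments = [toy_segment(0.1) for _ in range(20)] + [toy_segment(0.3) for _ in range(20)]
labels = [0] * 20 + [1] * 20
features = np.array([hmm_features(seg) for seg in segments])

clf = SVC(kernel="rbf", gamma="scale").fit(features[::2], labels[::2])   # train on half
print("held-out accuracy:", clf.score(features[1::2], labels[1::2]))     # test on the rest
```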

      1.9.2 Nanoscope Cheminformatics – A Case Study for Device “Smartening”

      ML provides a solution to the “Big Data” problem, whereby a vast amount of data is distilled down to its information essence. The ML solution sought is usually required to perform some task on the raw data, such as classification (of images) or translation of text from one language to another. In doing so, ML solutions that also clearly elucidate the features used in the classification are strongly favored. This opens up a more standard engineering design cycle, in which the stronger features thereby identified can be given a larger role, or can guide the refinement of related strong features, to arrive at an improved classifier. This is what is accomplished with the previously mentioned SSA Protocol.

      So, given the flexibility of the SSA Protocol to “latch on” to signal that has a reasonable set of features, you might ask what is left? (Note that all communication protocols, both natural (genomic) and man-made, have a “reasonable” set of features.) The answer is simply those cases where the number of features is “unreasonable” (with the enumeration typically not even known). So instead of 100 features, or maybe 1000, we now have a situation with 100 000 to hundreds of millions of features (such as with sentence translation or complex image classification). Obviously Big Data is necessary to learn with such a huge number of features present, so we are truly in the realm of Big Data to even begin with such problems, but we now also have the Big Features issue (i.e. Big Data with Big Features, or BDwBF). What must occur in such problems is a means to wrangle the almost intractably large feature set of information down to a much smaller feature set of information, i.e. an initial layer of processing is needed just to compress the feature data. In essence, we need a form of compressive feature extraction at the outset in order not to overwhelm the acquisition process. An example from the biology of the human eye is the layer of local neural processing at the retina before the nerve impulses even travel on to the brain for further layers of neural processing.
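      As a minimal sketch of such a compressive front end (an illustration, not a method prescribed in the text), a random projection can map an enormous raw feature vector down to a far smaller one while approximately preserving pairwise distances (Johnson–Lindenstrauss), so that downstream learning starts from a tractable representation. The dimensions and the scikit-learn routine below are illustrative assumptions.

```python
# Sketch of compressive feature extraction via random projection.
import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.default_rng(0)
raw = rng.standard_normal((100, 100_000))        # 100 samples, 100 000 raw features

# Project down to 1000 features before any downstream learning.
compressor = SparseRandomProjection(n_components=1000, dense_output=True,
                                    random_state=0)
compressed = compressor.fit_transform(raw)       # shape: (100, 1000)

print(raw.shape, "->", compressed.shape)
# Pairwise distances are approximately preserved, so later classification
# or clustering can work on the compressed features.
```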

      Throughout the text an effort is made to provide mathematical specifics so that the theoretical underpinnings of the methods are clearly understood. This provides a strong exposition of the theory, but the motivation is not to do more theory; rather, it is to proceed to a clearly defined computational implementation. This is where mathematical elegance meets implementation/computational practicality (and the latter wins). In this text, the focus is almost entirely on elegant methods that also have highly efficient computational implementations.