Группа авторов

Computational Statistics in Data Science


Скачать книгу

and its gradient. The generic nature of HMC has opened up possibilities for complex Bayesian modeling as early as Neal [80], but its performance is highly sensitive to model parameterization and its three tuning parameters, commonly referred to as trajectory length, step size, and mass matrix [27]. Tuning issues constitute a major obstacle to the wider adoption of the algorithm, as evidenced by the development history of the popular HMC‐based probabilistic programming software Stan [81], which employs the No‐U‐Turn sampler (NUTS) of Hoffman and Gelman [82] to make HMC user‐friendly by obviating the need to tune its trajectory length. Bayesian software packages such as Stan empirically adapt the remaining step size and mass matrix [83]; this approach helps make the use of HMC automatic though is not without issues [84] and comes at the cost of significant computational overhead.

      Although HMC is a powerful algorithm that has played a critical role in the emergence of general‐purpose Bayesian inference software, the challenges involved in its practical deployment also demonstrate how an algorithm – no matter how versatile and efficient at its best – is not necessarily useful unless it can be made easy for practitioners to use. It is also unlikely that one algorithm works well in all situations. In fact, there are many distributions on which HMC performs poorly [83, 85, 86]. Additionally, HMC is incapable of handling discrete distributions in a fully general manner despite the progresses made in extending HMC to such situations [87, 88].

      On the other hand, Gibbs samplers often require little tuning and can take advantage of highly optimized algorithms for each conditional update, as done in the examples of Section 3. A clear advantage of the Gibbs sampler is that it tends to make software implementation quite modular; for example, each conditional update can be replaced with the latest state‐of‐the‐art samplers as they appear [92], and adding a new feature may amount to no more than adding a single conditional update [75]. In this way, an algorithm may not work in a completely model‐agnostic manner but with a broad enough scope can serve as a valuable recipe or meta‐algorithm for building model‐specific algorithms and software. The same is true for optimization methods. Even though its “E”‐step requires a derivation (by hand) for each new model, the EM algorithm [93] enables maximum‐likelihood estimation for a wide range of models. Similarly, variational inference (VI) for approximate Bayes requires manual derivations but provides a general framework to turn posterior computation into an optimization problem [94]. As meta‐algorithms, both EM and VI expand their breadth of use by replacing analytical derivations with Monte Carlo estimators but suffer losses in statistical and computational efficiency [95, 96]. Indeed, such trade‐offs will continue to haunt the creation of fast, flexible, and friendly statistical algo‐ware well into the twenty‐first century.

      4.2 Hardware‐Optimized Inference

      But successful statistical inference software must also interact with computational hardware in an optimal manner. Growing datasets require the computational statistician to give more and more thought to how the computer implements any statistical algorithm. To effectively leverage computational resources, the statistician must (i) identify the routine's computational bottleneck (Section 2.1) and (ii) algorithmically map this rate‐limiting step to available hardware such as a multicore or vectorized CPU, a many‐core GPU, or – in the future – a quantum computer. Sometimes, the first step is clear theoretically: a naive implementation of the high‐dimensional regression example of Section 3.1 requires an order script í’ª left-parenthesis upper N squared upper P right-parenthesis matrix multiplication followed by an order script í’ª left-parenthesis upper P cubed right-parenthesis Cholesky decomposition. Other times, one can use an instruction‐level program profiler, such as INTEL VTUNE (Windows, Linux) or INSTRUMENTS (OSX), to identify a performance bottleneck. Once the bottleneck is identified, one must choose between computational resources, or some combination thereof, based on relative strengths and weaknesses as well as natural parallelism of the target task.

      While a CPU may have tens of cores, GPUs accomplish fine‐grained parallelization with thousands of cores that apply a single instruction set to distinct data within smaller workgroups of tens or hundreds of cores. Quick communication and shared cache memory within each workgroup balance full parallelization across groups, and dynamic on‐ and off‐loading of the many tasks hide the latency that is so problematic for multicore computing. Originally designed for efficiently parallelized matrix math calculations arising from image rendering and transformation, GPUs easily speed up tasks that are tensor multiplication intensive such as deep learning [99] but general‐purpose GPU applications abound. Holbrook et al. [21] provide a larger review of parallel computing within computational statistics. The same paper reports a GPU providing 200‐fold speedups over single‐core processing and 10‐fold speedups over 12‐core AVX processing for likelihood and gradient calculations while sampling from a Bayesian multidimensional scaling posterior using HMC at scale. Holbrook et al. [22] report similar speedups for inference based on spatiotemporal Hawkes processes. Neither application involves matrix or tensor manipulations.