Computational Statistics in Data Science

of tasks infeasible and (ii) a moderate number of parameters can render a task impractical when there are a large number of observations. These two insights extend to more complicated models: the same complexity analysis holds for the fitting of generalized linear models (GLMs) as described in McCullagh and Nelder [12].
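To make the complexity claim concrete, here is a small sketch (illustrative NumPy code, not from the book) that times the weighted least-squares solve at the core of iteratively reweighted least squares for GLM fitting: forming the normal equations costs on the order of $NP^2$ operations plus $P^3$ for the solve, so growth in either $N$ or $P$ is felt directly. The function name `wls_solve` and the chosen problem sizes are illustrative assumptions.

```python
import time
import numpy as np

def wls_solve(X, W, z):
    """One weighted least-squares solve, the core step of IRLS for a GLM.

    Forming X^T W X costs O(N P^2); solving the resulting P x P system costs O(P^3).
    """
    XtW = X.T * W                 # scale columns of X^T by the diagonal weights, O(N P)
    A = XtW @ X                   # normal-equations matrix, O(N P^2)
    b = XtW @ z                   # right-hand side, O(N P)
    return np.linalg.solve(A, b)  # O(P^3)

rng = np.random.default_rng(0)
for N, P in [(10_000, 10), (100_000, 10), (10_000, 100)]:
    X = rng.standard_normal((N, P))
    W = rng.uniform(0.5, 1.5, size=N)   # working weights, as an IRLS iteration would produce
    z = rng.standard_normal(N)          # working response
    start = time.perf_counter()
    wls_solve(X, W, z)
    print(f"N={N:>7}, P={P:>3}: {time.perf_counter() - start:.4f} s")
```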

      In the context of Bayesian inference, the length $P$ of the vector $\boldsymbol{\theta}$ dictates the dimension of the MCMC state space. For the M‐H algorithm (Section 2.1) with a $P$-dimensional Gaussian target and proposal, Gelman et al. [25] show that the proposal distribution's covariance should be scaled by a factor inversely proportional to $P$. Hence, as the dimension of the state space grows, it behooves one to propose states $\boldsymbol{\theta}^*$ that are closer to the current state of the Markov chain, and one must greatly increase the number $S$ of MCMC iterations. At the same time, an increasing $P$ often slows down rate-limiting likelihood calculations (Section 2.1). Taken together, one must generate many more, much slower MCMC iterations. The wide applicability of latent variable models [26] (Sections 3.1 and 3.2) for which each observation has its own parameter set (e.g., $P \propto N$) means M‐H simply does not work for a huge class of models popular with practitioners.
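To see the dimension dependence in action, the following sketch (illustrative code, not from the book) runs random-walk M‐H on a $P$-dimensional standard Gaussian target with proposal covariance $(2.38^2/P)\,\mathbf{I}$, a scaling inversely proportional to $P$ in the spirit of the result cited above. The acceptance rate hovers near the asymptotically optimal value of roughly 0.234 only because the proposal steps shrink as $P$ grows, which is exactly why many more iterations are needed.

```python
import numpy as np

def rw_metropolis(log_target, theta0, n_iter, rng):
    """Random-walk Metropolis-Hastings with proposal covariance (2.38^2 / P) * I."""
    P = theta0.size
    step_sd = 2.38 / np.sqrt(P)           # per-coordinate proposal standard deviation
    theta = theta0.copy()
    log_p = log_target(theta)
    accepted = 0
    for _ in range(n_iter):
        proposal = theta + step_sd * rng.standard_normal(P)
        log_p_prop = log_target(proposal)
        if np.log(rng.uniform()) < log_p_prop - log_p:   # M-H accept/reject
            theta, log_p = proposal, log_p_prop
            accepted += 1
    return theta, accepted / n_iter

log_std_normal = lambda x: -0.5 * np.dot(x, x)   # unnormalized log-density of N(0, I)
rng = np.random.default_rng(1)
for P in (10, 100, 1000):
    _, rate = rw_metropolis(log_std_normal, np.zeros(P), 5_000, rng)
    print(f"P={P:>5}: acceptance rate ~ {rate:.2f}")
```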

      Hamiltonian Monte Carlo (HMC) counters this by augmenting the state space with an auxiliary Gaussian momentum vector $\mathbf{p}$ having covariance (mass) matrix $\mathbf{M}$. The resulting Hamiltonian is

$$H(\boldsymbol{\theta}, \mathbf{p}) \propto -\log\left(\pi(\boldsymbol{\theta} \mid \mathbf{X}) \times \exp\left(-\mathbf{p}^{T}\mathbf{M}^{-1}\mathbf{p}/2\right)\right) = -\log \pi(\boldsymbol{\theta} \mid \mathbf{X}) + \mathbf{p}^{T}\mathbf{M}^{-1}\mathbf{p}/2$$

      and we produce proposals by simulating the system according to Hamilton's equations

$$\begin{aligned} \dot{\boldsymbol{\theta}} &= \frac{\partial}{\partial \mathbf{p}} H(\boldsymbol{\theta}, \mathbf{p}) = \mathbf{M}^{-1}\mathbf{p} \\ \dot{\mathbf{p}} &= -\frac{\partial}{\partial \boldsymbol{\theta}} H(\boldsymbol{\theta}, \mathbf{p}) = \nabla \log \pi(\boldsymbol{\theta} \mid \mathbf{X}) \end{aligned}$$
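In practice, one discretizes Hamilton's equations, most commonly with the leapfrog integrator, and corrects the discretization error with a Metropolis accept/reject step based on the change in $H$. The sketch below (illustrative code, not from the book) shows one such HMC step under the simplifying assumption $\mathbf{M} = \mathbf{I}$, so that $\mathbf{M}^{-1}\mathbf{p} = \mathbf{p}$; the step size and number of leapfrog steps are arbitrary choices.

```python
import numpy as np

def hmc_step(theta, log_pi, grad_log_pi, step_size, n_leapfrog, rng):
    """One HMC transition: leapfrog-simulate Hamilton's equations, then accept/reject.

    Assumes an identity mass matrix M, so the kinetic energy is p^T p / 2.
    """
    p = rng.standard_normal(theta.size)               # refresh the Gaussian momentum
    theta_new, p_new = theta.copy(), p.copy()

    # Leapfrog discretization of d(theta)/dt = p and dp/dt = grad log pi(theta | X)
    p_new += 0.5 * step_size * grad_log_pi(theta_new)
    for _ in range(n_leapfrog - 1):
        theta_new += step_size * p_new
        p_new += step_size * grad_log_pi(theta_new)
    theta_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_pi(theta_new)

    # Metropolis correction for the discretization error, via the change in H
    h_old = -log_pi(theta) + 0.5 * np.dot(p, p)
    h_new = -log_pi(theta_new) + 0.5 * np.dot(p_new, p_new)
    return theta_new if np.log(rng.uniform()) < h_old - h_new else theta

# Toy usage: standard Gaussian target, for which grad log pi(theta) = -theta
log_pi = lambda t: -0.5 * np.dot(t, t)
grad_log_pi = lambda t: -t
rng = np.random.default_rng(2)
theta = np.zeros(10)
draws = []
for _ in range(1_000):
    theta = hmc_step(theta, log_pi, grad_log_pi, step_size=0.1, n_leapfrog=20, rng=rng)
    draws.append(theta)
print("per-coordinate sample variance ~", np.var(draws, axis=0).mean())  # should be near 1
```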

      2.3 Big M

      Global optimization, or the problem of finding the minimum of a function with arbitrarily many local minima, is NP‐complete in general [30], meaning – in layman's terms – it is impossibly hard. In the absence of a tractable theory, by which one might prove one's global optimization procedure works, brute‐force grid and random searches and heuristic methods such as particle swarm optimization [31] and genetic algorithms [32] have been popular. Due to the overwhelming difficulty of global optimization, a large portion of the optimization literature has focused on the particularly well‐behaved class of convex functions [33, 34], which do not admit multiple local minima. Since Fisher introduced his “maximum likelihood” in 1922 [35], statisticians have thought in terms of maximization, but convexity theory still applies by a trivial negation of the objective function. Nonetheless, most statisticians safely ignored concavity during the twentieth century: exponential family likelihoods are log‐concave, so Newton–Raphson and Fisher scoring are guaranteed to find the global optimum in the context of GLMs [12, 34].
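As a small illustration of that guarantee (not code from the book), the sketch below applies Newton–Raphson, which coincides with Fisher scoring under the canonical logit link, to a logistic-regression log-likelihood. Because the objective is concave, the stationary point reached is the global maximizer; the simulated data and tolerances are arbitrary choices.

```python
import numpy as np

def logistic_newton(X, y, max_iter=25, tol=1e-10):
    """Newton-Raphson (= Fisher scoring for the canonical logit link) for logistic regression.

    The log-likelihood is concave, so the stationary point found is the global maximizer.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))   # fitted probabilities
        gradient = X.T @ (y - mu)               # score vector
        W = mu * (1.0 - mu)                     # IRLS working weights
        hessian = -(X.T * W) @ X                # Hessian = minus the Fisher information here
        step = np.linalg.solve(hessian, gradient)
        theta = theta - step                    # Newton update
        if np.linalg.norm(step) < tol:
            break
    return theta

# Toy usage on simulated data
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.standard_normal((500, 2))])
true_theta = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_theta)))
print("Newton-Raphson estimate:", logistic_newton(X, y).round(2))
```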

      Nearing the end of the twentieth century, multimodality and nonconvexity became more important for statisticians considering high‐dimensional regression, that is, regression with many covariates (big $P$). Here, for purposes of interpretability and variance reduction, one would like to induce sparsity on the weights vector $\hat{\boldsymbol{\theta}}$ by performing best subset selection [36, 37]:

$$\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\arg\min}\; \|\mathbf{Y} - \mathbf{X}\boldsymbol{\theta}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\theta}\|_0 \leq k$$

      where $0 < k \leq P$, and $\|\cdot\|_0$ denotes the $\ell_0$-norm, that is, the number of nonzero elements. Because best subset selection requires an immensely difficult nonconvex optimization, Tibshirani [38] famously replaces the $\ell_0$-norm with the $\ell_1$-norm, thereby providing