Computational Statistics in Data Science

of tasks infeasible and (ii) a moderate number of parameters can render a task impractical when there are a large number of observations. These two insights extend to more complicated models: the same complexity analysis holds for the fitting of generalized linear models (GLMs) as described in McCullagh and Nelder [12].
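To make the complexity claim concrete, here is a small sketch (illustrative NumPy code, not from the book) that times the weighted least-squares solve at the core of iteratively reweighted least squares for GLM fitting: forming the normal equations costs on the order of $NP^2$ operations plus $P^3$ for the solve, so growth in either $N$ or $P$ is felt directly. The function name `wls_solve` and the chosen problem sizes are illustrative assumptions.

```python
import time
import numpy as np

def wls_solve(X, W, z):
    """One weighted least-squares solve, the core step of IRLS for a GLM.

    Forming X^T W X costs O(N P^2); solving the resulting P x P system costs O(P^3).
    """
    XtW = X.T * W                 # scale columns of X^T by the diagonal weights, O(N P)
    A = XtW @ X                   # normal-equations matrix, O(N P^2)
    b = XtW @ z                   # right-hand side, O(N P)
    return np.linalg.solve(A, b)  # O(P^3)

rng = np.random.default_rng(0)
for N, P in [(10_000, 10), (100_000, 10), (10_000, 100)]:
    X = rng.standard_normal((N, P))
    W = rng.uniform(0.5, 1.5, size=N)   # working weights, as an IRLS iteration would produce
    z = rng.standard_normal(N)          # working response
    start = time.perf_counter()
    wls_solve(X, W, z)
    print(f"N={N:>7}, P={P:>3}: {time.perf_counter() - start:.4f} s")
```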

      In the context of Bayesian inference, the length $P$ of the vector $\boldsymbol{\theta}$ dictates the dimension of the MCMC state space. For the M‐H algorithm (Section 2.1) with a $P$-dimensional Gaussian target and proposal, Gelman et al. [25] show that the proposal distribution's covariance should be scaled by a factor inversely proportional to $P$. Hence, as the dimension of the state space grows, it behooves one to propose states $\boldsymbol{\theta}^*$ that are closer to the current state of the Markov chain, and one must greatly increase the number $S$ of MCMC iterations. At the same time, an increasing $P$ often slows down rate-limiting likelihood calculations (Section 2.1). Taken together, one must generate many more, much slower MCMC iterations. The wide applicability of latent variable models [26] (Sections 3.1 and 3.2) for which each observation has its own parameter set (e.g., $P \propto N$) means M‐H simply does not work for a huge class of models popular with practitioners.
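To see the dimension dependence in action, the following sketch (illustrative code, not from the book) runs random-walk M‐H on a $P$-dimensional standard Gaussian target with proposal covariance $(2.38^2/P)\,\mathbf{I}$, a scaling inversely proportional to $P$ in the spirit of the result cited above. The acceptance rate hovers near the asymptotically optimal value of roughly 0.234 only because the proposal steps shrink as $P$ grows, which is exactly why many more iterations are needed.

```python
import numpy as np

def rw_metropolis(log_target, theta0, n_iter, rng):
    """Random-walk Metropolis-Hastings with proposal covariance (2.38^2 / P) * I."""
    P = theta0.size
    step_sd = 2.38 / np.sqrt(P)           # per-coordinate proposal standard deviation
    theta = theta0.copy()
    log_p = log_target(theta)
    accepted = 0
    for _ in range(n_iter):
        proposal = theta + step_sd * rng.standard_normal(P)
        log_p_prop = log_target(proposal)
        if np.log(rng.uniform()) < log_p_prop - log_p:   # M-H accept/reject
            theta, log_p = proposal, log_p_prop
            accepted += 1
    return theta, accepted / n_iter

log_std_normal = lambda x: -0.5 * np.dot(x, x)   # unnormalized log-density of N(0, I)
rng = np.random.default_rng(1)
for P in (10, 100, 1000):
    _, rate = rw_metropolis(log_std_normal, np.zeros(P), 5_000, rng)
    print(f"P={P:>5}: acceptance rate ~ {rate:.2f}")
```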

      Hamiltonian Monte Carlo (HMC) counters this by augmenting the state space with an auxiliary Gaussian momentum vector $\mathbf{p}$ having covariance (mass) matrix $\mathbf{M}$. The resulting Hamiltonian is

$$H(\boldsymbol{\theta}, \mathbf{p}) \propto -\log\left(\pi(\boldsymbol{\theta} \mid \mathbf{X}) \times \exp\left(-\mathbf{p}^{T}\mathbf{M}^{-1}\mathbf{p}/2\right)\right) = -\log \pi(\boldsymbol{\theta} \mid \mathbf{X}) + \mathbf{p}^{T}\mathbf{M}^{-1}\mathbf{p}/2$$

      and we produce proposals by simulating the system according to Hamilton's equations

$$\begin{aligned} \dot{\boldsymbol{\theta}} &= \frac{\partial}{\partial \mathbf{p}} H(\boldsymbol{\theta}, \mathbf{p}) = \mathbf{M}^{-1}\mathbf{p} \\ \dot{\mathbf{p}} &= -\frac{\partial}{\partial \boldsymbol{\theta}} H(\boldsymbol{\theta}, \mathbf{p}) = \nabla \log \pi(\boldsymbol{\theta} \mid \mathbf{X}) \end{aligned}$$
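In practice, one discretizes Hamilton's equations, most commonly with the leapfrog integrator, and corrects the discretization error with a Metropolis accept/reject step based on the change in $H$. The sketch below (illustrative code, not from the book) shows one such HMC step under the simplifying assumption $\mathbf{M} = \mathbf{I}$, so that $\mathbf{M}^{-1}\mathbf{p} = \mathbf{p}$; the step size and number of leapfrog steps are arbitrary choices.

```python
import numpy as np

def hmc_step(theta, log_pi, grad_log_pi, step_size, n_leapfrog, rng):
    """One HMC transition: leapfrog-simulate Hamilton's equations, then accept/reject.

    Assumes an identity mass matrix M, so the kinetic energy is p^T p / 2.
    """
    p = rng.standard_normal(theta.size)               # refresh the Gaussian momentum
    theta_new, p_new = theta.copy(), p.copy()

    # Leapfrog discretization of d(theta)/dt = p and dp/dt = grad log pi(theta | X)
    p_new += 0.5 * step_size * grad_log_pi(theta_new)
    for _ in range(n_leapfrog - 1):
        theta_new += step_size * p_new
        p_new += step_size * grad_log_pi(theta_new)
    theta_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_pi(theta_new)

    # Metropolis correction for the discretization error, via the change in H
    h_old = -log_pi(theta) + 0.5 * np.dot(p, p)
    h_new = -log_pi(theta_new) + 0.5 * np.dot(p_new, p_new)
    return theta_new if np.log(rng.uniform()) < h_old - h_new else theta

# Toy usage: standard Gaussian target, for which grad log pi(theta) = -theta
log_pi = lambda t: -0.5 * np.dot(t, t)
grad_log_pi = lambda t: -t
rng = np.random.default_rng(2)
theta = np.zeros(10)
draws = []
for _ in range(1_000):
    theta = hmc_step(theta, log_pi, grad_log_pi, step_size=0.1, n_leapfrog=20, rng=rng)
    draws.append(theta)
print("per-coordinate sample variance ~", np.var(draws, axis=0).mean())  # should be near 1
```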

      2.3 Big M

      Global optimization, or the problem of finding the minimum of a function with arbitrarily many local minima, is NP‐complete in general [30], meaning – in layman's terms – it is impossibly hard. In the absence of a tractable theory, by which one might prove one's global optimization procedure works, brute‐force grid and random searches and heuristic methods such as particle swarm optimization [31] and genetic algorithms [32] have been popular. Due to the overwhelming difficulty of global optimization, a large portion of the optimization literature has focused on the particularly well‐behaved class of convex functions [33, 34], which do not admit multiple local minima. Since Fisher introduced his “maximum likelihood” in 1922 [35], statisticians have thought in terms of maximization, but convexity theory still applies by a trivial negation of the objective function. Nonetheless, most statisticians safely ignored concavity during the twentieth century: exponential family likelihoods are log‐concave, so Newton–Raphson and Fisher scoring are guaranteed to find the global optimum in the context of GLMs [12, 34].
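As a small illustration of that guarantee (not code from the book), the sketch below applies Newton–Raphson, which coincides with Fisher scoring under the canonical logit link, to a logistic-regression log-likelihood. Because the objective is concave, the stationary point reached is the global maximizer; the simulated data and tolerances are arbitrary choices.

```python
import numpy as np

def logistic_newton(X, y, max_iter=25, tol=1e-10):
    """Newton-Raphson (= Fisher scoring for the canonical logit link) for logistic regression.

    The log-likelihood is concave, so the stationary point found is the global maximizer.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))   # fitted probabilities
        gradient = X.T @ (y - mu)               # score vector
        W = mu * (1.0 - mu)                     # IRLS working weights
        hessian = -(X.T * W) @ X                # Hessian = minus the Fisher information here
        step = np.linalg.solve(hessian, gradient)
        theta = theta - step                    # Newton update
        if np.linalg.norm(step) < tol:
            break
    return theta

# Toy usage on simulated data
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.standard_normal((500, 2))])
true_theta = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_theta)))
print("Newton-Raphson estimate:", logistic_newton(X, y).round(2))
```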

      Nearing the end of the twentieth century, multimodality and nonconvexity became more important for statisticians considering high‐dimensional regression, that is, regression with many covariates (big $P$). Here, for purposes of interpretability and variance reduction, one would like to induce sparsity on the weights vector $\hat{\boldsymbol{\theta}}$ by performing best subset selection [36, 37]:

$$\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\arg\min}\; \|\mathbf{Y} - \mathbf{X}\boldsymbol{\theta}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\theta}\|_0 \leq k$$

      where $0 < k \leq P$, and $\|\cdot\|_0$ denotes the $\ell_0$-norm, that is, the number of nonzero elements. Because best subset selection requires an immensely difficult nonconvex optimization, Tibshirani [38] famously replaces the $\ell_0$-norm with the $\ell_1$-norm, thereby providing