Group of authors

Computational Statistics in Data Science



… (where $M$ is the number of 1s) and could be useful for generating p‐values within Monte Carlo simulation from a null distribution (Section 2.1); to obtain the gradient of a function (e.g., the log‐likelihood for Fisher scoring or HMC) with a quantum computer, one only needs to evaluate the function once [103], as opposed to $\mathcal{O}(P)$ times for numerical differentiation, and there is nothing stopping the statistician from using, say, a GPU for this single function call; and finally, the HHL algorithm [104] obtains the scalar value $\mathbf{q}^T \mathbf{M} \mathbf{q}$ for the $P$‐vector $\mathbf{q}$ satisfying $\mathbf{A}\mathbf{q} = \mathbf{b}$ and $\mathbf{M}$ a $P \times P$ matrix in time $\mathcal{O}(\log(P \kappa^2))$, delivering an exponential speedup over classical methods. Technical caveats exist [105], but HHL may find use within high‐dimensional hypothesis testing (big $P$). Under the null hypothesis, one can rewrite the score test statistic

$$
\mathbf{u}^T(\hat{\boldsymbol{\theta}}_0)\, \mathcal{I}^{-1}(\hat{\boldsymbol{\theta}}_0)\, \mathbf{u}(\hat{\boldsymbol{\theta}}_0)
\quad \text{as} \quad
\mathbf{u}^T(\hat{\boldsymbol{\theta}}_0)\, \mathcal{I}^{-1}(\hat{\boldsymbol{\theta}}_0)\, \mathcal{I}(\hat{\boldsymbol{\theta}}_0)\, \mathcal{I}^{-1}(\hat{\boldsymbol{\theta}}_0)\, \mathbf{u}(\hat{\boldsymbol{\theta}}_0)
$$

      for $\mathcal{I}(\hat{\boldsymbol{\theta}}_0)$ and $\mathbf{u}(\hat{\boldsymbol{\theta}}_0)$, the Fisher information and log‐likelihood gradient evaluated at the maximum‐likelihood solution under the null hypothesis. Letting $\mathbf{A} = \mathcal{I}(\hat{\boldsymbol{\theta}}_0) = \mathbf{M}$ and $\mathbf{b} = \mathbf{u}(\hat{\boldsymbol{\theta}}_0)$, one may write the test statistic as $\mathbf{q}^T \mathbf{M} \mathbf{q}$ and obtain it in time logarithmic in $P$. When the model design matrix $\mathbf{X}$ is sufficiently sparse – a common enough occurrence in large‐scale regression – to render $\mathcal{I}(\hat{\boldsymbol{\theta}}_0)$ itself sparse, the last criterion for the application of the HHL algorithm is met.
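      The rewriting above can be checked classically on a toy problem. The following is a minimal sketch, not an implementation of HHL: it solves $\mathbf{A}\mathbf{q} = \mathbf{b}$ by dense Gaussian elimination (cubic in $P$) and confirms that $\mathbf{q}^T \mathbf{M} \mathbf{q}$ with $\mathbf{A} = \mathbf{M}$ matches the usual form $\mathbf{u}^T \mathcal{I}^{-1} \mathbf{u}$. The function names and the 3 × 3 "information" matrix and "score" vector are hypothetical values chosen only for illustration.

```python
# Classical check of the identity used for the HHL-based score test:
# if q solves A q = b with A = M = Fisher information and b = score,
# then q^T M q equals u^T I^{-1} u. All names and numbers are illustrative.

def solve_linear(A, b):
    """Solve A q = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [A | b] so the inputs are not modified.
    aug = [[float(x) for x in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    q = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * q[j] for j in range(i + 1, n))
        q[i] = (aug[i][n] - tail) / aug[i][i]
    return q

def quad_form(q, M):
    """Return the scalar q^T M q."""
    n = len(q)
    return sum(q[i] * M[i][j] * q[j] for i in range(n) for j in range(n))

def score_statistic(info, score):
    """Return (q^T M q, u^T q) with A = M = info and b = score."""
    q = solve_linear(info, score)
    via_qMq = quad_form(q, info)                       # form targeted by HHL
    via_uq = sum(u * qi for u, qi in zip(score, q))    # u^T I^{-1} u
    return via_qMq, via_uq

# Hypothetical sparse (tridiagonal) information matrix and score vector.
info = [[4.0, 1.0, 0.0],
        [1.0, 3.0, 1.0],
        [0.0, 1.0, 2.0]]
score = [1.0, 2.0, 3.0]

via_qMq, via_uq = score_statistic(info, score)
print(via_qMq, via_uq)  # the two values agree up to rounding error
```

      The two numbers agree because $\mathbf{M}\mathbf{q} = \mathbf{u}$ when $\mathbf{A} = \mathbf{M}$, so the extra factor of $\mathcal{I}$ in the rewritten statistic cancels one of the inverses. The sketch illustrates only this algebra; the logarithmic running time described in the text belongs to the quantum algorithm, not to this dense solver.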

      Core Challenges 4 and 5 – fast, flexible, and user‐friendly algo‐ware and hardware‐optimized inference – embody an increasing emphasis on application and implementation in the age of data science. Previously undervalued contributions in statistical computing, for example, hardware utilization, database methodology, computer graphics, statistical software engineering, and the human–computer interface [76], are slowly taking on greater importance within the (rather conservative) discipline of statistics. There is perhaps no better illustration of this trend than Dr. Hadley Wickham's winning the prestigious COPSS Presidents' Award for 2019

      [for] influential work in statistical computing, visualization, graphics, and data analysis; for developing and implementing an impressively comprehensive computational infrastructure for data analysis through R software; for making statistical thinking and computing accessible to large audience; and for enhancing an appreciation for the important role of statistics among data scientists [106].

      This success is all the more impressive because Presidents' Awardees have historically been recognized for contributions to statistical theory and methodology, not for scientific software development of the kind Dr. Wickham has produced for data manipulation [107–109] and visualization [110, 111].

      All of this might lead one to ask: does the success of data science portend the declining significance of computational statistics and its Core Challenges? Not at all! At the most basic level, data science's emphasis on application and implementation underscores the need for computational thinking in statistics. Moreover, the scientific breadth of data science brings new applications and models to the attention of statisticians, and these models may require or inspire novel algorithmic techniques. Indeed, we look forward to a golden age of computational statistics, in which statisticians labor within the intersections of mathematics, parallel computing, database methodologies, and software engineering with impact on the entirety of the applied sciences. After all, significant progress toward conquering the Core Challenges of computational statistics requires that we use every tool at our collective disposal.

      1 Statistical inference is an umbrella term for hypothesis testing, point estimation, and the generation of (confidence or credible) intervals for population functionals (mean, median, correlations, etc.) or model parameters.

      2 We present the problem of phylogenetic reconstruction in Section 3.2 as one such example arising from the field of molecular epidemiology.

      3 The use of “N” and “P” to denote observation and parameter count is common. We have taken liberties in coining the use of “M” to denote mode count.

      4 A more numerically stable approach has the same complexity [24].

      5