Computational Statistics in Data Science

$$
\mathbf{y} \sim \mathcal{N}_N\left(\mathbf{X}\boldsymbol{\Gamma}\boldsymbol{\theta},\, \sigma^2\mathbf{I}_N\right)
\quad\text{for}\quad
[\boldsymbol{\Gamma}]_{pp'} =
\begin{cases}
\gamma_p \sim \operatorname{Bernoulli}(\pi) & p = p' \\
0 & p \neq p'
\end{cases}
\quad\text{and}\quad \pi \in (0, 1)
$$
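      As a concrete illustration, the following NumPy sketch simulates one dataset from this spike‐and‐slab model and counts the inclusion patterns a posterior sampler must navigate. The dimensions, inclusion probability, and noise level are illustrative values, not taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative dimensions and hyperparameters (not from the text)
    N, P = 100, 20
    pi_incl = 0.1   # prior inclusion probability pi in (0, 1)
    sigma = 1.0     # residual standard deviation

    X = rng.normal(size=(N, P))
    theta = rng.normal(size=P)

    # Gamma is diagonal with Bernoulli(pi) entries; gamma_p = 1 keeps covariate p
    gamma = rng.binomial(1, pi_incl, size=P)

    # y ~ N_N(X Gamma theta, sigma^2 I_N)
    y = X @ (gamma * theta) + sigma * rng.normal(size=N)

    # Exact posterior inference must weigh all 2^P inclusion patterns --
    # the combinatorial multimodality ("big M") discussed below.
    print(f"possible inclusion patterns: 2**{P} = {2 ** P}")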

      In the following section, we present an alternative Bayesian sparse regression approach that mitigates the combinatorial problem, along with a state‐of‐the‐art computational technique that scales well in both $N$ and $P$.

      These challenges will remain throughout the twenty‐first century, but it is possible to make significant advances for specific statistical tasks or classes of models. Section 3.1 considers Bayesian sparse regression based on continuous shrinkage priors, designed to alleviate the heavy multimodality (big $M$) of the more traditional spike‐and‐slab approach. This model presents a major computational challenge as $N$ and $P$ grow, but a recent computational advance makes the posterior inference feasible for many modern large‐scale applications.

      The rise of data science also creates increasing opportunities for computational statistics to grow by enabling and extending statistical inference for scientific applications previously outside mainstream statistics. Here, the science may dictate the development of structured models whose complexity possibly grows in $N$ and $P$. Section 3.2 presents a method for fast phylogenetic inference, where the primary structure of interest is a “family tree” describing a biological evolutionary history.

      3.1 Bayesian Sparse Regression in the Age of Big N and Big P

      With the goal of identifying a small subset of relevant features among a large number of potential candidates, sparse regression techniques have long featured in a range of statistical and data science applications [46]. Traditionally, such techniques were applied in the “$N \leq P$” setting, and computational algorithms correspondingly focused on this situation [47], especially within the Bayesian literature [48].

      Due to a growing number of initiatives for large‐scale data collection and new types of scientific inquiry made possible by emerging technologies, however, datasets that are simultaneously “big $N$” and “big $P$” are increasingly common. For example, modern observational studies using health‐care databases routinely involve $N \approx 10^5 \sim 10^6$ patients and $P \approx 10^4 \sim 10^5$ clinical covariates [49]. The UK Biobank provides brain imaging data on $N = 100{,}000$ patients, with $P = 100 \sim 200{,}000$ depending on the scientific question of interest [50]. Single‐cell RNA sequencing can generate datasets with $N$ (the number of cells) in the millions and $P$ (the number of genes) in the tens of thousands, with the trend indicating further growth in data size to come [51].

      3.1.1 Continuous shrinkage: alleviating big M

$$
\theta_p \mid \lambda_p, \tau \sim \mathcal{N}\left(0,\, \tau^2\lambda_p^2\right), \qquad
\lambda_p \sim \pi_{\text{local}}(\cdot), \qquad
\tau \sim \pi_{\text{global}}(\cdot)
$$

      The idea is that the global scale parameter $\tau \leq 1$ would shrink most $\theta_p$ toward zero, while the heavy‐tailed prior $\pi_{\text{local}}(\cdot)$ on the local scale parameters $\lambda_p$ allows a small number of coefficients to remain large.
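      To make the shrinkage mechanism concrete, here is a minimal NumPy sketch of a single draw from this prior. The text leaves $\pi_{\text{local}}$ and $\pi_{\text{global}}$ unspecified, so a half‐Cauchy local density (as in the horseshoe prior) and a fixed small global scale are assumed here purely for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    P = 10_000
    tau = 0.1   # small global scale (tau <= 1) shrinks the coefficients overall

    # Assumed choice of pi_local: half-Cauchy, as in the horseshoe prior
    lam = np.abs(rng.standard_cauchy(size=P))   # local scales lambda_p

    # theta_p | lambda_p, tau ~ N(0, tau^2 * lambda_p^2)
    theta = rng.normal(loc=0.0, scale=tau * lam)

    # Most coefficients land near zero, while the heavy Cauchy tail lets a
    # few local scales grow large enough to escape the global shrinkage.
    print(f"fraction |theta_p| < 0.05: {np.mean(np.abs(theta) < 0.05):.2f}")
    print(f"largest |theta_p|: {np.abs(theta).max():.2f}")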