Various authors

Computational Statistics in Data Science



…real‐world data into actionable conclusions. The ubiquity of data in all economic sectors and scientific disciplines makes data science eminently relevant to cohorts of researchers for whom the discipline of statistics was previously closed off and esoteric. Data science's emphasis on practical application only enhances the importance of computational statistics, the interface between statistics and computer science primarily concerned with the development of algorithms producing either statistical inference¹ or predictions. Since both of these products comprise essential tasks in any data scientific workflow, we believe that the pan‐disciplinary nature of data science only increases the number of opportunities for computational statistics to evolve by taking on new applications² and serving the needs of new groups of researchers.

      But Core Challenges 2 and 3 will also endure: data complexity often increases with size, and researchers strive to understand increasingly complex phenomena. Because many examples of big data become “big” by combining heterogeneous sources, big data often necessitate big models. With the help of two recent examples, Section 3 illustrates how computational statisticians make headway at the intersection of big data and big models with model‐specific advances. In Section 3.1, we present recent work in Bayesian inference for big N and big P regression. Beyond the simplified regression setting, data often come with structures (e.g., spatial, temporal, and network), and correct inference must take these structures into account. For this reason, we present novel computational methods for a highly structured and hierarchical model for the analysis of multistructured and epidemiological data in Section 3.2.

      The growth of model complexity leads to new inferential challenges. While we define Core Challenges 1–3 in terms of generic target distributions or objective functions, Core Challenge 4 arises from inherent difficulties in treating complex models generically. Core Challenge 4 (Section 4.1) describes the difficulties and trade‐offs that must be overcome to create fast, flexible, and friendly “algo‐ware”. This Core Challenge requires the development of statistical algorithms that maintain efficiency despite model structure and, thus, apply to a wider swath of target distributions or objective functions “out of the box”. Such generic algorithms typically require little cleverness or creativity to implement, limiting the amount of time data scientists must spend worrying about computational details. Moreover, they aid the development of flexible statistical software that adapts to complex model structure in a way that users easily understand. But it is not enough that software be flexible and easy to use: mapping computations to computer hardware for optimal implementations remains difficult. In Section 4.2, we argue that Core Challenge 5, effective use of computational resources such as central processing units (CPU), graphics processing units (GPU), and quantum computers, will become increasingly central to the work of the computational statistician as data grow in magnitude.

      2.1 Big N

Having a large number of observations makes different computational methods difficult in different ways. In a worst-case scenario, the exact permutation test requires the production of $N!$ datasets. Cheaper alternatives, resampling methods such as the Monte Carlo permutation test or the bootstrap, may require anywhere from thousands to hundreds of thousands of randomly produced datasets [8, 10]. When, say, population means are of interest, each Monte Carlo iteration requires summations involving $N$ expensive memory accesses. Another example of a computationally intensive model is Gaussian process regression [16, 17]; it is a popular nonparametric approach, but the exact method for fitting the model and predicting future values requires matrix inversions that scale $\mathcal{O}(N^3)$. As the rest of the calculations require relatively negligible computational effort, we say that matrix inversions represent the computational bottleneck for Gaussian process regression.
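To make the resampling cost concrete, the following is a minimal sketch (ours, not from the chapter) of a Monte Carlo permutation test for a difference in two group means; the group sizes, the number of permutations B, and the synthetic data are illustrative assumptions. Each of the B iterations performs summations over all $N$ observations, which is exactly the repeated $\mathcal{O}(N)$ work described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: N = 1000 observations split into two groups (assumed sizes).
x = rng.normal(loc=0.0, scale=1.0, size=500)
y = rng.normal(loc=0.3, scale=1.0, size=500)
data = np.concatenate([x, y])
n_x = x.size

observed = x.mean() - y.mean()          # observed difference in group means

B = 10_000                              # number of Monte Carlo permutations (assumed)
exceed = 0
for _ in range(B):
    perm = rng.permutation(data)        # one randomly relabeled dataset
    diff = perm[:n_x].mean() - perm[n_x:].mean()   # O(N) summations per iteration
    exceed += abs(diff) >= abs(observed)

p_value = (exceed + 1) / (B + 1)        # add-one correction keeps the p-value valid
print(f"Monte Carlo permutation p-value: {p_value:.4f}")
```

Even this modest setting requires B passes over the full dataset, which is why resampling methods become expensive as $N$ grows.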

To speed up a computationally intensive method, one only needs to speed up the method's computational bottleneck. We are interested in performing Bayesian inference [18] based on a large vector of observations $\mathbf{x}$.
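As a rough illustration of the bottleneck argument, the sketch below (our own, not taken from the chapter) times the pieces of exact Gaussian process regression on synthetic one-dimensional data with an assumed squared-exponential kernel. Forming the kernel matrix costs $\mathcal{O}(N^2)$, while the factorization and solve step costs $\mathcal{O}(N^3)$ and dominates as $N$ grows; any worthwhile speedup therefore targets that step, for example by replacing an explicit inverse with a Cholesky solve or a low-rank approximation.

```python
import time
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data and test inputs (sizes are illustrative assumptions).
N, M = 2_000, 200
X = np.sort(rng.uniform(0, 10, size=N))
y = np.sin(X) + 0.1 * rng.standard_normal(N)
X_star = np.linspace(0, 10, M)

def sq_exp_kernel(a, b, length=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) evaluated on all pairs."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length) ** 2)

noise = 0.1 ** 2

t0 = time.perf_counter()
K = sq_exp_kernel(X, X) + noise * np.eye(N)   # O(N^2) kernel construction
K_star = sq_exp_kernel(X_star, X)             # O(M N)
t1 = time.perf_counter()

# Computational bottleneck: working with the N x N matrix costs O(N^3).
# A Cholesky factorization plus two solves avoids forming K^{-1} explicitly,
# but the cubic scaling remains; this is the step that approximation methods
# (low rank, sparsity, GPU acceleration) aim to speed up.
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
posterior_mean = K_star @ alpha               # predictive mean at the test inputs
t2 = time.perf_counter()

print(f"kernel construction: {t1 - t0:.3f} s, factorization/solve: {t2 - t1:.3f} s")
print("posterior mean at first five test points:", posterior_mean[:5])
```

Because every other step is comparatively cheap, accelerating only the factorization/solve step accelerates the whole fitting and prediction procedure.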