Group of authors

Computational Statistics in Data Science


large arrays and matrices, optimized for speed via a C implementation. The package features a dense, homogeneous array called ndarray, which provides computational efficiency and flexibility. Developers consider NumPy a low‐level tool because it offers only foundational functions. To provide richer features, other statistical libraries and packages build on NumPy.
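
      As a minimal sketch of what "dense, homogeneous" means in practice, the following creates a small ndarray and applies a vectorized operation to it. The array values and variable names are illustrative assumptions, not taken from the text.

```python
import numpy as np

# A dense, homogeneous array: every element shares a single dtype.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(a.shape)  # (2, 3)
print(a.dtype)  # float64

# Vectorized operations run in compiled code instead of a Python loop.
col_means = a.mean(axis=0)

# Broadcasting subtracts the column means from every row.
centered = a - col_means
```

      Because all elements share one type and sit in one contiguous block, operations such as the column centering above avoid per-element Python overhead.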

      One widely used higher level package, SciPy, employs NumPy to enable engineering and data science [6]. SciPy contains modules addressing standard problems in scientific computing, such as mathematical integration, linear algebra, optimization, statistics, clustering, image, and signal processing.
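
      To illustrate two of the problem classes listed above, the sketch below uses SciPy's integrate and optimize modules. The integrand and objective function are hypothetical examples chosen because their answers are known in closed form.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of exp(-x^2) over the real line
# equals sqrt(pi), which gives a known value to compare against.
value, abs_err = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf)

# Optimization: minimize a simple quadratic whose minimum lies at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)

print(value, np.sqrt(np.pi))
print(result.x, result.fun)
```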

      Another higher level Python package built upon NumPy, Pandas, is designed particularly for data analysis, providing standard models and cohesive frameworks [7]. Pandas implements a data type named DataFrame – a concept similar to the data.frame object in R. DataFrame's structure features efficient methods for data sorting, splicing, merging, grouping, and indexing. Pandas implements robust input/output tools – supporting flat files, Excel files, databases, and HDF files. Additionally, Pandas provides visualization methods via Matplotlib [9].
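
      The DataFrame operations mentioned above can be sketched on a toy data set. The column names and values here are invented for illustration only.

```python
import pandas as pd

# Two small tables sharing a "subject" key (hypothetical data).
trials = pd.DataFrame({
    "subject": [1, 2, 3, 4],
    "group": ["control", "treated", "control", "treated"],
    "score": [4.0, 7.0, 6.0, 9.0],
})
ages = pd.DataFrame({"subject": [1, 2, 3, 4], "age": [34, 51, 29, 45]})

# Merging: join the two tables on the shared key.
merged = trials.merge(ages, on="subject")

# Sorting: order rows by score, highest first.
by_score = merged.sort_values("score", ascending=False)

# Grouping: mean score within each group.
group_means = merged.groupby("group")["score"].mean()

# Indexing: select rows by a condition and columns by label.
older = merged.loc[merged["age"] > 40, ["subject", "score"]]
```

      Each step returns a new object, so the intermediate results can be inspected or chained as needed.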

      Lastly, the package Statsmodels facilitates data exploration, estimation, and statistical testing [8]. Built at an even higher level than the other packages discussed, Statsmodels employs NumPy, SciPy, Pandas, and Matplotlib. It provides many statistical models and tools, such as linear regression, generalized linear models, probability distributions, and time series. See http://www.statsmodels.org/stable/index.html for the full feature list.

      In addition to the four libraries discussed above, Python features numerous other packages tailored to particular tasks. For ML, the TensorFlow and PyTorch packages are widely used, and for Bayesian inference, Pyro and NumPyro are becoming popular (see more on these packages in Section 4). For big data computations, PySpark provides scalable tools to handle memory and computation time issues. For advanced data visualization, pyplot, seaborn, and plotnine may be worth adopting for a Python‐inclined data scientist.

      Python's easy‐to‐learn syntax, speed, and versatility make it a favorite among programmers. Moreover, the packages listed above transform Python into a well‐developed vehicle for data science. We see Python's popularity only increasing in the future. Some believe that Python will eventually eliminate the need for R. However, we feel that the immediate future lies in a Python + R paradigm. Thus, R users may well consider exploring what Python offers as the languages have complementary features.

      2.3 SAS®

      SAS was born during the late 1960s, within the Department of Experimental Statistics at North Carolina State University. As the software developed, the SAS Institute was formed in 1976. Since its infancy, SAS has evolved into an integrated system for data analysis and exploration. The SAS system has been used in numerous business areas and academic institutions worldwide.

      Recently, SAS's popularity has diminished [4]; yet, it remains widely used. Open‐source competitors threaten SAS's previous overall market dominance. Rather than seeing it disappear completely, we expect SAS to become a niche product in the future. For now, however, SAS expertise remains desired in certain roles and industries.

      2.4 SPSS®

      Norman H. Nie, C. Hadlai (Tex) Hull, and Dale H. Bent developed SPSS in the late 1960s. The trio were Stanford University graduate students at the time. SPSS was founded in 1968 and incorporated in 1975. SPSS became publicly traded in 1993. Now, IBM owns the rights to SPSS. Originally, developers designed SPSS for mainframe use. In 1984, SPSS introduced SPSS/PCplus for computers running MS‐DOS, followed by a UNIX release in 1988 and a Macintosh version in 1990. SPSS features an intuitive point‐and‐click interface. This design empowers a broad user base to conduct standard analyses.

      SPSS features a wide variety of analytic capabilities, including regression, classification trees, table creation, exact tests, categorical analysis, trend analysis, conjoint analysis, missing value analysis, map‐based analysis, and complex samples analysis. In addition, SPSS supports numerous stand‐alone products including Amos™ (a structural equation modeling package), SPSS Text Analysis for Surveys™ (a survey analysis package utilizing natural language processing (NLP) methodology), SPSS Data Entry™ (a web‐based data entry package; see Web Based Data Management in Clinical Trials), AnswerTree® (a market segment targeting package), SmartViewer® Web Server™ (a report‐generation and dissemination package), SamplePower® (a sample size calculation package), DecisionTime® and What if?™ (a scenario analysis package for the nonspecialist), SmartViewer® for Windows (a graph/report sharing utility), SPSS WebApp Framework (a web‐based analytics package), and the Dimensions Development Library (a data capture library).

      SPSS remains popular, especially in scholarly work [4]. For many researchers who apply standard models, SPSS gets the job done. We see SPSS remaining a useful tool for practitioners across many fields.

      Next, we discuss noteworthy statistical software, aiming to provide essential details for a fairly complete survey of the most commonly used statistical software and related tools.

      3.1 BUGS/JAGS

      JAGS (Just Another Gibbs Sampler) [11] was developed as a cross‐platform engine for the BUGS modeling language. A secondary goal was to provide extensibility, allowing user‐specific functions, distributions, and sampling algorithms. The BUGS/JAGS approach to specifying probabilistic models has become standard in other related software (e.g., NIMBLE). Both BUGS and JAGS are still widely used and are well suited for tasks of small‐to‐medium complexity. However, for highly complex models and big data problems, similar but more powerful Bayesian inference engines are emerging, for example, Stan and Pyro (see Section 4 for more details).

      3.2 C++

      C++ is a general‐purpose, high‐performance programming language. Unlike scripting languages for statistics such as R and Python, C++ is a compiled language – adding complexity (such as memory management) and strict syntax requirements. As such, C++'s design may complicate prototyping. Thus, data scientists typically turn to C++ to optimize/scale a developed algorithm at the production level.

      C