Group of authors

Computational Statistics in Data Science



C++'s standard libraries lack many mathematical and statistical operations. However, since C++ can be compiled cross‐platform, developers often interface C++ functions from different languages (e.g., R and Python). Thus, C++ can be used to develop libraries across languages, offering impressive computing performance.
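      As a rough illustration of the interfacing idea, the sketch below exposes a C++ routine through a C‐compatible entry point, the kind of symbol that foreign‐function mechanisms in other languages can typically bind to. The function name `sample_mean` and its signature are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>

// Hypothetical example function (not taken from any library): the arithmetic
// mean of n doubles. extern "C" disables C++ name mangling, so the compiled
// symbol is literally "sample_mean" and can be looked up from other languages.
extern "C" double sample_mean(const double* x, std::size_t n) {
    assert(x != nullptr && n > 0);  // precondition: non-empty input
    // The body is ordinary C++ and is free to use the standard library.
    return std::accumulate(x, x + n, 0.0) / static_cast<double>(n);
}
```

      Built as a shared library, a function like this could be loaded from Python through its ctypes module, for instance; packages such as Rcpp play a comparable role for R.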

      To enable analysis, developers created mathematical and statistical libraries in C++. These packages often employ BLAS (basic linear algebra subprograms) libraries, which are written in C/Fortran and offer numerous low‐level, high‐performance linear algebra operations on numbers, vectors, and matrices. Some popular BLAS‐compatible libraries include Intel Math Kernel Library (MKL) [12], automatically tuned linear algebra software (ATLAS) [13], OpenBLAS [14], and linear algebra package (LAPACK) [15].
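      To make the phrase "operations on numbers, vectors, and matrices" concrete, the sketch below writes out naive reference versions of one representative operation from each of the three BLAS levels: a scaled vector addition (level 1), a matrix–vector product (level 2), and a matrix–matrix product (level 3). The function names, signatures, and row‐major storage are assumptions made for this illustration; they are not the BLAS API, and real implementations such as OpenBLAS or MKL are far more heavily optimized.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Level 1 (vector-vector): y <- alpha * x + y
void axpy(double alpha, const Vec& x, Vec& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
}

// Level 2 (matrix-vector): returns A * x, where A is m-by-n, stored row-major
Vec gemv(const Vec& A, std::size_t m, std::size_t n, const Vec& x) {
    assert(A.size() == m * n && x.size() == n);
    Vec y(m, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
    return y;
}

// Level 3 (matrix-matrix): returns A * B, where A is m-by-k and B is k-by-n
Vec gemm(const Vec& A, const Vec& B,
         std::size_t m, std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n);
    Vec C(m * n, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)      // i-p-j order walks B and C by rows
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
    return C;
}
```

      Higher‐level libraries can delegate exactly these kinds of kernels to a tuned BLAS backend rather than running loops like the ones above.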

      Among the C++ libraries for mathematics and statistics built on top of BLAS, we detail three popular, well‐maintained libraries: Eigen [16], Armadillo [17], and Blaze [18] below:

      Eigen is a high‐level, header‐only library developed by Guennebaud et al. [16]. Eigen provides classes dealing with vector types, arrays, and dense/sparse/large matrices. It also supports matrix decomposition and geometric features. Eigen uses explicit single instruction, multiple data (SIMD) vectorization and, for fixed‐size matrices, avoids dynamic memory allocation. Eigen also implements extra features to optimize computing performance, including loop unrolling and processor‐cache utilization. Eigen itself does not take much advantage of parallel hardware, currently supporting parallel processing only for general matrix–matrix products. However, since Eigen can use BLAS‐compatible libraries, users can utilize external BLAS libraries in conjunction with Eigen for parallel computing. Python and R users can call Eigen functions using the minieigen and RcppEigen packages, respectively.

      Blaze is a high‐performance math library for dense/sparse arithmetic developed by Iglberger et al. [18]. Blaze extensively uses LAPACK functions for various computing tasks, such as matrix decomposition and inversion, providing high‐performance computing. Blaze supports High Performance ParalleX (HPX) [20] and OpenMP to enable parallel computing.
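      For readers unfamiliar with OpenMP, the following minimal sketch shows the style of shared‐memory parallelism it offers, using a dot product as the example. This is plain C++ with an OpenMP directive and is not Blaze's API; the function name `parallel_dot` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product with an OpenMP parallel reduction. Each thread accumulates a
// private partial sum, and OpenMP combines the partial sums when the loop
// ends. If OpenMP is not enabled at compile time, the pragma is ignored and
// the loop simply runs serially.
double parallel_dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

      With GCC or Clang, OpenMP is typically switched on with the `-fopenmp` compiler flag. Because floating‐point addition is not associative, a parallel sum may differ from the serial one in the last digits for general inputs.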

      The difficulty of developing C++ programs limits the language's use as a primary statistical software package. Yet, C++ appeals when a fast, production‐quality program is desired. Therefore, R and Python developers may find C++ knowledge beneficial for optimizing their code prior to distribution. We see C/C++ as the standard for speed and, as such, an attractive tool for big data problems.

      3.3 Microsoft Excel/Spreadsheets

      Much statistical work today involves the use of Microsoft Excel and other spreadsheet‐style applications (Google Sheets, Apple Numbers, etc.). A spreadsheet application provides a simple and interactive way to collect data, which appeals for any manual data entry process. The sheets are easy to share, both through traditional file sharing (e.g., e‐mail attachments) and cloud‐based solutions (Google Drive, Dropbox, etc.). Simple numeric summaries and plots are easy to construct. More advanced macros/scripts are possible, but most data scientists would prefer to switch to a more full‐featured environment (such as R or Python) for such work. Yet, as nearly all computer workers have some level of familiarity with spreadsheets, spreadsheets remain hugely popular and ubiquitous in organizations. Thus, we wager that spreadsheet applications will likely always be involved in statistical software and posit they can be quite efficient for appropriate tasks.

      3.4 Git

      Very briefly, we mention Git, a free and open‐source distributed version control system (https://git-scm.com/). As the complexities of modern data science workflows increase, statistical programmers are increasingly reliant on some type of version control system, with Git being the most popular. Git allows for a branching scheme to foster experimentation in projects and to converge to a final product. By compiling a complete history of a project, Git provides transparent data analyses for reproducible research. Further, projects and software can be shared easily via web‐based repositories, such as GitHub (https://github.com/).

      3.5 Java

      Developers may prefer Java for intensive calculations that perform slowly in scripting languages (e.g., R). For speed‐up purposes, Java's cross‐platform design could even be preferred to C/C++ in certain cases. Alternatively, Java code can be wrapped in an R package for faster processing. For example, the rJava package allows one to call Java code in an R script and vice versa (calling R functions in Java). On the other hand, Java can be used independently for statistical analysis, thanks to a good set of statistical libraries.

      Popular sources of native Java statistical and mathematical functionalities are the JSC (Java Statistical Classes) and Apache Commons Math application programming interfaces (APIs) (http://commons.apache.org/proper/commons-math/). The JSC and Apache Commons Math libraries implement many methods, including univariate statistics, parametric and nonparametric tests (t‐test, chi‐square test, and Wilcoxon test), random number generation, random sampling/resampling, regression, correlation, linear or stochastic optimization, and clustering.

      Additionally, Java boasts an extensive number of machine‐learning packages and big data capabilities. For example, the WEKA tool [21], the JSAT library [22], and the TensorFlow framework [23] are all available from Java. Moreover, Java gives access to one of the most desired and useful big data analysis tools, Apache Spark [24]. Spark provides ML support through modules in the Spark MLlib library [25].

      As with other discussed software, Java APIs often require importing other packages/libraries. For example, developers commonly