
Computational Statistics in Data Science

C++'s standard libraries lack many mathematical and statistical operations. However, since C++ can be compiled cross-platform, developers often interface C++ functions from different languages (e.g., R and Python). Thus, C++ can be used to develop libraries across languages, offering impressive computing performance.
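As a rough illustration of how such cross-language interfacing is usually arranged, the sketch below keeps the implementation in ordinary C++ and exposes it through a thin wrapper with C linkage. The names `mean_impl` and `sample_mean` are hypothetical and chosen only for this example, as is the convention of returning 0 for empty input. The wrapper is assumed to be compiled into a shared library and loaded by a foreign-function interface such as Python's ctypes module; that build and loading step is not shown.

```cpp
#include <numeric>

// Internal C++ implementation: free to use templates and the standard library.
template <typename It>
double mean_impl(It first, It last) {
    const auto n = last - first;
    return n > 0 ? std::accumulate(first, last, 0.0) / static_cast<double>(n)
                 : 0.0;
}

// C-linkage wrapper (hypothetical name). extern "C" disables C++ name
// mangling, and the signature uses only plain C types, so tools written
// for C libraries can locate and call the function.
extern "C" double sample_mean(const double* x, int n) {
    if (x == nullptr || n <= 0) return 0.0;  // illustrative convention for empty input
    return mean_impl(x, x + n);
}
```

Keeping the exported signature restricted to pointers and built-in numeric types is the design choice that matters here: the calling language never needs to understand C++ classes or templates.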

To enable analysis, developers created mathematical and statistical libraries in C++. These packages often employ BLAS (basic linear algebra subprograms) libraries, which are written in C/Fortran and offer numerous low-level, high-performance linear algebra operations on numbers, vectors, and matrices. Some popular BLAS-compatible libraries include Intel Math Kernel Library (MKL) [12], automatically tuned linear algebra software (ATLAS) [13], OpenBLAS [14], and linear algebra package (LAPACK) [15].
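To make the kinds of operations concrete, the following minimal sketch shows, in plain C++, one operation from each of the three conventional BLAS levels: vector-vector, matrix-vector, and matrix-matrix. The function names echo the reference routines (axpy, gemv, gemm), but the signatures are simplified assumptions for illustration and are not the actual BLAS interface, which works on raw pointers with stride and layout arguments. Tuned libraries compute the same mathematics with far more optimized loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Level 1 (vector-vector): y <- alpha * x + y
void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
}

// Level 2 (matrix-vector): returns A * x, where A is rows x cols, row-major.
std::vector<double> gemv(const std::vector<double>& A, std::size_t rows,
                         std::size_t cols, const std::vector<double>& x) {
    assert(A.size() == rows * cols && x.size() == cols);
    std::vector<double> y(rows, 0.0);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            y[i] += A[i * cols + j] * x[j];
    return y;
}

// Level 3 (matrix-matrix): returns A * B, where A is m x k and B is k x n,
// both row-major.
std::vector<double> gemm(const std::vector<double>& A,
                         const std::vector<double>& B,
                         std::size_t m, std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n);
    std::vector<double> C(m * n, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
    return C;
}
```

Higher-level libraries typically express their dense kernels in terms of operations like these, which is why swapping in a faster BLAS implementation can speed up code without changing it.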

Among the C++ libraries for mathematics and statistics built on top of BLAS, we detail three popular, well-maintained libraries below: Eigen [16], Armadillo [17], and Blaze [18].

Eigen is a high-level, header-only library developed by Guennebaud et al. [16]. Eigen provides classes dealing with vector types, arrays, and dense/sparse/large matrices. It also supports matrix decomposition and geometric features. Eigen uses single instruction, multiple data (SIMD) vectorization and, for small fixed-size objects, avoids dynamic memory allocation. Eigen also implements extra features to optimize computing performance, including loop unrolling and processor-cache utilization. Eigen itself does not take much advantage of parallel hardware, currently supporting parallel processing only for general matrix–matrix products. However, since Eigen works with BLAS-compatible libraries, users can combine external BLAS libraries with Eigen for parallel computing. Python and R users can call Eigen functions using the minieigen and RcppEigen packages.
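The link between compile-time sizes, avoided heap allocation, and unrolling can be sketched without Eigen itself. In the example below, `FixedVec` and `dot` are hypothetical names, not Eigen classes; Eigen's own fixed-size types (for instance `Eigen::Matrix<double, 3, 1>`) rest on the same idea, although their implementation is considerably more elaborate.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size vector: the length N is a template parameter, so
// the elements live inside the object itself and no heap allocation occurs.
template <std::size_t N>
struct FixedVec {
    std::array<double, N> v;
};

// Because N is known at compile time, the compiler is free to unroll this
// loop completely for small N.
template <std::size_t N>
double dot(const FixedVec<N>& a, const FixedVec<N>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < N; ++i) s += a.v[i] * b.v[i];
    return s;
}
```

Whether the loop is actually unrolled or vectorized depends on the compiler and optimization flags; the sketch only shows why a compile-time size makes those optimizations possible.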

National ICT Australia (NICTA) developed the open-source library Armadillo to facilitate science and engineering [17]. Armadillo provides a fast, easy-to-use matrix library with MATLAB-like syntax. Armadillo employs template meta-programming techniques to avoid unnecessary operations and increase library performance. Further, Armadillo supports 3D objects and provides numerous utilities for matrix manipulation and decomposition. Armadillo automatically utilizes open multiprocessing (OpenMP) [19] to increase speed. Developers designed Armadillo to provide a balance between speed and ease of use. Armadillo is widely used in ML, pattern recognition, signal processing, and bioinformatics. R users may call Armadillo functions through the RcppArmadillo package.
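One common form of template meta-programming used to avoid unnecessary operations is the expression template, in which an arithmetic operator returns a lightweight description of the computation instead of a result. The sketch below is a deliberately small illustration of that general technique; the `Vec` and `Sum` types are hypothetical and do not reproduce Armadillo's classes or internals.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lazy node describing the element-wise sum of two operands.
// It stores references only; nothing is computed until an element is read.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double value = 0.0) : data(n, value) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from an expression evaluates the whole chain in one loop,
    // without allocating intermediate vectors.
    template <typename L, typename R>
    Vec& operator=(const Sum<L, R>& e) {
        assert(e.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// operator+ only builds a node; no arithmetic happens here.
inline Sum<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

template <typename L, typename R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b) {
    return {a, b};
}
```

With these definitions, a statement such as `r = a + b + c;` runs a single pass over the data. Because the nodes hold references to temporaries, an expression should be assigned in the same statement that builds it and not stored in an `auto` variable for later use.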

Blaze is a high-performance math library for dense/sparse arithmetic developed by Iglberger et al. [18]. Blaze makes extensive use of LAPACK functions for computing tasks such as matrix decomposition and inversion. Blaze supports High Performance ParalleX (HPX) [20] and OpenMP to enable parallel computing.

The difficulty of developing C++ programs limits the language's use as a primary statistical software package. Yet C++ appeals when a fast, production-quality program is desired. Therefore, R and Python developers may find C++ knowledge beneficial for optimizing their code prior to distribution. We see C/C++ as the standard for speed and, as such, an attractive tool for big data problems.

3.3 Microsoft Excel/Spreadsheets

Much statistical work today involves Microsoft Excel and other spreadsheet-style applications (Google Sheets, Apple Numbers, etc.). A spreadsheet application provides a simple, interactive way to collect data, which appeals for any manual data entry process. The sheets are easy to share, both through traditional file sharing (e.g., e-mail attachments) and cloud-based solutions (Google Drive, Dropbox, etc.). Simple numeric summaries and plots are easy to construct. More advanced macros/scripts are possible, although most data scientists would prefer to switch to a more full-featured environment (such as R or Python). Still, because nearly all computer workers have some familiarity with spreadsheets, they remain hugely popular and ubiquitous in organizations. Thus, we wager that spreadsheet applications will likely always be involved in statistical software and posit that they can be quite efficient for appropriate tasks.

3.4 Git

Very briefly, we mention Git, a free and open-source distributed version control system (https://git-scm.com/). As the complexity of modern data science workflows increases, statistical programmers rely increasingly on some type of version control system, with Git being the most popular. Git's branching scheme fosters experimentation within a project while still allowing the work to converge to a final product. By recording a complete history of a project, Git makes data analyses transparent and supports reproducible research. Further, projects and software can be shared easily via web-based repositories, such as GitHub (https://github.com/).

3.5 Java

Java is one of the most popular programming languages (according to the TIOBE index, www.tiobe.com/tiobe-index/), partially due to its extensive library ecosystem. Java's design seduces programmers: it is simple, object oriented, and portable. Java applications run on any machine, from personal laptops to high-performance supercomputers, and even on game consoles and internet of things (IoT) devices. Notably, Android (based on Java) development has driven recent Java innovations. Java's "write once, run anywhere" adage provides versatility, triggering interest even at the research level.

Developers may prefer Java for intensive calculations that run slowly in scripting languages (e.g., R). For speed-up purposes, Java's cross-platform design could even be preferred to C/C++ in certain cases. Alternatively, Java code can be wrapped in an R package for faster processing. For example, the rJava package allows one to call Java code from an R script and, conversely, to call R functions from Java. Java can also be used independently for statistical analysis, thanks to a good set of statistical libraries.

Popular sources of native Java statistical and mathematical functionality are the JSC (Java Statistical Classes) and Apache Commons Math application programming interfaces (APIs) (http://commons.apache.org/proper/commons-math/). The JSC and Apache Commons Math libraries implement many methods, including univariate statistics, parametric and nonparametric tests (t-test, chi-square test, and Wilcoxon test), random number generation, random sampling/resampling, regression, correlation, linear or stochastic optimization, and clustering.

Additionally, Java boasts an extensive number of machine-learning packages and big data capabilities. For example, Java supports the WEKA [21] tool, the JSAT library [22], and the TensorFlow framework [23]. Moreover, Java provides access to one of the most sought-after big data analysis tools, Apache Spark [24]. Spark provides ML support through modules in the Spark MLlib library [25].

As with other discussed software, Java APIs often require importing other packages/libraries. For example, developers commonly
