Computational Statistics in Data Science

(Java matrix package, https://math.nist.gov/javanumerics/jama/) or EJML (efficient Java matrix library, http://ejml.org/wiki/). Such packages support routine computation, for example, matrix decomposition and dense/sparse matrix calculation. JFreeChart enables data visualization by generating scatter plots, histograms, bar plots, and so on. More recently, these Java libraries have been giving way to more popular JavaScript libraries such as Plot.ly (https://plot.ly/), Bokeh (bokeh.pydata.org), D3 [26], or Highcharts (www.highcharts.com).

As outlined above, Java can serve as a useful statistical software solution, especially for developers who are already familiar with it or who are interested in cross‐platform development. We therefore recommend it for seasoned programmers looking to add some statistical punch to their desktop, web, and mobile apps. For the analysis of big data, Java offers some of the best ML tools available.

3.6 JavaScript, TypeScript

JavaScript is one of the most popular programming languages, outpacing even Java and Python. It is fully featured, flexible, and fast, which explains its broad appeal. JavaScript excels at visualization through D3.js, and it even supports interactive, browser‐based ML via TensorFlow.js. For real‐time data collection and analysis, JavaScript provides streaming tools through MongoDB. JavaScript's unsurpassed popularity alone makes it worth a look, especially for analysts tasked with a complex real‐time data analytic challenge across heterogeneous architectures.

3.7 Maple

Maple is “math software that combines the world's most powerful math engine with an interface that makes it extremely easy to analyze, explore, visualize, and solve mathematical problems” (https://www.maplesoft.com/products/Maple/). While not specifically a statistical software package, Maple's computer algebra system is a handy supplement to an analyst's toolkit. In statistical computing, a user often employs Maple to check a hand calculation or to reduce the workload and error rate in lengthy derivations. Moreover, Maple offers add‐on packages for statistics, calculus, analysis, linear algebra, and more. One can even create interactive plots and animations. In sum, Maple is a solid choice for a computer algebra system to aid in statistical computing.

3.8 MATLAB, GNU Octave

MATLAB began as FORTRAN subroutines for solving linear (LINPACK) and eigenvalue (EISPACK) problems. Cleve Moler developed most of the subroutines in the 1970s for use in the classroom. MATLAB quickly gained popularity, primarily through word of mouth. Developers rewrote MATLAB in C during the 1980s, adding speed and functionality. The parent company of MATLAB, The MathWorks, Inc., was created in 1984, and MATLAB has since become a fully featured tool that is often used in engineering and developer fields where integration with sensors and controls is a primary concern.

MATLAB has a substantial user base in government, academia, and the private sector. The MATLAB base distribution allows reading/writing data in ASCII, binary, and MATLAB proprietary formats. The data are presented to the user as an array, the MATLAB generic term for a matrix. The base distribution comes with a standard set of mathematical functions including trigonometric, inverse trigonometric, hyperbolic, inverse hyperbolic, exponential, and logarithmic. In addition, MATLAB provides the user with access to cell arrays, which allow heterogeneous data across the cells and are created in a way analogous to a C/C++ struct. MATLAB also provides the user with numerical methods, including optimization and quadrature functions.

A highly similar yet free and open‐source programming language is GNU Octave. Octave offers many if not all features of the core MATLAB distribution, although MATLAB has many add‐on packages for which Octave has no equivalent, and that may prompt a user to choose MATLAB over Octave. We caution analysts against using MATLAB/Octave as their primary statistical computing solution, as MATLAB's popularity is diminishing [4], likely due to open‐source, more fully featured competitors such as R and Python.

3.9 Minitab®

Barbara F. Ryan, Thomas A. Ryan, Jr., and Brian L. Joiner created Minitab in 1972 at the Pennsylvania State University to teach statistics. Now, Minitab Inc. owns the proprietary software. Academia and industry widely employ Minitab [4]. The intuitive point‐and‐click design and spreadsheet‐like interface let users analyze data with little training. Minitab feels like Excel, but with many more advanced features, so its learning curve is far gentler than that of more flexible programming environments.

Minitab offers import tools and a comprehensive set of statistical capabilities. Minitab's features include basic statistics, ANOVA, fixed and mixed models, regression analyses, measurement systems analysis, and graphics including contour and rotating 3D plots. A full feature list resides at http://www.minitab.com/en‐us/products/minitab/features‐list/. For advanced users, a command‐line editor exists. Within the editor, users may customize macros (functions).

Minitab serves its user base well and will continue to be viable in the future. For teaching academics, Minitab provides near immediate access to many statistical methods and graphics. For industry, Minitab offers tools to produce standardized analyses and reports with little training. However, Minitab's flexibility and big data capabilities are limited.

3.10 Workload Managers: SLURM/LSF

Working on shared computing clusters has become commonplace in contemporary data science applications. Some working knowledge of workload managing programs (aka schedulers) is essential to running statistical software in these environments. Two popular workload managers for distributed high‐performance computing are SLURM (https://slurm.schedmd.com/documentation.html) and IBM's platform load sharing facility (LSF). These schedulers can be used to execute batch jobs on networked Unix and Windows systems on many different architectures. A user typically interfaces with a scheduling program via a command‐line tool or through a scripting language. The user specifies the hardware resources and program inputs. The scheduler then distributes the work across resources, and jobs are run based on system‐prioritization schemes. In this way, hundreds or even thousands of programs can be run in parallel, increasing the scale of statistical computations possible within a reasonable time frame. For example, simulations for a novel statistical method could require many thousands of runs at various configurations, and this could be done in days rather than months.
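
As a rough illustration of running a simulation study at many configurations, the following Python sketch maps a scheduler-supplied task index to one configuration. It assumes the study is submitted as a SLURM job array, in which each task receives its index in the SLURM_ARRAY_TASK_ID environment variable. The parameter grid and the function names (build_grid, config_for_task) are hypothetical and chosen only for illustration.

```python
import itertools
import os

# Hypothetical simulation grid: these parameter names and values are
# illustrative assumptions, not taken from the text.
SAMPLE_SIZES = [50, 100, 500]
EFFECT_SIZES = [0.0, 0.2, 0.5]
N_REPLICATES = 4


def build_grid():
    """Return every (sample_size, effect_size, replicate) combination."""
    return list(
        itertools.product(SAMPLE_SIZES, EFFECT_SIZES, range(N_REPLICATES))
    )


def config_for_task(task_id, grid):
    """Map a zero-based array task index to one configuration."""
    if not 0 <= task_id < len(grid):
        raise IndexError(f"task_id {task_id} is outside 0..{len(grid) - 1}")
    sample_size, effect_size, replicate = grid[task_id]
    return {
        "sample_size": sample_size,
        "effect_size": effect_size,
        "replicate": replicate,
    }


def main():
    grid = build_grid()
    # SLURM sets SLURM_ARRAY_TASK_ID for each task of a job array;
    # fall back to 0 so the script also runs outside a cluster.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    config = config_for_task(task_id, grid)
    print(f"task {task_id} of {len(grid)}: {config}")


if __name__ == "__main__":
    main()
```

With this layout, one job array covering indices 0 through 35 (for instance, submitted with sbatch's --array option) would run each configuration as a separate task, and the scheduler would decide when and where each task executes.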

3.11 SQL

Structured Query Language (SQL) is the standard language for relational database management systems. While not strictly a statistical computing environment, the ability to query databases through SQL is an essential skill for data scientists. Nearly all companies seeking a data scientist require SQL knowledge, as much of an analyst's job is extracting, transforming, and loading data from an established relational database.
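
To make the extract-and-transform step concrete, here is a minimal sketch that uses Python's built-in sqlite3 module with an in-memory database. The table name (measurements), its columns, and the data are hypothetical. A production database would typically be a server-based system accessed through its own driver, but the SQL itself would look much the same.

```python
import sqlite3

# In-memory database with a hypothetical "measurements" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (site TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements (site, value) VALUES (?, ?)",
    [("A", 1.0), ("A", 3.0), ("B", 10.0), ("B", 20.0), ("B", 30.0)],
)

# Extract and transform: per-site count and mean, computed by the database.
rows = conn.execute(
    "SELECT site, COUNT(*) AS n, AVG(value) AS mean_value "
    "FROM measurements GROUP BY site ORDER BY site"
).fetchall()
conn.close()

for site, n, mean_value in rows:
    print(site, n, mean_value)
```

Pushing the aggregation into the query means that only the summarized rows leave the database, which matters when the underlying table is far larger than what an analysis environment can comfortably hold in memory.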

3.12 Stata®

Stata is commercial statistical software, developed by William Gould in 1985. StataCorp currently owns and develops Stata and markets the product as “fast, accurate, and easy to use with both a point‐and‐click interface and a powerful, intuitive command syntax” (https://www.stata.com/). However, most Stata users maintain the point‐and‐click workflow. Stata strives to provide user confidence through regulatory certification.

Stata provides hundreds of tools across broad applications and methods. Even Bayesian modeling and maximum‐likelihood