Group of authors

Computational Statistics in Data Science



works
Tableau   N   Notable     GUI: menu, dialogs   Popular for business analytics
Julia     Y   Promising   Programming          Speedy, underdeveloped
Scala     Y   Promising   Programming          Typed version of Java, less boilerplate code

Software          Virtual environment   Multiple languages   Remote integration   Notes
Emacs, Vim        N                     Y                    Y                    Extensible, steep learning curve
Jupyter project   Y                     Y                    Y                    Open source, interactive data science
RStudio           Y                     Y                    Y                    Excellent at creating reproducible reports/docs

      1.1 Extensible Text Editors: Emacs and Vim

      For statistical computing in Emacs, we note the excellent add‐on package Emacs Speaks Statistics (ESS), which offers a unified user interface for R, S‐Plus, SAS, Stata, and OpenBUGS/JAGS, among other popular statistical packages. An easy‐to‐use package manager provides quick ESS installation. Once ESS is installed, a basic workflow is to open a file of an associated type (.R, .Rmarkdown, etc.), which triggers ESS mode. In ESS mode, code is highlighted, tab completion is enabled for rapid code generation and editing, and help documentation is integrated. Code can be evaluated interactively in separate processes (e.g., one or several R sessions) or run noninteractively through shell processes displayed in Emacs. Statistical visualizations appear in separate windows for easy plot development. As mentioned above, one can work seamlessly on remote servers (using TRAMP mode), which greatly reduces the inefficiencies inherent in switching between local and remote machines.

      We also mention another popular extensible text editor, Vim (https://www.vim.org/). Vim offers many of the same benefits as Emacs. There is a long‐running debate over which of the two is superior. We avoid this discussion here and simply admit that the first author is an Emacs user, which explains the focus above. This is not a vote of confidence in Emacs over Vim but simply a reflection of familiarity.

      1.2 Jupyter Notebooks

      The Jupyter Project is an effort to develop open‐source software and services for interactive computing across a variety of popular programming languages such as Python, R, Julia, and C++. The interactive environment is based on notebooks, which contain text cells and code cells. Text cells can mix plain text and Markdown, and they render LaTeX through the MathJax engine. Code cells can be run, modified, and rerun in any order. This functionality makes it easy to perform data analyses and document your work as you go.
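      To make the notebook structure concrete, the following is a minimal sketch, not taken from the original text, that builds a two‐cell notebook as a plain Python dictionary and serializes it to JSON. It assumes the version 4 notebook file layout (keys such as cell_type, source, and nbformat); the helper names make_notebook, text_cell, and code_cell are hypothetical and exist only for this illustration.

```python
import json


def make_notebook(cells):
    """Wrap a list of cell dicts in a minimal notebook structure (assumed format version 4)."""
    return {
        "cells": cells,
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def text_cell(source):
    """A Markdown (text) cell; LaTeX between $...$ is rendered by the notebook front end."""
    return {"cell_type": "markdown", "metadata": {}, "source": source}


def code_cell(source):
    """An unexecuted code cell: no outputs and no execution count yet."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


notebook = make_notebook([
    text_cell("# Sample mean\nThe estimator is $\\bar{x} = \\frac{1}{n}\\sum_i x_i$."),
    code_cell("x = [1.0, 2.0, 3.0]\nsum(x) / len(x)"),
])

# Serialize to the JSON text that an .ipynb file would hold.
notebook_json = json.dumps(notebook, indent=1)

# Read it back and list the cell types in document order.
cell_types = [cell["cell_type"] for cell in json.loads(notebook_json)["cells"]]
print(cell_types)
```

      Writing notebook_json to a file with the .ipynb extension should yield a document that a Jupyter front end can open. The text cell carries the narrative and the formula, while the code cell stays unexecuted until it is run.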

      The Jupyter IDE (integrated development environment) runs locally in a web browser and can be configured for remote and multiuser workflows. Because reproducible data science is a core goal of the Jupyter Project, notebooks can be exported and shared online as interactive documents or as static HTML or PDF documents. Services such as mybinder.org let a user run notebooks online, so that an analysis is readily reproducible by anyone.

      1.3 RStudio and Rmarkdown

      RStudio is an organization that develops free and enterprise‐ready tools for working with the R language. Its IDE (also called RStudio) integrates the R console, file browser, script editor, and more in one unified user interface. Through the use of project‐associated directories and files, entire projects are nearly self‐contained and easily shared across different systems.

      With introductory matters behind us, we now turn to the most popular statistical computing languages. We begin with R, our preferred statistical programming language. This preference leads to an unbalanced discussion relative to the other most popular statistical software (Python, SAS, and SPSS); nevertheless, we hope to provide objective recommendations despite the unequal coverage.

      2.1 R

      R [1] began at the University of Auckland, New Zealand, in the early 1990s. Ross Ihaka and Robert Gentleman needed a statistical environment to use in their teaching lab. At the time, their computer labs featured only Macintosh computers that lacked suitable software. Ihaka and Gentleman decided to implement a language based on an S‐like syntax [2]. R's initial versions were provided to Statlib at Carnegie Mellon University, and the user feedback indicated a positive reception.

      R's success encouraged its release under the Open Source Initiative (https://opensource.org/). Developers released the first version in June 1995. A software system under the open‐source paradigm benefits from having “many pairs of eyes to develop the software.” R developed a huge following, and it soon became difficult for the developers to maintain. As a response, a 10‐member core group was formed in 1997. The core team handles any changes to the R source code. The massive R community provides support via online mailing lists (https://www.r-project.org/mail.html)