Computational Statistics in Data Science

computing forums – such as Talk Stats (http://www.talkstats.com/), Cross Validated (https://stats.stackexchange.com/), and Stack Overflow (https://stackoverflow.com/). Often users receive responses within a matter of minutes.

Since humble beginnings, R has developed into a popular, complete, and flexible statistical computing environment that is appreciated by academia, industry, and government. R's main benefits include support on all major operating systems and comprehensive package archives. Further, R integrates well with document formats (such as LaTeX (https://www.latex-project.org/), HTML, and Microsoft Word) through R Markdown (https://rmarkdown.rstudio.com/) and other file formats to enhance literate programming and reproducible data analysis.

R provides extensive statistical capacity. Nearly any method is available as an R package – the trick is locating the software. The base package and the packages included by default perform most standard analyses and computations. If the included packages are insufficient, one can turn to CRAN (the Comprehensive R Archive Network), which houses nearly 13 000 packages (visit https://cran.r-project.org/ for more information). To help navigate CRAN, “CRAN Task Views” organizes packages into convenient topics (https://cran.r-project.org/web/views/). For bioinformatics, over 1500 packages reside on Bioconductor [3]. Developers also distribute their packages via Git repository hosts such as GitHub (https://github.com/). For easy retrieval from GitHub, the devtools package allows direct installation.

2.1.1 Why use R over Python or Minitab?

R is tailored to working with data and performing statistical analysis in a way that is more consistent and extensible than Python. The syntax for accessing data in lists and data frames is convenient, with tab completion showing which elements an object contains. Creating documents, reports, notebooks, presentations, and web pages is possible through R Markdown and RStudio.

Through the use of the metapackage tidyverse or the package data.table, working with tabular data is direct, efficient, and intuitive. Because R is a scripting language, reproducible workflows are possible, and steps in the process of extracting and transforming data are easy to go back and modify without disrupting the analysis. While this is a virtue shared among all scripting languages, reproducible results and modular code save time compared to a point‐and‐click interface like that of Excel or Minitab.

2.1.2 Where can users find R support?

R has a large online support community as well as documentation built into the software. Most libraries provide documentation and examples for their functions and objects, which can be accessed with the ? operator at the command line (e.g., type ?glm for help about fitting a generalized linear model). These help documents are displayed directly in the console or, if using RStudio, in the help panel with extra links to related functions. For more in‐depth documentation, some developers provide vignettes for their packages. Vignettes are long‐form documents that demonstrate how to use the functionality in the package and tie it together with a working example.

The online R community is lively, and the people are often helpful. Searching for any question about R or its packages will often lead you to a post on Stack Overflow (https://stackoverflow.com/) or Reddit (either r/rstats or r/RStudio). There is also the RStudio Community (https://community.rstudio.com/), where you can ask questions about features specific to the IDE. It is rare to encounter an R programming challenge that has not been addressed somewhere online and, in that case, a well‐posed question posted on such forums is quickly answered. Twitter also has an active community of developers who sometimes respond directly (such as #rstudio or @hadleywickham).

2.1.3 How easy is R to develop?

Developing packages and analyses in R is becoming steadily easier. This is largely due to the efforts of RStudio, which releases slick new tools and support software on a regular basis. Their tools “combine robust and reproducible data analysis with tools to effectively share data products.” One package that integrates well with RStudio is devtools, written by Dr Hadley Wickham, the chief scientist at RStudio. devtools provides a plethora of tools to create, test, and export R packages. devtools has grown so comprehensive that developers have split the project into several smaller packages such as testthat (for writing tests), roxygen2 (for writing R documentation), usethis (for automating package setup, data, imports, etc.), and a few others that provide convenient tools for building and testing packages.

2.1.4 What is the downside of R?

R is slow. Or at least that is the perception, and sometimes it is the case. This is because R is not a compiled language, so methods of flow control such as for‐loops are not optimized. This shortcoming is easily circumvented by taking advantage of the vectorization offered through built‐in functions such as those from the apply family in R, but these faster techniques often go unused through lack of proficiency or because it is easier to write a for‐loop. Intrinsically slow functions can be written in C++ and run via Rcpp, but that negates the simplicity of writing R. This is a special case where Python easily surpasses R. Python is also a scripting language, but through the use of NumPy and Numba it gains fast vectorized operations and loops and can use a just‐in‐time (JIT) compiler. Ergo, many performance shortcomings of Python can be addressed by adding a decorator.
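The cost of an interpreted loop, and the gain from handing the loop to compiled code, can be illustrated with a small sketch. To stay dependency‐free it uses only the Python standard library rather than NumPy or Numba, and it assumes the CPython interpreter, where map(), operator.mul, and sum() run their iteration in C. The function names are hypothetical and chosen for illustration only.

```python
import operator
from timeit import timeit


def sum_of_squares_loop(values):
    """Explicit loop: every iteration executes interpreted bytecode."""
    total = 0
    for v in values:
        total += v * v
    return total


def sum_of_squares_builtin(values):
    """Same computation, with the iteration delegated to built-ins.

    Assumption: on CPython, map(), operator.mul, and sum() are
    implemented in C, so no Python-level loop body runs per element.
    """
    return sum(map(operator.mul, values, values))


if __name__ == "__main__":
    data = list(range(1, 1001))

    # Both versions must agree before timing means anything.
    assert sum_of_squares_loop(data) == sum_of_squares_builtin(data)

    # Timings vary by machine and interpreter version.
    t_loop = timeit(lambda: sum_of_squares_loop(data), number=200)
    t_builtin = timeit(lambda: sum_of_squares_builtin(data), number=200)
    print(f"loop: {t_loop:.4f}s  built-in: {t_builtin:.4f}s")
```

The same principle is what NumPy applies to whole arrays, and what a JIT decorator such as Numba's applies to an ordinary loop. How large the speedup is depends on the workload, so it is worth measuring rather than assuming.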

Packages are not written by programmers, or at least not programmers by trade or education. A great many libraries for R are written by researchers and analysts who needed a tool and created it. Because of this, there is often fragmentation in the syntax, incompatibility between packages, or a general lack of best practices that leads to poorly performing code or, in the most drastic setting, code that simply gives erroneous results.

2.1.5 Summary of R

R is firmly entrenched as a premier statistical software package. Its open‐source, community‐based approach has taken the statistical software scene by storm. R's interactive and scripting programming style makes it an attractive and flexible analytic tool. R does lack the speed/flexibility of other languages; yet, for a specialist in statistics, R provides a near‐complete solution. RStudio's efforts further solidify R as a key player moving forward in the modern statistical software ecosystem. We see the popularity of R continuing – however, big data's demands could force R programmers to adopt other tools in conjunction with R if companies/developers fail to keep up with tomorrow's challenges.

2.2 Python

Created by Guido van Rossum and released in 1991, Python is a hugely popular programming language [4]. Python features readable code, an interactive workflow, and an object‐oriented design. Python's architecture affords rapid application development from prototyping to production. Additionally, many tools integrate nicely with Python, facilitating complex workflows. Python also possesses speed, as most of its high‐performance libraries are implemented in C/C++.

Python's core distribution offers only basic statistical features (its standard‐library statistics module covers little beyond descriptive summaries), prompting developers to create supplementary libraries. Below, we detail four well‐supported statistical and mathematical libraries: NumPy [5], SciPy [6], Pandas [7], and Statsmodels [8].
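To see why supplementary libraries are needed, it may help to look at what a plain Python installation provides on its own. The sketch below assumes only the standard‐library statistics module, which is limited to simple descriptive summaries. The helper name describe is hypothetical and used for illustration only.

```python
import statistics


def describe(sample):
    """Descriptive summary using only the standard library."""
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "pstdev": statistics.pstdev(sample),  # population standard deviation
        "stdev": statistics.stdev(sample),    # sample standard deviation (n - 1)
    }


if __name__ == "__main__":
    summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
    for name, value in summary.items():
        print(f"{name}: {value}")
```

Anything beyond summaries of this kind, such as array arithmetic, data frames, or generalized linear models, is where the four libraries named above come in.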

NumPy is a general and fundamental package for
