Group of authors

Computational Statistics in Data Science



sectors – academia, industry, and government.

      Overall, Stata impresses through active support and development while possessing some unique characteristics. Interestingly, in scholarly work over the past decade, only SPSS, R, and SAS have overshadowed Stata [4]. Taken together, these points suggest that Stata will remain popular. However, Stata's big data capabilities are limited, and we have reservations about whether industry will adopt Stata over its competitors.

      3.13 Tableau®

      Tableau stemmed from visualization research by Stanford University's computer science department in 1999. The Seattle‐based company was founded in 2003. Tableau advertises itself as a data exploration and visualization tool rather than as statistical software per se. Tableau primarily targets the business intelligence market, but it also provides a free, less powerful version for instruction.

      Tableau is versatile and user‐friendly: it provides macOS and Windows versions and supports web‐based apps on iOS and Android. Tableau connects seamlessly to SQL databases, spreadsheets, cloud apps, and flat files. The software appeals to nontechnical “business” users via its intuitive user interface but also allows “power users” to develop analytical solutions by connecting to an R server or installing TabPy to integrate Python scripts.

      Tableau could corner the data visualization market with an interface that is easy to learn yet rich in features. We contend that big data demands visualization because many traditional methods are not well suited to high‐dimensional, observational data. Based on its unique characteristics, Tableau will appeal broadly and could even emerge as a useful supplement to an R or Python user's toolkit.

      With a forward‐thinking mindset, our final section describes a few emerging and promising statistical software languages and packages that can meet tomorrow's complex modeling demands. If a reader encounters scalability challenges in their current statistical programming language, one of the following options may turn a computationally infeasible model into a useful one.

      4.1 Edward, Pyro, NumPyro, and PyMC3

      Several important probabilistic programming libraries have recently been released for Python, namely Edward, Pyro, NumPyro, and PyMC3. These packages can fit broad classes of models with massive numbers of parameters, using advanced particle‐simulation methods such as Hamiltonian Monte Carlo (HMC).
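
      To make the idea behind HMC concrete, the following is a minimal sketch for a one‐dimensional target, written with Python's standard library only. It is not how Edward, Pyro, NumPyro, or PyMC3 implement their samplers. The function name `hmc_sample`, the unit mass, the step size, and the number of leapfrog steps are illustrative assumptions rather than anything taken from those libraries.

```python
import math
import random


def hmc_sample(log_prob, grad_log_prob, x0, n_samples,
               step_size=0.25, n_leapfrog=6, seed=0):
    """Basic HMC for a 1-D target density (illustrative sketch, unit mass)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        # Draw a fresh momentum for the fictitious particle.
        p0 = rng.gauss(0.0, 1.0)
        x_new, p = x, p0

        # Leapfrog integration of the Hamiltonian dynamics:
        # half momentum step, alternating full steps, half momentum step.
        p += 0.5 * step_size * grad_log_prob(x_new)
        for step in range(n_leapfrog):
            x_new += step_size * p
            if step < n_leapfrog - 1:
                p += step_size * grad_log_prob(x_new)
        p += 0.5 * step_size * grad_log_prob(x_new)

        # Metropolis correction for the numerical integration error.
        h_old = -log_prob(x) + 0.5 * p0 * p0
        h_new = -log_prob(x_new) + 0.5 * p * p
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        samples.append(x)
    return samples


def std_normal_log_prob(x):
    # Log density of N(0, 1), up to an additive constant.
    return -0.5 * x * x


def std_normal_grad(x):
    # Gradient of the log density above.
    return -x


draws = hmc_sample(std_normal_log_prob, std_normal_grad, x0=3.0, n_samples=5000)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(f"sample mean = {mean:.2f}, sample variance = {var:.2f}")
```

      For a standard normal target the sample mean and variance should land near 0 and 1. Real libraries add automatic differentiation, step‐size adaptation, and multivariate mass matrices, all of which this sketch omits.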

      4.2 Julia

      Julia is a new language designed by Bezanson et al. and first released in 2012 [27]. Julia's first stable version (1.0) was released in August 2018. The developers describe themselves as “greedy”: they want a language that does it all. Users would no longer need to create prototypes in a scripting language and then port them to C or Java for speed. Below, we quote from Julia's public announcement (https://julialang.org/blog/2012/02/why-we-created-julia):

      We want a language that's open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that's homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like MATLAB. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as MATLAB, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.

      Despite the stated goals, we classify Julia as analysis software at this early stage. Indeed, Julia's syntax is elegant and friendly to mathematics. The language natively implements an extensive mathematical library. Julia's core distribution includes multidimensional arrays, sparse vectors/matrices, linear algebra, random number generation, statistical computation, and signal processing.

      Julia's design affords speeds comparable to C because its code is compiled just in time (JIT) to native machine code rather than interpreted. The language is also embeddable. The software implements multithreading, enabling parallel computing natively. Julia integrates well with other languages: it calls C directly, Python via PyCall, and R via RCall.

      Julia exhibits great promise but remains nascent. We are intrigued by a language that does it all and is easy to use. Yet Julia's immaturity limits its statistical analysis capability. On the other hand, Julia is growing fast, with active support and a positive community outlook. Given Julia's advantages and MATLAB's diminishing appeal, we expect Julia to contribute to the area for years to come.

      4.3 NIMBLE

      4.4 Scala

      An emerging data science tool, Scala (https://www.scala-lang.org/), combines object‐oriented and functional paradigms in a high‐level programming language. Scala is built for complex applications and workflows. To support such applications, static typing helps catch bugs at compile time, even in code that runs numerous parallelized computations or asynchronous programs (dependent jobs). Scala is designed for interoperability with Java/JavaScript; because it runs on the Java Virtual Machine, it has access to the entire Java ecosystem. Scala interfaces with Apache Spark (as do Python and R) for scalable and accurate numeric operations. In short, Scala scales Java for high‐performance computing.

      4.5 Stan

      Stan [28] is a probabilistic programming language (PPL) for specifying models, most often Bayesian. Stan samples posterior distributions using HMC, a variant of Markov chain Monte Carlo (MCMC). For complex models, HMC is more robust and efficient than Gibbs or Metropolis‐Hastings sampling, and it provides insightful diagnostics for assessing convergence and mixing. This may explain why Stan is gaining popularity over other Bayesian samplers (such as BUGS [10] and JAGS [11]).
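
      As an illustration of what a convergence diagnostic can look like, the sketch below computes the classic Gelman–Rubin potential scale reduction factor (often written R‐hat) from several chains, using only Python's standard library. The text above does not say which diagnostics Stan reports, and Stan's own implementation is more elaborate than this. The function name `gelman_rubin_rhat` and the simulated chains are assumptions made for the example.

```python
import random
import statistics


def gelman_rubin_rhat(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    chain_vars = [statistics.variance(c) for c in chains]

    within = statistics.fmean(chain_vars)          # W: mean within-chain variance
    between = n * statistics.variance(chain_means)  # B: between-chain variance

    # Pooled estimate of the posterior variance, then the ratio to W.
    var_hat = (n - 1) / n * within + between / n
    return (var_hat / within) ** 0.5


rng = random.Random(42)

# Four chains that all explore the same N(0, 1) distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]

# Four chains stuck around different locations (a sign of non-convergence).
stuck = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 0.0, 5.0, 5.0)]

print(f"R-hat, well-mixed chains: {gelman_rubin_rhat(mixed):.3f}")
print(f"R-hat, stuck chains:      {gelman_rubin_rhat(stuck):.3f}")
```

      Values close to 1 indicate that the chains agree with one another, while values well above 1 suggest that they have not yet converged to a common distribution.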

      Stan provides a flexible and principled model specification framework. In addition to fully Bayesian inference, Stan can compute log densities and Hessians and supports variational Bayes, expectation propagation, and approximate integration. Stan is available as a command line tool or through R and Python interfaces (RStan and PyStan, respectively).

      Stan has the ability to become the de facto Bayesian modeling software. Designed