Group of authors

Computational Statistics in Data Science



architecture promotes cross‐compatibility and extensibility, and the general‐purpose posterior sampler with innovative diagnostics appeals to novice and advanced modelers alike. Further, to our knowledge, Stan is the only general‐purpose Bayesian modeler that scales to thousands of parameters – a boon for big data analytics.

      The advantages of open‐source, community‐based development have been emphasized throughout – especially in the scholarly arena and with smaller businesses. The open‐source paradigm enables rapid software development with limited resources. However, commercial software with dedicated support services will appeal to certain markets, including medium‐to‐large businesses.

      We attempted to evaluate the current statistical software landscape. Admittedly, our treatment has been shaped by our experience. We have, however, sought to be fair in our appraisal and to give the budding statistical programmer the information required to make sound tool selection choices and increase their performance. We began with in‐depth discussions of the most popular statistical software, followed with brief descriptions of many other noteworthy tools, and finally highlighted a handful of emerging statistical software packages. We hope that this organization is useful, but note that it is based solely on our experiences and informal popularity studies [4]. We also offered a limited prognostication about the future of statistical software by identifying issues and applications likely to shape software development. We realize, of course, that the future is usually full of surprises, and only time will tell what actually occurs.

      The work of two of the authors, AG Schissler and A Knudson, was partially supported by NIH grant 1U54GM104944 through the National Institute of General Medical Sciences (NIGMS) under the Institutional Development Award (IDeA) program. The authors thank the Wiley staff and the editor of this chapter, Dr Walter W. Piegorsch, for their expertise and support.

      1 R Core Team (2018) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

      2 Venables, W. and Ripley, B.D. (2013) S Programming, Springer Science & Business Media, New York, NY, USA.

      3 Gentleman, R.C., Carey, V.J., Bates, D.M., et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5 (10), R80.

      4 Muenchen, R.A. (2019) The Popularity of Data Science Software, r4stats.com/articles/popularity.

      5 Oliphant, T.E. (2006) A Guide to NumPy, vol. 1, Trelgol Publishing, Provo, UT, USA, p. 85.

      6 Jones, E., Oliphant, T., and Peterson, P. (2001) SciPy: open source scientific tools for Python.

      7 McKinney, W. (2011) pandas: a foundational Python library for data analysis and statistics. Python High Performance Sci. Comput., 14 (9), 1–9.

      8 Seabold, S. and Perktold, J. (2010) Econometric and Statistical Modeling with Python. Proceedings of the 9th Python in Science Conference, vol. 57, p. 61.

      9 Hunter, J.D. (2007) Matplotlib: a 2D graphics environment. Comput. Sci. Eng., 9 (3), 90–95.

      10 Thomas, A., Spiegelhalter, D.J., and Gilks, W.R. (1992) BUGS: a program to perform Bayesian inference using Gibbs sampling. Bayesian Stat., 4 (9), 837–842.

      11 Plummer, M. (2005) JAGS: just another Gibbs sampler. Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), Vienna, Austria.

      12 Intel (2007) Intel® Math Kernel Library Reference Manual, https://software.intel.com/en‐us/mkl.

      13 Whaley, R.C. and Petitet, A. (2005) Minimizing development and maintenance costs in supporting persistently optimized BLAS. Softw. Pract. Exp., 35 (2), 101–121.

      14 Xianyi, Z., Qian, W., and Chothia, Z. (2012) OpenBLAS, p. 88, http://xianyi.github.io/OpenBLAS.

      15 Anderson, E., Bischof, C., Demmel, J., et al. (1990) Prospectus for an Extension to LAPACK. Working Note ANL‐90‐118, Argonne National Laboratory.

      16 Guennebaud, G., et al. (2010) Eigen v3.

      17 Sanderson, C. and Curtin, R. (2016) Armadillo: a template‐based C++ library for linear algebra. J. Open Source Softw., 1 (2), 26.

      18 Iglberger, K., Hager, G., Treibig, J., and Rüde, U. (2012) High Performance Smart Expression Template Math Libraries. 2012 International Conference on High Performance Computing and Simulation (HPCS), IEEE, pp. 367–373.

      19 Dagum, L. and Menon, R. (1998) OpenMP: an industry standard API for shared‐memory programming. IEEE Comput. Sci. Eng., 5 (1), 46–55.

      20 Heller, T., Diehl, P., Byerly, Z., et al. (2017) Hpx‐An Open Source C++ Standard Library for Parallelism and Concurrency. Proceedings of OpenSuCo, p. 5.

      21 Frank, E., Hall, M.A., and Witten, I.H. (2016) The WEKA Workbench, Morgan Kaufmann, Burlington, MA.

      22 Raff, E. (2017) JSAT: Java statistical analysis tool, a library for machine learning. J. Mach. Learn. Res., 18 (1), 792–796.

      23 Abadi, M., Agarwal, A., Barham, P., et al. (2015) TensorFlow: large‐scale machine learning on heterogeneous systems.

      24 Zaharia, M., Xin, R.S., Wendell, P., et al. (2016) Apache Spark: a unified engine for big data processing. Commun. ACM, 59 (11), 56–65.

      25 Meng, X., Bradley, J., Yavuz, B., et al. (2016) MLlib: machine learning in Apache Spark. J. Mach. Learn. Res., 17 (1), 1235–1241.

      26 Bostock, M., Ogievetsky, V., and Heer, J. (2011) D3 data‐driven documents. IEEE Trans. Vis. Comput. Graph., 17 (12), 2301–2309.

      27 Bezanson, J., Karpinski, S., Shah, V.B., and Edelman, A. (2012) Julia: a fast dynamic language for technical computing. arXiv preprint arXiv:1209.5145.

      28 Carpenter, B., Gelman, A., Hoffman, M.D., et al. (2017) Stan: a probabilistic programming language. J. Stat. Softw., 76 (1), 1–32.

      1 de Leeuw, J. (2009) Journal of Statistical Software, Wiley Interdiscip. Rev. Comput. Stat., 1 (1), 128–129.

       Yao Li1, Justin Wang2, and Thomas C. M. Lee2

       1University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

       2University of California at Davis, Davis,