Computational Statistics in Data Science

architecture promotes cross‐compatibility and extensibility, and the general‐purpose posterior sampler with innovative diagnostics appeals to novice and advanced modelers alike. Further, to our knowledge, Stan is the only general‐purpose Bayesian modeler that scales to thousands of parameters – a boon for big data analytics.

5 The Future of Statistical Computing

Two key drivers will dictate the direction of statistical software: (i) increased model complexity and (ii) increased speed and sheer volume of data collection (big data). These two factors will require software to be highly flexible: languages must be easy to work with for small‐to‐medium data sets and models while scaling easily to massive ones. The software must give easy access to the latest computer hardware (including GPUs) and provide hassle‐free parallel distribution of tasks. To this end, successful statistical software must feature compiled and optimized implementations of the latest algorithms, parallelization, and support for cloud and cluster computing. No single tool is likely to meet all of these demands, so cross‐compatibility standards must be developed. Moreover, data visualization (including virtual reality) will become increasingly important for large, complex data sets where conventional inferential tools are suspect or of no use.
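As a small illustration of what "hassle‐free parallel distribution of tasks" can look like in practice, the following minimal sketch splits a bootstrap of the sample mean across a pool of workers using only the Python standard library. The function names (`bootstrap_means`, `parallel_bootstrap`), the toy data, and the choice of a thread pool are assumptions made for this example and are not drawn from any package discussed in the chapter. For CPU‐bound work in pure Python, a process pool (`concurrent.futures.ProcessPoolExecutor`) would normally be passed as `executor_cls` instead.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_means(data, n_resamples, seed):
    """Return n_resamples bootstrap means of data, using a private seeded RNG."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]


def parallel_bootstrap(data, n_resamples=1000, n_workers=4,
                       executor_cls=ThreadPoolExecutor):
    """Distribute the resampling across workers and pool the results.

    executor_cls is a hypothetical hook for this sketch: any class with the
    concurrent.futures.Executor interface can be supplied.
    """
    # Give each worker an equal share; the last worker takes the remainder.
    share = n_resamples // n_workers
    sizes = [share] * n_workers
    sizes[-1] += n_resamples - share * n_workers

    with executor_cls(max_workers=n_workers) as pool:
        # One task per worker, each with its own seed for reproducibility.
        futures = [pool.submit(bootstrap_means, data, size, seed)
                   for seed, size in enumerate(sizes)]
        # Collect in submission order so repeated runs give identical output.
        return [m for f in futures for m in f.result()]


if __name__ == "__main__":
    sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]  # toy data
    means = sorted(parallel_bootstrap(sample))
    lower = means[int(0.025 * len(means))]
    upper = means[int(0.975 * len(means)) - 1]
    print(f"Approximate 95% percentile interval: ({lower:.2f}, {upper:.2f})")
```

The calling code does not change when the executor is swapped, which is the kind of flexibility the paragraph above asks of future tools: the same model code should run on a laptop or be handed to a larger pool of workers.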

The advantages of open‐source, community‐based development have been emphasized throughout – especially in the scholarly arena and with smaller businesses. The open‐source paradigm enables rapid software development with limited resources. However, commercial software with dedicated support services will appeal to certain markets, including medium‐to‐large businesses.

6 Concluding Remarks

We have attempted to evaluate the current statistical software landscape. Admittedly, our treatment has been shaped by our experience. We have, however, sought to be fair in our appraisal and to give the burgeoning statistical programmer the information required to make sound tool selections and increase their performance. We began with in‐depth discussions of the most popular statistical software, followed with brief descriptions of many other noteworthy tools, and finally highlighted a handful of emerging statistical software. We hope that this organization is useful, but note that it is based solely on our experiences and informal popularity studies [4]. We also offered a limited prognostication on the future of statistical software by identifying issues and applications likely to shape software development. We realize, of course, that the future is usually full of surprises, and only time will tell what actually occurs.

Acknowledgments

The work of the two authors, AG Schissler and A Knudson, was partially supported by NIH grant 1U54GM104944 through the National Institute of General Medical Sciences (NIGMS) under the Institutional Development Award (IDeA) program. The authors thank the Wiley staff and the editor of this chapter, Dr Walter W. Piegorsch, for their expertise and support.

References

1 R Core Team (2018) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

2 Venables, W. and Ripley, B.D. (2013) S Programming, Springer Science & Business Media, New York, NY, USA.

3 Gentleman, R.C., Carey, V.J., Bates, D.M., et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5 (10), R80.

4 Muenchen, R.A. (2019) The Popularity of Data Science Software, r4stats.com/articles/popularity.

5 Oliphant, T.E. (2006) A Guide to NumPy, vol. 1, Trelgol Publishing, Provo, UT, USA, p. 85.

6 Jones, E., Oliphant, T., and Peterson, P. (2001) SciPy: open source scientific tools for Python.

7 McKinney, W. (2011) pandas: a foundational Python library for data analysis and statistics. Python High Performance Sci. Comput., 14 (9), 1–9.

8 Seabold, S. and Perktold, J. (2010) Econometric and Statistical Modeling with Python. Proceedings of the 9th Python in Science Conference, vol. 57, p. 61.

9 Hunter, J.D. (2007) Matplotlib: a 2D graphics environment. Comput. Sci. Eng., 9 (3), 90–95.

10 Thomas, A., Spiegelhalter, D.J., and Gilks, W.R. (1992) BUGS: a program to perform Bayesian inference using Gibbs sampling. Bayesian Stat., 4 (9), 837–842.

11 Plummer, M. (2005) JAGS: just another Gibbs sampler. Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), Vienna, Austria.

12 Intel (2007) Intel® Math Kernel Library Reference Manual, https://software.intel.com/en‐us/mkl.

13 Whaley, R.C. and Petitet, A. (2005) Minimizing development and maintenance costs in supporting persistently optimized BLAS. Softw. Pract. Exp., 35 (2), 101–121.

14 Xianyi, Z., Qian, W., and Chothia, Z. (2012) OpenBLAS, p. 88, http://xianyi.github.io/OpenBLAS.

15 Anderson, E., Bischof, C., Demmel, J., et al. (1990) Prospectus for an Extension to LAPACK. Working Note ANL‐90‐118, Argonne National Laboratory.

16 Guennebaud, G., et al. (2010) Eigen v3.

17 Sanderson, C. and Curtin, R. (2016) Armadillo: a template‐based C++ library for linear algebra. J. Open Source Softw., 1 (2), 26.

18 Iglberger, K., Hager, G., Treibig, J., and Rüde, U. (2012) High Performance Smart Expression Template Math Libraries. 2012 International Conference on High Performance Computing and Simulation (HPCS), IEEE, pp. 367–373.

19 Dagum, L. and Menon, R. (1998) OpenMP: an industry standard API for shared‐memory programming. IEEE Comput. Sci. Eng., 5 (1), 46–55.

20 Heller, T., Diehl, P., Byerly, Z., et al. (2017) HPX: An Open Source C++ Standard Library for Parallelism and Concurrency. Proceedings of OpenSuCo, p. 5.

21 Frank, E., Hall, M.A., and Witten, I.H. (2016) The WEKA Workbench, Morgan Kaufmann, Burlington, MA.

22 Raff, E. (2017) JSAT: Java statistical analysis tool, a library for machine learning. J. Mach. Learn. Res., 18 (1), 792–796.

23 Abadi, M., Agarwal, A., Barham, P., et al. (2015) TensorFlow: large‐scale machine learning on heterogeneous systems.

24 Zaharia, M., Xin, R.S., Wendell, P., et al. (2016) Apache Spark: a unified engine for big data processing. Commun. ACM, 59 (11), 56–65.

25 Meng, X., Bradley, J., Yavuz, B., et al. (2016) MLlib: machine learning in Apache Spark. J. Mach. Learn. Res., 17 (1), 1235–1241.

26 Bostock, M., Ogievetsky, V., and Heer, J. (2011) D3 data‐driven documents. IEEE Trans. Vis. Comput. Graph., 17 (12), 2301–2309.

27 Bezanson, J., Karpinski, S., Shah, V.B., and Edelman, A. (2012) Julia: a fast dynamic language for technical computing. arXiv preprint arXiv:1209.5145.

28 Carpenter, B., Gelman, A., Hoffman, M.D., et al. (2017) Stan: a probabilistic programming language. J. Stat. Softw., 76 (1), 1–32.

3 An Introduction to Deep Learning Methods

Yao Li¹, Justin Wang², and Thomas C. M. Lee²

¹University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

²University of California at Davis, Davis,