Daniel J. Denis

Applied Univariate, Bivariate, and Multivariate Statistics Using Python



not so much in terms of data analysis, but rather in examples of how hypothesis-testing works and the like. In this way, it is hoped examples and analogies “hit home” a bit more for readers and students, making the issues “come alive” somewhat rather than featuring abstract examples.

       Python code is “unpacked” and explained in many, though not all, places. Many existing books on the market explain statistical concepts (to varying degrees of precision) and then plop down a bunch of code the reader is expected to simply implement and understand. While we do not avoid this entirely, for the most part we guide the reader step-by-step through both the concepts and the Python code used. The goal of the book is to understand how statistical methods work, not to arm you with a bunch of code whose underlying logic you do not understand. Principal components code, for instance, is meaningless if you do not first understand and appreciate, to some extent, what components analysis is about.

      Statistical Knowledge vs. Software Knowledge

      Having taught applied students in the social and sometimes natural sciences at both the undergraduate and graduate levels for the better part of fifteen years, I have opened each course, to the delight of my students (sarcasm), with a lecture of sorts on the difference between statistical and software knowledge. Very little of the warning is grasped, I imagine, though the real-life experience of it usually surfaces later in their graduate careers (such as at thesis or dissertation defenses, where they may fail to understand their own software output). I will repeat some of that sermon here. While this distinction has always been important, it has perhaps never been more important than it is today, given the computing power available to virtually every student in the sciences and related areas, and the relative ease with which that power can be used. Allowing a new teen driver to drive a Dodge Hellcat with upward of 700 horsepower would be unwise, yet newcomers to statistics and science have access to the equivalent in computing power from their first day. The statistician is shaking his or her head in disapproval, for good reason.

      We live in an age where data analysis is available to virtually anybody with a laptop and a few lines of code. The code can often be dug up online in a matter of seconds, even with very little software knowledge. And of course, with many software programs coding is not even a requirement, as windows and GUIs (graphical user interfaces) have become so easy to use that one can obtain an analysis in seconds or even milliseconds. Though this has its advantages, it is not always, or necessarily, a good thing.

      The problem, succinctly put, is that in many sciences, and contrary to the opinion you might expect from someone writing a data analysis text, students learn too much about how to obtain output at the expense of understanding what the output means or the process needed to draw proper scientific conclusions from it. Sadly, in many disciplines, a course in “Statistics” would be more appropriately called “How to Obtain Software Output,” because that is pretty much all the course teaches students to do. How did statistics education in applied fields become so watered down? Since when did cultivating the art of analytical or quantitative thinking not matter? Faculty who teach such courses in such a superficial style should know better, and should instead teach courses with a lot more “statistical thinking” rather than simply generating software output. Among students (who should not necessarily know better; that is what makes them students), there often exists the illusion that simply because one can obtain output for a multiple regression, the regression was performed correctly and in line with the researcher’s scientific aims. Do you know how to conduct a multiple regression? “Yes, I know how to do it in software.” That is not a correct answer to the question of whether one knows how to conduct a multiple regression! One need not even understand what multiple regression is to “compute one” in software. As a consultant, I have also had a client or two from very prestigious universities email me a bunch of software output and ask “Did I do this right?” or “Were the statistics done correctly?”, assuming I could evaluate their code and output without first knowing their scientific goals and aims. Of course, without an understanding of what they intended to do or the goals of their research, such a question is not only figuratively but also literally impossible to answer, aside from assuring them that the software has a strong reputation for accuracy in number-crunching.
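      To make the point concrete, the following is a minimal sketch (not taken from any particular analysis) of how little is required to “compute” a multiple regression. The data are an assumption made purely for illustration: two hypothetical predictors and a response, all generated as unrelated random noise. NumPy's least-squares routine will still return coefficients and an R-squared without complaint.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n = 50
# Hypothetical data: two predictors and a response, all pure, unrelated noise
X = rng.normal(size=(n, 2))
y = rng.normal(size=n)

# Add an intercept column and solve the ordinary least-squares problem
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# R-squared: proportion of variance in y reproduced by the fitted values
fitted = design @ coef
r_squared = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

print("coefficients:", coef)
print("R-squared:", r_squared)
```

      The software does exactly what it is asked and produces output that looks like any other regression output. Nothing in that output says whether the model was a sensible one to fit, or whether it answers the researcher's scientific question.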