Daniel J. Denis

Applied Univariate, Bivariate, and Multivariate Statistics Using Python

issue may still remain. For instance, how will you convince your committee that what you have measured is actually a good measure of self-esteem?

      In recent years, the “data explosion” has gripped much of science, business, and almost every field of inquiry. Thanks to computers and advanced data warehouse capacities that could have only been dreamt of in years past (and will seem trivial in years to come), the “data deluge” is officially upon us. The facility with which statistical and software analyses can be conducted has increased dramatically. New designations for quantitative analyses often fall under the names of data science and machine learning, and because data is so cheap to store, many organizations, academic and otherwise, can collect and store massive amounts of data – so much so that analysis of such data sometimes falls under the title of “big data.” For example, world population data regarding COVID-19 were analyzed in an attempt to spot trends in the virus across age groups, the extent of comorbidity with other illnesses, among other things. Such analyses are usually done on very large and evolving databases. The mechanisms for storing and accessing such data are, rightly so, not truly areas of “statistics” per se, and have more to do with data engineering and the like. The field of machine learning, an area primarily in computer science, is an emerging area that emphasizes modern software technology in analyzing data, deciphering trends, and visually depicting results via advanced and sophisticated graphics. As you venture further into data analysis in general, some of the algorithms you may use may come from this field.

      Though data science, machine learning, and other allied fields are relatively new and exciting, it is nonetheless important for the reader not to simply and automatically associate new words with necessarily new “things.” Human beings are creatures of psychological association, and so when we hear a new term, we often create a new category for that term in our minds, assuming that since there is a new word, there must be an equivalently new category. However, that does not necessarily imply the new association we have created is one-to-one with the reality of the object. The new vehicle promoted by a car company may be an older design “updated” rather than an entirely new vehicle. Hence, when you hear new terminology in quantitative areas, it is imperative that you never stop with the word alone, but instead delve in deeper to see what is actually “there” in terms of new substance. Why is this approach to understanding important? Because otherwise, especially as a newcomer to these areas, you may come to believe that what you are studying is entirely novel. Indeed, it may be “new,” but it may not be as novel or categorically different from the “old” as you may at first think. Likewise, humanistic psychology of the 1950s was not entirely new. The Greeks had very similar ideas. The marketing was new, but the ideas were generally not.

      This discussion is not meant to start a “turf war” over the priority of human intellectual invention. Far from it. If we were to do that, then we would have to also acknowledge that though Newton and Leibniz put the final touches on the calculus, the idea that they “invented” it, in the truest sense of the word, is a bit of a far cry. Priority disputes in the history of human discovery usually prove futile and virtually impossible to resolve, even among those historians who study the most ancient roots of intellectual invention on a full-time basis. That is, even assigning priority to ancient discoveries of intellectual concepts is exceedingly difficult (especially without lawyers!), which further provides evidence that “modern” concepts are often not modern at all. As another example, the concept of a computer may not be a modern invention. Historians have shown that its primitive origin may go back to Charles Babbage and the “Analytical Engine,” and its concept probably goes back further in years as well (Green, 2005). As the saying goes, the only thing we do not know is the history we are unaware of or, as Mark Twain once remarked, few if any ideas are original, and they can usually be traced back to earlier ones.

      One aspect of the “data revolution,” with data science and machine learning leading the way, has been the emphasis on the concept of statistical learning. As mentioned, simply because we assign a new word or phrase to something does not necessarily mean that it represents something entirely new. The phrase “statistical learning” is testimony to this. In its simplest and most direct form, statistical learning simply means fitting a model or algorithm to data. The model then “learns” from the data. But what does it learn exactly? It essentially learns what its estimators should be (in terms of selecting optimal values for them) in order to maximize or minimize a function. For example, in the case of a simple linear regression, if we take a model of the type yi = α + βxi + εi and fit it to some data, the model “learns” from the data the best values for a and b, which are estimators for α and β, respectively. Given that we are using ordinary least-squares as our method of estimation, the regression model “learns” what a and b should be such that the sum of squared errors is kept to a minimum value (if you don’t know what all this means, no worries, you’ll learn it in Chapter 7). The point here for this discussion is that the “learning” or “training” consists simply of selecting scalars
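As a minimal sketch of this idea, the following Python snippet simulates data from a known line (hypothetical true values α = 2 and β = 3, chosen here purely for illustration) and then “learns” the estimates a and b using the closed-form ordinary least-squares solution. The details of these formulas are developed in Chapter 7; the point here is only that the fitted model amounts to two scalars chosen to minimize the sum of squared errors.

```python
import numpy as np

# Hypothetical simulated data: true intercept 2.0, true slope 3.0,
# plus random noise (these values are illustrative assumptions).
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=x.size)

# Closed-form OLS estimates: the "learned" scalars a and b are those
# that minimize the sum of squared errors sum((y_i - a - b*x_i)**2).
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(f"learned intercept a = {a:.2f}, learned slope b = {b:.2f}")
```

Because the data contain noise, a and b will not equal 2 and 3 exactly, but they should land close to them; with more data, the estimates would typically get closer still. That is all the “learning” amounts to in this setting.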