Alex J. Gutman

Becoming a Data Head



and several statistically significant independent variables using alpha equal to 0.05.”

       Business Professional: *awkward silence*

       Us: “Does that make sense?”

       Business Professional: *more silence*

       Us: “Any questions?”

       Business Professional: “No questions at the moment.”

       Business Professional's internal monologue: “What the hell are they talking about?”

      If you watched this unfold in a movie, you might think, “Wait, let's rewind, perhaps I forgot something.” But in real life, where choices are truly mission critical, this rarely happens. We don't rewind. We don't ask for clarification.

      In hindsight, our presentations were too technical. Part of the reason was pure stubbornness—before the mortgage crisis, as we learned, technical details were oversimplified; analysts were brought in to tell decision makers what they wanted to hear—and we were not going to play that game. Our audiences would listen to us.

      But we overcorrected. Audiences couldn't think critically about our work because they didn't understand what we said.

      We thought to ourselves, “There's got to be a better way.” We wanted to make a difference with our work. So we started practicing explaining complex statistical concepts to each other and to other audiences. And we started researching what others thought about our explanations.

      We discovered a middle ground between data workers and business professionals where honest discussions about data can take place without being too technical or too simplified. It involves both sides thinking more critically about data problems, large or small. That's what this book is about.

      You'll also have to embrace the side of data that's not often talked about—how, in many companies, it largely fails. You'll build intuition, appreciation, and healthy skepticism of the numbers and terms you come across. It may seem like a daunting task, but this book will show you how. And you won't need to code or have a Ph.D.

      With clear explanations, thought exercises, and analogies, we will help you develop a mental framework of data science, statistics, and machine learning.

      Let's do just that in the following example.

      Classifying Restaurants

      Imagine you're on a walk and pass by an empty store front with the sign “New Restaurant: Coming Soon.” You're tired of eating at national chains and are always on the lookout for new, locally owned restaurants, so you can't help but wonder, “Will this be a new local restaurant?”

      Let's pose this question more formally: Do you predict the new restaurant will be a chain restaurant or an independent restaurant?

      Take a guess. (Seriously, take a guess before moving on.)

      If this scenario happened in real life, you'd have a pretty good hunch in a split second. If you're in a trendy neighborhood, surrounded by local pubs and eateries, you'd guess independent. If you're next to an interstate highway and near a shopping mall, you'd guess chain.

      But when we asked the question, you hesitated. “They didn't give me enough information,” you thought. And you were right. We didn't give you any data to make a decision.

      Lesson learned: Informed decisions require data.

      Now look at the data in the first image below. The new restaurant is marked with an X, the Cs indicate chain restaurants, and the Is indicate independent, local eateries. What would you guess this time?

      Lesson learned: Predictions should never be 100% confident.

       [Figure: Over the Rhine neighborhood, Cincinnati, Ohio]

       [Figure: Kenwood Towne Centre, Cincinnati, Ohio]

      During this thought experiment, everyone creates a slightly different algorithm in their head. Of course, everyone looks at the markers surrounding the point of interest, X, to understand the neighborhood, but at some point, you must decide when a restaurant is too far away to influence your prediction. At one extreme (and we see it happen), someone looks at the restaurant's single closest neighbor, in this case an independent restaurant, and bases their prediction on it: “The nearest neighbor to X is an (I), so my prediction is (I).”

      Most people, however, look at several neighboring restaurants. The second image shows a circle surrounding the new restaurant containing its seven nearest neighbors. You probably chose a different number, but we chose 7, and 6 out of the 7 are (C) chains, so we'd predict (C).
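      For readers who do want to see this mental shortcut written down, here is a minimal Python sketch of the idea: rank the known restaurants by distance to X and let the k closest ones vote. The function name predict_label and all coordinates are hypothetical, invented for illustration. They are not read off the maps above, only arranged so that the single closest restaurant is an (I) and six of the seven closest are (C).

```python
from collections import Counter
from math import dist


def predict_label(new_point, restaurants, k):
    """Predict a label by majority vote among the k closest restaurants."""
    # Sort the known restaurants by straight-line distance to the new one.
    by_distance = sorted(restaurants, key=lambda r: dist(new_point, r[0]))
    # Keep only the labels of the k closest restaurants.
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label wins; an odd k avoids ties between two labels.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical (x, y) positions and labels, made up for this sketch.
restaurants = [
    ((1.0, 0.0), "I"),
    ((0.0, 2.0), "C"),
    ((2.0, 1.0), "C"),
    ((-2.0, 2.0), "C"),
    ((3.0, 0.0), "C"),
    ((0.0, -3.5), "C"),
    ((-4.0, 0.0), "C"),
    ((6.0, 6.0), "I"),
    ((-7.0, 5.0), "I"),
]
new_restaurant = (0.0, 0.0)  # the X on the map

print(predict_label(new_restaurant, restaurants, k=1))  # I
print(predict_label(new_restaurant, restaurants, k=7))  # C
```

      Notice that the same data gives two different answers depending on how many neighbors get a vote, which is why two people looking at the same map can reasonably disagree.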

      So What?

      If you understand the restaurant example, you're well on your way to becoming a Data Head. Let's reveal what you learned, little by little:

       You performed classification by predicting the label (chain or independent) for a new restaurant by training an algorithm using a set of data (each restaurant's location and its chain/independent label).

       This is precisely machine learning! You just didn't build the algorithm on a computer—you used your head.

       Specifically, this is a type of machine learning called supervised learning. It was “supervised” because you knew the existing restaurants were (C) chain or (I) independent. The labels directed (i.e., supervised) your thinking about how restaurant location is related to whether it's a chain or not.

       Even more specifically, you used a supervised learning classification algorithm called K-nearest-neighbor.[5] If K = 1, look at the closest restaurant and that's your prediction. If K = 7, look at the 7 closest restaurants and predict the majority. It's an intuitive and powerful algorithm. And it's not magic.

       You also learned you need data to make informed decisions. Realize, however, that you need more than that. After all, this book is about critical thinking. We want to show how stuff works but also how it fails. If we asked you to predict, given the data in this Introduction's images, if the new restaurant would be kid-friendly, you wouldn't be able to answer. To make informed decisions, not just any data will do. You need accurate, relevant, and enough data.

       Remember the technobabble we wrote earlier? “… supervised learning analysis of the binary response variable …”? Congratulations, you just did a supervised learning analysis of a binary response variable. Response variable is another name for