the like are churning out critical thinkers at lightning speed. And if working in data is all about uncovering the truth, then Data Heads want to do just that.
What does it mean, then, when they sit down to a project that doesn't whet their appetite? What does it mean for them to have to work on a poorly defined issue where their skills become bragging rights for executives but don't actually solve meaningful problems?
It means many data workers are dissatisfied with their jobs. Working on problems that are overly focused on technology and have ambiguous outcomes leads to frustration and disillusionment. Kaggle.com, where data scientists from all over the world compete in data science competitions and learn new analysis methods, surveyed data scientists about the barriers they face at work.2 Several of the barriers, listed here, are directly related to poorly defined problems and improper planning:
Lack of clear question to answer (30.4% of respondents experienced this)
Results not used by decision makers (24.3%)
Lack of domain expert input (19.6%)
Expectations of project impact (15.8%)
Integrating findings into decisions (13.6%)
This has obvious consequences. Those who aren't satisfied in their roles leave.
CHAPTER SUMMARY
The premise and structure of this book are built around teaching you to ask more probing questions. It starts with the most important, and sometimes hardest, question: “What's the problem?”
In this chapter, you learned ways to refine and clarify the central business question and why problems involving data and analysis are particularly challenging. We shared five important questions a Data Head should ask when defining a problem. You also learned about early warning signs to spot when a question starts to go off track. If the question hints at (1) a methodology focus or (2) a deliverable focus, it's time to hit pause.
When these questions are answered, you are ready to get to work.
NOTES
1. A robust data strategy can help companies mitigate these issues. Of course, an important component of any data strategy is to solve meaningful problems, and that's our focus in this chapter. If you'd like to learn more about high-level data strategy, see Jagare, U. (2019). Data science strategy for dummies. John Wiley & Sons.
2. 2017 Kaggle Machine Learning & Data Science Survey. Data is available at www.kaggle.com/kaggle/kaggle-survey-2017. Accessed on January 12, 2021.
CHAPTER 2 What Is Data?
“If we have data, let's look at data. If all we have are opinions, let's go with mine.”
—Jim Barksdale, former Netscape CEO
Many people work with data without a shared vocabulary for it. We want to make sure we're all speaking the same language so the rest of the book is easier to follow. So, in this chapter, we'll give you a brief crash course on data and data types. If you've had a basic statistics or analytics course, you'll know the terms that follow, but parts of our discussion may not have been covered in your class.
DATA VS. INFORMATION
The terms data and information are often used interchangeably. In this book, however, we make a distinction between the two.
Information is derived knowledge. You can derive knowledge from many activities: measuring a process, thinking about something new, looking at art, and debating a subject. From the sensors on satellites to the neurons firing in our brains, information is continually created. Communicating and capturing that information, however, is not always simple. Some things are easily measurable while others are not. But we endeavor to communicate knowledge for the benefit of others and to store what we've learned. And one way to communicate and store information is by encoding it. When we do this, we create data. As such, data is encoded information.
An Example Dataset
Table 2.1 tells the story of a company. Each month, they run a different marketing campaign online, on television, or in print media (newspapers and magazines). The process they run generates new information each month. The table they've created is an encoding of this information and thus it holds data.
A table of data, like Table 2.1, is called a dataset.
Notice that it has both rows and columns that serve specific functions in how we understand the table. Each row of the table (running horizontally, under the header row) is a measured instance of associated information. In this case, it's a measured instance of information for a marketing campaign. Each column of the table (running vertically) is a list of information we're interested in, organized into a common encoding so that we can compare each instance.
The rows of each table are commonly referred to as observations, records, tuples, or trials. Columns of datasets often go by the names features, fields, attributes, predictors, or variables.
Know Your Audience
Data is studied in many different fields, each with its own lingo, which is why there are many names for the same things. Some data workers, when talking about the columns in a dataset, might prefer “features” while others say “variables” or “predictors.” Part of being a Data Head is being able to navigate conversations within these groups and their preferences.
A data point is the intersection of an observation and a feature. For example, 150 units sold on 2021-02-01 is a data point.
TABLE 2.1 Example Dataset on Advertisement Spending and Revenue
Date | Ad Spending | Units Sold | Profit | Location |
---|---|---|---|---|
2021-01-01 | 2000 | 100 | 10452 | |
2021-02-01 | 1000 | 150 | 15349 | Online |
2021-03-01 | 3000 | 200 | 25095 | Television |
2021-04-01 | 1000 | 175 | 12443 | Online |
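To make the vocabulary concrete, here is a minimal sketch in Python that encodes Table 2.1 and pulls out an observation (row), a feature (column), and a data point (their intersection). The helper names `get_observation`, `get_feature`, and `get_data_point` are our own illustrative choices, not part of any library. The first row's Location is stored as `None` only because that cell is blank in the table as shown here.

```python
# Table 2.1 as a list of observations (rows).
# Each dictionary key is a feature (column).
# Assumption: the blank Location in the first row is represented as None.
dataset = [
    {"Date": "2021-01-01", "Ad Spending": 2000, "Units Sold": 100,
     "Profit": 10452, "Location": None},
    {"Date": "2021-02-01", "Ad Spending": 1000, "Units Sold": 150,
     "Profit": 15349, "Location": "Online"},
    {"Date": "2021-03-01", "Ad Spending": 3000, "Units Sold": 200,
     "Profit": 25095, "Location": "Television"},
    {"Date": "2021-04-01", "Ad Spending": 1000, "Units Sold": 175,
     "Profit": 12443, "Location": "Online"},
]


def get_observation(rows, date):
    """Return the observation (row) recorded for a date, or None if absent."""
    for row in rows:
        if row["Date"] == date:
            return row
    return None


def get_feature(rows, name):
    """Return one feature (column) as a list of values, in row order."""
    return [row[name] for row in rows]


def get_data_point(rows, date, feature):
    """Return the value where an observation and a feature intersect."""
    row = get_observation(rows, date)
    return None if row is None else row[feature]


# A feature: every value in the "Units Sold" column.
print(get_feature(dataset, "Units Sold"))  # [100, 150, 200, 175]

# A data point: units sold in the 2021-02-01 observation.
print(get_data_point(dataset, "2021-02-01", "Units Sold"))  # 150
```

Storing the dataset as a list of dictionaries keeps each observation together, which mirrors how the table is read row by row. Other layouts, such as one list per column, would work equally well for this illustration.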
Table 2.1 has a header (a piece of non-numerical