patterns in larger or smaller data sets that humans didn’t know to seek. It’s a little miraculous how well data analytics work, if you think about it.
Finding patterns is no small matter. According to a report from global consulting firm McKinsey & Company, machine learning models have outperformed most medical professionals in diagnosing and predicting the onset of disease. For example, machine learning has outperformed board-certified dermatologists in identifying melanoma and has beaten oncologists at accurately predicting cancers using radiomics and other machine learning techniques. Numerous reports from other industry analysts detail a spectacular array of lifesaving successes from machine pattern discoveries.
Couple such wins with the proven success of recent mRNA COVID-19 vaccines and you're well on the way to significant breakthroughs for a variety of disease cures and vaccines. And a lot of the secret sauce is based on the patterns found in data. Nevertheless, I'm here to say that, though there's plenty to cheer about, it's also prudent to realize that you can identify the patterns correctly and still completely miss the big picture.
It's time to take a look at how that happens so that later chapters can show how decision intelligence circumvents these and similar problems in the decision-making process.
All the helicopters are broken
The trouble with data sets is that no matter how large they are, something is always missing. That's because there's no all-inclusive data singularity: no single data source containing all known information, in other words. There's only a hodgepodge collection of data scattered here and there and yonder. By its nature, any of those data sets is incomplete.
The thing is, people analyze incomplete data anyway because good enough is always better than perfect, simply because perfect doesn’t exist. Even if there were a data singularity, data would most certainly still be missing from the pile. There appears to be no such thing as a true know-it-all in flesh or digital form.
That means data scientists and other data professionals must make assumptions, infer, augment, and otherwise tinker about to reach a reasonable output in the final analysis. There's nothing wrong with that. Your own human mind works that way. For example, if your eyes don't catch all the details in a scene, your brain reaches back to your knowledge banks and memories to fill in the blanks so that you can better interpret what you saw. That method works well in helping you select an immediate escape action in an emergency, but it's pretty much a total fail when it comes to eyewitness recollections in legal testimony.
People can often see where data is incomplete and augment it accordingly, but other gaps escape notice because, again, your own brain is filling in a picture of what should be there but often isn't.
To hammer this point home, think of the problems associated with analyzing data in the hope of discovering what causes helicopter crashes. Data from helicopter crashes around the world and over time are carefully collected to be analyzed. So far, so good, right?! Yes — until the moment the machine informs you that all the helicopters are broken, which, of course, is untrue.
But the machine thinks it's true because the only data it saw was from crashed helicopters. To accurately analyze why helicopters crash, the analytics and AI also need to see data from helicopters that didn't crash: helicopters that should have crashed but didn't, helicopters that nearly crashed but landed safely, and helicopters that functioned properly over numerous flights and in varied conditions. Now there's a better view of helicopter crashes, and the machine finally learns that, no, helicopters don't crash because all the helicopters are broken. It took a human to realize that fact first, however. (Statisticians call this trap survivorship bias.)
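To see the trap in miniature, here's a minimal Python sketch. The fleet sizes and fault flags are invented for illustration; the point is only how the conclusion flips once the uncrashed helicopters enter the data set.

```python
# Minimal sketch of the survivorship-bias trap described above.
# All fleet records below are invented for illustration.

crashed = [
    {"id": 1, "fault_found": True},
    {"id": 2, "fault_found": True},
    {"id": 3, "fault_found": True},
]

# Analyze only the crashed aircraft: 100% show faults, so
# "all the helicopters are broken" looks like a sound conclusion.
crash_rate = sum(h["fault_found"] for h in crashed) / len(crashed)
print(f"Fault rate among crashes: {crash_rate:.0%}")           # 100%

# Add the helicopters that flew without incident (the data the
# machine never saw) and the conclusion collapses.
flew_fine = [{"id": i, "fault_found": False} for i in range(4, 104)]
fleet = crashed + flew_fine
fleet_rate = sum(h["fault_found"] for h in fleet) / len(fleet)
print(f"Fault rate across the whole fleet: {fleet_rate:.0%}")  # ~3%
```

Same fault data, same arithmetic; the only thing that changed is which helicopters were allowed into the sample.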
Decision intelligence adds more disciplines and methodologies to the decision-making process in order to move beyond (and guard against) faulty conclusions and misleading interpretations of outputs, thereby moving the organization forward to its desired outcome.
MIA: Chunks of crucial but hard-to-get real-world data
At a 2019 Microsoft workshop, the powers that be gave tech journalists and industry analysts hands-on experience in programming AI chatbots as well as a preview of upcoming Microsoft data-related technologies, including AI, quantum computing, and bioinformatics. One topic touched on was the need for synthetic data, although if I recall correctly, Microsoft called it something else at the time. (Virtual data? Augmented data?)
Regardless of what people call it, you might ask why anyone would need artificially created data, given the exponential growth of data from the real world. International Data Corporation (IDC), a premier global market intelligence firm, has reported that more data will be created over the years 2020 to 2023 than was created over the past 30 years. The analysts also say that the world will create more than three times as much data over the next five years as it did in the previous five. Statista, another global leader when it comes to market and consumer data, pegs global data creation at more than 181 zettabytes by 2025.
I don’t care how big your data center is, that’s an overwhelming amount of data! So, why on earth would you need to create artificial data on top of what you already have? Well, it comes down to the fact that data sets are by nature incomplete. Furthermore, some real-world data is extremely difficult, impossible, or too dangerous to capture.
Synthetic, augmented, and virtual data aren't the same thing as false, made-up-out-of-whole-cloth data, although false or manipulated data can definitely be injected into real-world and synthetic data sets. (Those are problems for cybersecurity and data validators to address.) Here I'm talking about creating data that you cannot easily, safely, or affordably obtain through other means.
For example, you might think that getting wind speed data from the blades of a wind turbine, like the ones shown in Figure 3-1, would be a simple matter of taking reads from a sensor on the blades. But what do you do if those sensors fail? You can't safely send a repairperson to replace a sensor in the middle of a commercial wind farm, where the wind coming off the blades of numerous high-powered turbines can reach hurricane force. You can, however, infer readings based on previous sensor data relative to neighboring wind turbine data in current weather conditions, filling in the missing data with values inferred from previous metrics and/or neighboring devices' measurements, in other words. For example, you can infer, without measuring it again, that because a specific structure measured 6 feet tall yesterday and doesn't possess the ability to grow, it is still 6 feet tall today. A better inference would also note that the structure has not toppled or sunk into the ground.
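Here's a minimal Python sketch of that kind of gap-filling. The turbine IDs, wind speeds, and the simple averaging rule are all assumptions invented for illustration, not a real wind-farm workflow.

```python
# Hedged sketch: fill a failed sensor's reading from its neighbors'
# current readings, falling back to its own last-known value.
# Turbine IDs, wind speeds (m/s), and the averaging rule are invented.

def infer_wind_speed(turbine, current, previous, neighbors):
    """Return a best-guess wind speed for `turbine`."""
    if current.get(turbine) is not None:
        return current[turbine]                 # sensor is working fine
    # Prefer the average of neighbors reporting right now...
    live = [current[n] for n in neighbors.get(turbine, ())
            if current.get(n) is not None]
    if live:
        return sum(live) / len(live)
    # ...and fall back to this turbine's own last-known reading.
    return previous.get(turbine)

current   = {"T1": 11.8, "T2": None, "T3": 12.4}   # T2's sensor failed
previous  = {"T1": 11.5, "T2": 12.0, "T3": 12.2}
neighbors = {"T2": ("T1", "T3")}

print(infer_wind_speed("T2", current, previous, neighbors))  # 12.1
```

The 6-foot-structure inference works the same way: when a value can't plausibly have changed, yesterday's measurement is a perfectly serviceable stand-in for today's.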
However, you can also create synthetic data sets by building simulations based on known laws of physics, wind turbine specs, and other factors; the resulting synthetic data can be safely collected and used in decision-making. Most, but not all, synthetic data is created by simulations.
FIGURE 3-1: How fast are these spinning again?
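As a hedged sketch of what such a simulation might look like, the toy generator below produces synthetic (wind speed, power) readings. The power equation P = 0.5 * rho * A * Cp * v^3 is standard wind-turbine physics; the rotor size, power coefficient, and noise level are assumptions chosen purely for illustration.

```python
# Toy physics-based simulation that generates synthetic turbine data.
# The power equation is standard wind-turbine physics; RADIUS, CP,
# and the 2% noise level are assumed values for illustration only.
import math
import random

RHO = 1.225        # air density, kg/m^3 (sea level)
RADIUS = 50.0      # rotor radius in meters (assumed)
CP = 0.40          # power coefficient (assumed; Betz limit is ~0.593)
AREA = math.pi * RADIUS ** 2

def synthetic_readings(n, seed=42):
    """Yield n (wind_speed_m_s, power_kw) pairs with sensor noise."""
    rng = random.Random(seed)
    for _ in range(n):
        v = rng.uniform(3.0, 25.0)                 # plausible wind speeds
        power_w = 0.5 * RHO * AREA * CP * v ** 3   # ideal physics
        power_w *= rng.gauss(1.0, 0.02)            # ~2% sensor noise
        yield round(v, 1), round(power_w / 1000, 1)

for reading in synthetic_readings(3):
    print(reading)
```

Because the generator encodes the physics rather than sampling the real farm, it can safely produce readings for conditions no technician could ever be sent out to measure.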
Another example would be facial recognition data. Many countries regulate how much (if any) facial data can be captured or used without a person's prior consent, which can significantly limit the amount of facial data available for training facial recognition machine learning models. To overcome the shortage, companies turn to AI-generated faces of people who don't actually exist. Data from fake faces also helps machine learning models learn to determine which faces are real and which are not. The distinction can be useful in many endeavors,