target="_blank" rel="nofollow" href="#fb3_img_img_60036427-ef5d-5a42-8c62-242f2aed6dd5.png" alt="Remember"/> When thinking about big data, you must also consider anonymity. Big data presents privacy concerns. However, because of the way machine learning works, knowing specifics about individuals isn’t particularly helpful anyway. Machine learning is all about determining patterns: analyzing training data in such a manner that the trained algorithm can perform tasks that the developer didn’t originally program it to do. Personal data has no place in such an environment.
Finally, big data is so large that humans can’t reasonably visualize it without help. Part of what defines big data as big is the fact that a human can learn something from it, but the sheer magnitude of the dataset makes recognizing the patterns impossible (or at least extremely time consuming). Machine learning helps humans make sense of and use big data.
Considering the Sources of Big Data
Before you can use big data for a machine learning application, you need a source of big data. Of course, the first thing that most developers think about is the huge, corporate-owned database, which could contain interesting information, but it’s just one source. The fact of the matter is that your corporate databases might not even contain particularly useful data for a specific need. The following sections describe locations you can use to obtain additional big data.
Building a new data source
To create viable sources of big data for specific needs, you might find that you actually need to create a new data source. Developers built existing data sources around the needs of the client-server architecture in many cases, and these sources may not work well for machine learning scenarios because they lack the required depth (being optimized to save space on hard drives does have disadvantages). In addition, as you become more adept in using machine learning, you find that you ask questions that standard corporate databases can’t answer. With this in mind, the following sections describe some interesting new sources for big data.
Obtaining data from public sources
Governments, universities, nonprofit organizations, and other entities often maintain publicly available databases that you can use alone or combined with other databases to create big data for machine learning. For example, you can combine data from several Geographic Information Systems (GIS) to help create the big data required to make decisions such as where to put new stores or factories. The machine learning algorithm can take all sorts of information into account, everything from the amount of taxes you have to pay to the elevation of the land (which can contribute to making your store easier to see).
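As a rough illustration of combining public sources, the following Python sketch joins two small tables on a shared key. The parcel IDs, column names, and values are invented for the example; real GIS extracts will use different identifiers and formats.

```python
import csv
import io

# Hypothetical extracts from two public sources, keyed by a shared parcel ID.
TAX_CSV = """parcel_id,annual_tax
P-001,12000
P-002,8500
P-003,15000
"""

ELEVATION_CSV = """parcel_id,elevation_m
P-001,210
P-002,340
P-004,95
"""

def load_by_key(text, key):
    """Parse CSV text into a dict that maps each key value to its row."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}

def combine_sources(left, right):
    """Inner-join two keyed tables, keeping only keys found in both."""
    combined = {}
    for key in left.keys() & right.keys():
        merged = dict(left[key])
        merged.update(right[key])
        combined[key] = merged
    return combined

taxes = load_by_key(TAX_CSV, "parcel_id")
elevations = load_by_key(ELEVATION_CSV, "parcel_id")
sites = combine_sources(taxes, elevations)

for parcel_id in sorted(sites):
    site = sites[parcel_id]
    print(parcel_id, site["annual_tax"], site["elevation_m"])
```

Only parcels that appear in both sources survive the join, which is one reason combined datasets are often smaller than either input.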
The best part about using public data is that it’s usually free, even for commercial use (or you pay a nominal fee for it). In addition, many of the organizations that created these sources maintain them in nearly perfect condition because the organization has a mandate, uses the data to attract income, or uses the data internally. When obtaining public source data, you need to consider a number of issues to ensure that you actually get something useful. Here are some of the criteria you should think about when making a decision:
The cost, if any, of using the data source
The formatting of the data source
Access to the data source (which means having the proper infrastructure in place, such as an Internet connection when using Twitter data)
Permission to use the data source (some data sources are copyrighted)
Potential issues in cleaning the data to make it useful for machine learning
Potential security issues in accessing the data, adding it to other data sources, and managing it locally
Ensuring that the data is the original data, rather than data that purports to be original but has been biased or modified in other ways that would change the results of using it
Determining that the data doesn’t contain personally identifiable information that the data source originator may not have permission to use. (Chapter 22 covers issues like this one.)
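The last criterion in this list lends itself to a quick automated first pass. The sketch below scans rows for values that look like email addresses or U.S.-style phone numbers. The patterns, the function name flag_possible_pii, and the sample rows are assumptions made for illustration; a crude screen like this can miss many kinds of personally identifiable information and doesn’t replace a proper review.

```python
import re

# Deliberately simple patterns; they are illustrative, not exhaustive.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def flag_possible_pii(rows):
    """Return (row_index, column, kind) for each value that looks like PII."""
    findings = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            text = str(value)
            if EMAIL_PATTERN.search(text):
                findings.append((index, column, "email"))
            if PHONE_PATTERN.search(text):
                findings.append((index, column, "phone"))
    return findings

# Hypothetical rows standing in for a downloaded public dataset.
sample_rows = [
    {"region": "North", "note": "Store opened in 2019"},
    {"region": "South", "note": "Contact jane.doe@example.com for details"},
    {"region": "East", "note": "Call 555-123-4567 after 5 pm"},
]

findings = flag_possible_pii(sample_rows)
for row_index, column, kind in findings:
    print(f"row {row_index}, column {column!r}: possible {kind}")
```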
Obtaining data from private sources
You can obtain data from private organizations such as Amazon (see Open Data, https://aws.amazon.com/opendata/) and Google (see Public Data Explorer, https://www.google.com/publicdata/directory), both of which maintain immense databases that contain all sorts of useful information. Apart from publicly shared data sources, you should expect to pay for access in some cases, especially when you use the data in a commercial setting. You may not be allowed to download the data to your personal servers, and that restriction may affect how you use the data in a machine learning environment. For example, some algorithms work more slowly with data that they must access in small pieces.
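When you can only read data in small pieces, one common workaround is to compute results incrementally rather than loading everything at once. Here is a minimal sketch of that idea, assuming a hypothetical fetch_in_chunks function that stands in for whatever paged access a provider actually offers.

```python
def fetch_in_chunks(values, chunk_size):
    """Stand-in for a remote source that returns a few records per request."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]

def streaming_mean(chunks):
    """Compute a mean one chunk at a time, without holding the full dataset."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else 0.0

# Hypothetical values that would normally live on the provider's servers.
remote_values = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]
mean = streaming_mean(fetch_in_chunks(remote_values, chunk_size=2))
print(mean)
```

Simple aggregates such as sums and means adapt easily to this style, but algorithms that need repeated passes over the full dataset pay the access cost on every pass.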
The biggest advantage of using data from a private source is that you can expect better consistency. The data is likely cleaner than from a public source. In addition, you usually have access to a larger database with a greater variety of data types. Of course, it all depends on where you get the data.
Creating new data from existing data
Your existing data may not work well for machine learning scenarios, but that doesn’t keep you from creating a new data source using the old data as a starting point. For example, you might find that you have a customer database that contains all the customer orders, but the data isn’t useful for machine learning because it lacks the tags required to group the data into specific types. One of the new job types that you can expect to create is that of people who massage data to make it better suited for machine learning, including the addition of specific information types such as tags.
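To make the idea of adding tags concrete, the following sketch applies keyword rules to order descriptions. The rule table, field names, and sample orders are all hypothetical; in practice, people usually define and refine such tags by hand before any of the work is automated.

```python
# Hypothetical keyword rules that map a tag to words signaling that tag.
TAG_RULES = {
    "outdoor": ("tent", "hiking", "kayak"),
    "kitchen": ("blender", "skillet", "knife"),
}

def tag_order(description):
    """Return a sorted list of tags whose keywords appear in the description."""
    text = description.lower()
    tags = [
        tag
        for tag, keywords in TAG_RULES.items()
        if any(keyword in text for keyword in keywords)
    ]
    return sorted(tags) or ["untagged"]

# Invented order records standing in for an existing customer database.
orders = [
    {"order_id": 1, "description": "Two-person tent and hiking poles"},
    {"order_id": 2, "description": "Cast-iron skillet"},
    {"order_id": 3, "description": "Gift card"},
]

# Build new records that carry the original fields plus a tags field.
tagged_orders = [dict(order, tags=tag_order(order["description"])) for order in orders]

for order in tagged_orders:
    print(order["order_id"], order["tags"])
```

Orders that match no rule are marked untagged, which gives a human reviewer an obvious place to start.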
Machine learning will have a significant effect on your business. The article at https://www.computerworld.com/article/3007053/big-data/how-machine-learning-will-affect-your-business.html describes some of the ways in which you can expect machine learning to change how you do business. One of the points in this article is that machine learning typically works on 80 percent of the data. In 20 percent of the cases, you still need humans to take over the job of deciding just how to react to the data and then act upon it. The point is that machine learning saves money by taking over repetitious tasks that humans don’t really want to do in the first place (making them inefficient). However, machine learning doesn’t get rid of the need for humans completely, and it creates the need for new types of jobs that are a bit more interesting than the ones that machine learning has taken over. Also important to consider is that you need more humans at the outset until the modifications they make train the algorithm to understand what sorts of changes to make to the data.
Using existing data sources
Your organization has data hidden in all sorts of places. The problem is in recognizing the data as data. For example, you may have sensors on an assembly line that track how products move through the assembly process and ensure that the assembly line remains efficient. Those same sensors can potentially feed information into a machine learning scenario because they could provide inputs on how product movement affects customer satisfaction or the price you pay for postage. The idea is to discover how to create mashups that present existing data as a new kind of data that lets you do more to make your organization work well.
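A mashup of this kind might look like the following sketch, which totals the time each order spends on the assembly line and pairs it with a satisfaction score. The order IDs, station names, timings, and scores are invented for the example, and real sensor feeds would need considerably more cleaning first.

```python
from collections import defaultdict

# Hypothetical sensor readings: (order_id, station, seconds_at_station).
sensor_events = [
    ("A1", "assembly", 40), ("A1", "packing", 20),
    ("B2", "assembly", 90), ("B2", "packing", 60),
    ("C3", "assembly", 45), ("C3", "packing", 25),
]

# Hypothetical survey scores from a separate system, keyed by order ID.
satisfaction = {"A1": 5, "B2": 2, "C3": 4}

def total_time_per_order(events):
    """Sum the seconds each order spent across all stations."""
    totals = defaultdict(int)
    for order_id, _station, seconds in events:
        totals[order_id] += seconds
    return dict(totals)

def build_mashup(events, scores):
    """Pair each order's total line time with its satisfaction score."""
    totals = total_time_per_order(events)
    return [
        {"order_id": order_id,
         "line_seconds": totals[order_id],
         "satisfaction": scores[order_id]}
        for order_id in sorted(totals)
        if order_id in scores
    ]

mashup = build_mashup(sensor_events, satisfaction)
for record in mashup:
    print(record)
```

The resulting records form a new dataset that neither the sensors nor the survey system held on its own, which is the kind of input a machine learning algorithm could then examine for patterns.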