you go to their web site, and find a list of their locations. Some of them are near the park. Why didn’t the park know about that? How could Mongotel have published its locations in a way that the park’s web site could have found them?
Going one step further, you want to figure out which of the Mongotel locations is nearest to the park. You have the address of the park, and the addresses of the hotel locations. And you have any number of mapping services on the Web. One of them shows the park, and some hotels nearby, but it doesn’t have all the Mongotel locations. So you spend some time copying and pasting the addresses from the Mongotel page to the map, and you do the same for the park. You think to yourself, “Why should I be the one to copy this information from one page to another? Whose job is it to keep this information up to date?” Of course, Mongotel would be very happy if the data on the mapping page were up to date. What can they do to make this happen?
Suppose you are maintaining an amateur astronomy resource, and you have a section about our solar system. You organize news and other information about objects in the solar system: stars (well, there’s just one of those), planets, moons, asteroids, and comets. Each object has its own web page, with photos, essential information (mass, albedo, distance from the sun, shape, size, what object it revolves around, period of rotation, period of revolution, etc.), and news about recent findings, observations, and so on. You source your information from various places; the reference data comes from the International Astronomical Union (IAU), and the news comes from a number of feeds.
One day, you read in the newspaper that the IAU has decided that Pluto, which up until 2006 was considered a planet, should be considered a member of a new category called a “dwarf planet”! You will need to update your web pages, since not only has the information on some pages changed, but so has the way you organize it; in addition to your pages about planets, moons, asteroids, and so on, you’ll need a new page about “dwarf planets.” But your page about planets takes its information from the IAU already. Is there something they could do, so that your planet page would list the correct eight planets, without you having to re-organize anything?
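One way to picture the question is a page whose sections are computed from published classification data rather than written by hand. The following is only a sketch under stated assumptions: the `feed` dictionary and the `page_sections` function are hypothetical names invented for illustration, and nothing here reflects a format the IAU actually publishes.

```python
from collections import defaultdict

def page_sections(feed):
    """Group object names by whatever category the feed assigns them."""
    sections = defaultdict(list)
    for name, category in feed.items():
        sections[category].append(name)
    return {category: sorted(names) for category, names in sections.items()}

# Hypothetical feed contents: object name -> category, as classified before 2006.
feed = {
    name: "planet"
    for name in [
        "Mercury", "Venus", "Earth", "Mars", "Jupiter",
        "Saturn", "Uranus", "Neptune", "Pluto",
    ]
}
before = page_sections(feed)

# The source changes one classification; the page code stays the same.
feed["Pluto"] = "dwarf planet"
after = page_sections(feed)
```

Because the sections are derived from the data, the new “dwarf planet” section appears, and the planet section shrinks to eight entries, without any change to the page logic.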
You have an appointment with your dentist, and you want to look them up. You remember where the office is (somewhere on East Long Street) and the name of the dentist, but you don’t remember the name of the clinic. So you look for dentists on Long Street. None of them ring a bell. When you finally find their web page, you see that they list themselves as “oral surgeons,” not dentists. Whose job is it to know all the ways a dentist might list themselves?
You are a scientist researching a particular medical condition, whose metabolic process is well understood. From this process, you know a number of compounds that play a role in it. Researchers around the world have published experimental results about organic compounds linked to human metabolism. Have any experiments been done on any of the compounds you are interested in? What did they measure? How can the scientists of the world publish their data so that you can find it?
Tigerbank lends money to homeowners in the form of mortgages, as does Makobank; some of these mortgages are at fixed interest rates, and some float according to a published index. A clever financial engineer puts together a deal where one of Tigerbank’s fixed loan payments is traded for one of Makobank’s floating loan payments. These deals make sense for people who want to mitigate the different risk profiles of these loans. Is this sort of swap a good deal or not? To decide, we have to compare the terms of Tigerbank’s loan with those of Makobank’s loan. How can the banks describe their loans in terms that participants can use to compare them?
What do these examples have in common? In each case, someone has knowledge of something that they want to share. It might be about their business (hours, daily special, locations, business category), or scientific data (experimental data about compounds, the classification of a planet), or information about complex instruments that they have built (financial instruments). It is in the best interests of the entities with the data to publicize it to a community of possible consumers, and make it available via many channels: the web page itself, but also via search engines, personal assistants, mash-ups, review sites, maps, and so on. But the data is too idiosyncratic, or too detailed, or just too complex to simply publicize by writing a description of it. In fact, it is so much in their interest to get this data out that they are willing to put some effort into finding the people who need their data and working out how those people can use it.
Social data
A special case of the desire to share data is social networking. Billions of people share data about their personal and professional lives on a number of social web sites. It is worth their while to share this data, as it gives them ways to find new friends, keep in touch with old friends, make business connections, and gain many other advantages.
Social and professional networking is done in a non-distributed way. Someone who wants to share their professional or personal information signs up for a web service (common ones today include Facebook, LinkedIn, Instagram, and WeChat; others have come and gone, and more will probably appear as time goes on), creates an account that they control, and provides data in the form of daily updates, photos, tags of places they’ve been and people they’ve been with, projects they have started or completed, jobs they have done, and so on. This data is published for their friends and colleagues, and indeed in some cases for perfect strangers, to search and view.
In these cases, the service they signed up for owns the data, and can use it for various purposes. Most people have experienced the eerie effect of having mentioned something in a social network, only to find a related advertisement appear on their page the following day.
Advertising is a lucrative but mostly harmless use of this data. Other uses are less benign. In 2018, it was discovered that data from Facebook for millions of users had been used to influence a number of high-profile elections around the world, including the US presidential election of 2016 and the so-called “Brexit” referendum in the UK [Meredith 2018]. Many users were surprised that this could happen; they had shared their data in a centralized repository over which they had no control.
This example shows the need for a balance of control: yes, I want to share my data, as in the examples of Section 1.2, but I want to share it with certain people and not with others (as is the case in this section). How can we manage both of these desires? This is a problem of distributed data; I need to keep data to myself if I want to control it, but it has to connect to data around the world to satisfy the reasons why I publish it in the first place.
Learning from data
Data Science has become one of the most productive ways to make business predictions. It is used across many industries, in marketing, demand forecasting, risk evaluation, and many other settings in which it is useful to predict how some person will behave or how well some product will perform.
Banking provides some simple examples. A bank is in the business of making loans, sometimes mortgages for homeowners, or automobile loans, small-business loans, and so on. As part of the loan application process, the bank learns a good deal about the borrower. Most banks have been making loans for many decades, and have plenty of data about the eventual disposition of these loans (for example: Did the borrower default? Was the loan paid off early? Was it refinanced?). Given large amounts of this data, machine learning techniques can predict the eventual disposition of a loan based on information gathered at the outset. This, in turn, allows the bank to be more selective in the loans it makes, and so more effective in its market.
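The idea of predicting a loan’s disposition from information gathered at the outset can be sketched in a few lines. This is a toy illustration, not a description of how any bank actually works: the two features (debt-to-income ratio and years at current job), the eight history rows, and all function names are invented for the example, and a small logistic regression trained by gradient descent stands in for whatever technique a real bank might choose.

```python
import math

# Invented toy history: (debt_to_income_ratio, years_at_current_job, defaulted).
# These rows are made up for illustration; they are not real loan data.
HISTORY = [
    (0.10, 12, 0),
    (0.15, 8, 0),
    (0.20, 10, 0),
    (0.25, 6, 0),
    (0.45, 2, 1),
    (0.50, 1, 1),
    (0.55, 3, 1),
    (0.60, 1, 1),
]

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(rows, learning_rate=0.1, epochs=2000):
    """Fit logistic-regression weights by plain batch gradient descent."""
    w_dti, w_years, bias = 0.0, 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        g_dti = g_years = g_bias = 0.0
        for dti, years, defaulted in rows:
            # Gradient of the log-loss for one row is (prediction - label) * feature.
            error = sigmoid(w_dti * dti + w_years * years + bias) - defaulted
            g_dti += error * dti
            g_years += error * years
            g_bias += error
        w_dti -= learning_rate * g_dti / n
        w_years -= learning_rate * g_years / n
        bias -= learning_rate * g_bias / n
    return w_dti, w_years, bias

def default_probability(model, dti, years):
    """Estimated probability of default for a new application."""
    w_dti, w_years, bias = model
    return sigmoid(w_dti * dti + w_years * years + bias)

MODEL = train(HISTORY)
p_risky = default_probability(MODEL, dti=0.60, years=1)
p_safe = default_probability(MODEL, dti=0.10, years=12)
```

On this toy history, the model assigns a higher default probability to the high-debt, short-tenure application than to the low-debt, long-tenure one, which is the kind of signal a bank could use to be more selective.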
This basic approach has been applied to marketing (identifying more likely sales leads), product development (identifying which features will sell best), customer retention (identifying problems before they become too severe to deal with), medicine (identifying diseases based on patterns in images and blood tests), route planning (finding best routes for airplanes), sports (deciding which players to use at what time), and many other high-profile applications.
In all of these cases, success relied on the availability of meaningful data. In the case of marketing, sales, and manufacturing applications, the data comes from a single source, that is, the sales behavior of the customers of a single company. In the case of sports, the statistical data for the sport has been normalized by sports