Dean Allemang

Semantic Web for the Working Ontologist


already aligned into a single representation. This is an important step that allows machine learning algorithms to generalize the data.

      The only example in this list where the data is distributed is medicine, where diagnoses come from hospitals and clinics from around the world. This is not an accident; in the case of medicine, disease and treatment codes have been in place for decades to align data from multiple sources.

      How can we extend the success of machine learning in medicine to other industries, taking machine learning from the enterprise level to the level of a whole industry? We need a way to link together data that is distributed throughout an industry.

      In the restaurant example, we had data (opening hours, daily special, holiday closings) published so that it can be read by the human eye, but our automated assistant couldn’t read it. One solution would be to develop sophisticated algorithms that can read web pages and figure out the opening hours based on what they see there. But the restaurant owner knows the hours, wants prospective patrons to find them, and wants them to be accurate. Why should a restaurant owner rely on some third party to facilitate communication with their customers?

      A scientific paper that reports on an experimental finding has a very specific audience: other researchers who need to know about that compound and how it reacts in certain circumstances. It behooves both the author and the reader to match these up. Once again, the author does not want to rely on someone else to communicate their value.

      This story repeats at every level; a bank has more control over its own instruments if it can communicate their terms in a clear and unambiguous way (to partners, clients, or regulators). The IAU’s charter is to keep the astronomical community informed about developments in observations and classifications. Dentists want their patients to be able to find their clinics.

      The unifying theme in all of these examples is a move from a presentation of information for a specific audience, requiring interpretation by a human being, to an exchange of data between machines. Instead of relying solely on human intuition to interpret the data, we meet halfway: data providers make the data easier to consume. We take advantage of the providers’ desire to share data to make it easier to consume.

      Instead of thinking of each data source as a single point that communicates one thing to one person, we think of it as a multi-use part of an interconnected network of data. Human users and machine applications that want to make use of this data collaborate with data providers, taking advantage of the fact that it is profitable to share your data.

       A distributed web of data

      The Semantic Web takes this idea one step further, applying it to the Web as a whole. The Web architecture we are familiar with supports a distributed network of hypertext pages that can refer to one another with global links called Uniform Resource Locators (URLs). The Web architecture generalizes this notion to a Uniform Resource Identifier (URI), allowing it to be used in contexts beyond the hypertext Web.

      The main idea of the Semantic Web is to support a distributed Web at the level of the data rather than at the level of the presentation. Instead of just having one web page point to another, one data item can point to another, using the same global reference mechanism that the Web uses: URIs. When Mongotel publishes information about its hotels and their locations, or when Copious publishes its opening hours, they don’t just publish a human-readable presentation of this information but also a distributable, machine-readable description of the data.
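
      As a rough illustration of one data item pointing to another through a global identifier, the sketch below models two independent publishers as Python dictionaries keyed by URI. Everything in it is an assumption made for illustration: the `example.org` URIs, the property names `locatedIn` and `country`, the city used, and the `lookup` helper are hypothetical and do not come from Mongotel or from any Semantic Web standard.

```python
# Hypothetical data published by a hotel chain. The URIs are invented
# placeholders under example.org, not real identifiers.
MONGOTEL = {
    "http://mongotel.example.org/hotel/boston": {
        "name": "Mongotel Boston",
        # A link to a data item described elsewhere, not a text label.
        "locatedIn": "http://places.example.org/city/boston",
    },
}

# Hypothetical data published by a different, unrelated source.
PLACES = {
    "http://places.example.org/city/boston": {
        "name": "Boston",
        "country": "http://places.example.org/country/us",
    },
}


def lookup(uri, *sources):
    """Return the first description of `uri` found in any source, or None."""
    for source in sources:
        if uri in source:
            return source[uri]
    return None


# Start at the hotel and follow its link into the other publisher's data.
hotel = lookup("http://mongotel.example.org/hotel/boston", MONGOTEL, PLACES)
city = lookup(hotel["locatedIn"], MONGOTEL, PLACES)
print(city["name"])
```

      Because both publishers use the same global identifier for the city, the hotel data can refer to the city data without either publisher knowing how the other stores it.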

      The Semantic Web faces the problem of distributed data head-on. Just as the hypertext Web changed how we think about availability of documents, the Semantic Web is a radical way of thinking about data. At first blush, distributed data seems easy: just put databases all over the Web (data on the Web). But in order for this to act as a distributed web of data, we have to understand the dynamics of sharing data among multiple stakeholders across a diverse world. Different sources can agree or disagree, and data can be combined from different sources to gain more insight about a single topic.

      Even within a single company, data can be considered as a distributed resource. Multiple databases, from different business units, or from parts of the business that were acquired through merger or corporate buy-out, can be just as disparate as sources from across the Web. Distributed data means that the data comes from multiple stakeholders, and we need to understand how to bring the data together in a meaningful way.

      Broadly speaking, data makes a statement that relates one thing to another, in some way. Copious (one thing) opens (a way to relate to something else) at 5:00pm (another thing, this time a value). They serve (another way to relate to something) chicken and waffles (this time, a dish), which itself is made up (another way to relate) of some other things (chicken, waffles, and a few others not in its name, like maple syrup). Any of these things can be represented at any source in a distributed web of data. The data model that the Semantic Web uses to represent this distributed web of data is called the Resource Description Framework (RDF) and is the topic of Chapter 3.
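
      The statements in this paragraph can be sketched as simple three-part tuples: one thing, a way of relating, and another thing or value. This is only a minimal Python illustration, not RDF syntax. The `http://example.org/` namespace, the relation names (`opensAt`, `serves`, `madeOf`), and the `objects` helper are hypothetical names chosen for the example.

```python
# Hypothetical namespace; every identifier built from it is illustrative only.
EX = "http://example.org/"

# Each statement relates one thing to another thing (or to a value).
statements = {
    (EX + "Copious", EX + "opensAt", "17:00"),  # a value
    (EX + "Copious", EX + "serves", EX + "ChickenAndWaffles"),  # a dish
    (EX + "ChickenAndWaffles", EX + "madeOf", EX + "Chicken"),
    (EX + "ChickenAndWaffles", EX + "madeOf", EX + "Waffles"),
    (EX + "ChickenAndWaffles", EX + "madeOf", EX + "MapleSyrup"),
}


def objects(subject, relation):
    """Return, sorted, everything that `subject` is related to by `relation`."""
    return sorted(o for s, p, o in statements if s == subject and p == relation)


print(objects(EX + "Copious", EX + "opensAt"))
print(objects(EX + "ChickenAndWaffles", EX + "madeOf"))
```

      Since every statement stands on its own, any of them could be held by a different source without changing how the others are read.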

       Features of a Semantic Web

      The WWW was the result of a radical new way of thinking about sharing information. These ideas seem familiar now, as the Web itself has become pervasive. But this radical new way of thinking has even more profound ramifications when it is applied to a web of data like the Semantic Web. These ramifications have driven many of the design decisions for the Semantic Web standards and have a strong influence on the craft of producing quality Semantic Web applications.

       Give me a voice…

      On the WWW, publication is by and large in the hands of the content producer. People can build their own web page and say whatever they want on it. A wide range of opinions on any topic can be found; it is up to the reader to come to a conclusion about what to believe. The Web is the ultimate example of the warning caveat emptor (“Let the buyer beware”). This feature of the Web is so instrumental in its character that we give it a name, the AAA Slogan: “Anyone can say Anything about Any topic.”

      In a web of hypertext, the AAA slogan means that anyone can write a page saying whatever they please and publish it to the Web infrastructure. In the case of the Semantic Web, it means that our architecture has to allow any individual to express a piece of data about some entity in a way that can be combined with data from other sources. This requirement sets some of the foundation for the design of RDF.
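
      One way to picture this requirement is a merge in which no source needs permission from any other, and in which disagreements survive rather than being overwritten. The sketch below is a plain Python illustration under stated assumptions, not how RDF itself is defined: the two sources (`owner`, `reviewer`), the 17:30 opening time, the rating, and the helper names `merge` and `values_for` are all invented for the example.

```python
from collections import defaultdict

# Hypothetical identifier for the restaurant, shared by both sources.
COPIOUS = "http://example.org/Copious"

# Two independent sources say things about the same entity (invented data).
owner_says = {
    (COPIOUS, "opensAt", "17:00"),
    (COPIOUS, "serves", "chicken and waffles"),
}
reviewer_says = {
    (COPIOUS, "opensAt", "17:30"),  # disagrees with the owner
    (COPIOUS, "rating", "4 stars"),
}


def merge(**sources):
    """Combine statements from named sources, remembering who said what."""
    merged = defaultdict(set)
    for name, stmts in sources.items():
        for stmt in stmts:
            merged[stmt].add(name)
    return merged


def values_for(merged, subject, relation):
    """Map each value asserted for (subject, relation) to the sources asserting it."""
    result = defaultdict(set)
    for (s, p, o), names in merged.items():
        if s == subject and p == relation:
            result[o] |= names
    return dict(result)


merged = merge(owner=owner_says, reviewer=reviewer_says)
print(values_for(merged, COPIOUS, "opensAt"))
```

      The merge works only because both sources refer to the restaurant by the same identifier, and it keeps both opening times so that the consumer of the data, like the reader of a web page, decides what to believe.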

      It also means that the Web is like a data wilderness—full of valuable treasure, but overgrown and tangled. Even the valuable data that you can find can take any of a number of forms, adapted to its own part of the wilderness. In contrast to the situation in a large, corporate data center, where one database administrator rules with an iron hand over any addition or modification to the database, the Web has no gatekeeper. Anything and everything can grow there. A distributed web of data is an organic system, with contributions coming from all sources. While this can be maddening for someone trying to make sense of information on the Web, this freedom of expression on the Web is what allowed it to take off as a bottom-up, grassroots phenomenon.

       … So I may speak!

      In the early days of the hypertext Web, it was common for skeptics, hearing for the first time about the possibilities of a worldwide distributed web full of hyperlinked pages on every topic, to ask, “But who is going to create all that content? Someone has to write those web pages!”

      To the surprise of those skeptics, and even of many proponents of the Web, the answer to this question was that everyone would provide the content. Once the Web infrastructure was in place (so that Anyone could say Anything about Any topic), people came out of the woodwork to do just that. Soon every topic under the sun had a web page, either official or unofficial. It turns out that a lot of people had something to say, and they were willing to put some work into saying it. As this trend continued, it resulted in collaborative “crowdsourced” resources like Wikipedia and the Internet Movie Database (IMDb)—collaboratively edited information sources with