a declarative approach into a new domain or a domain where solutions were typically expressed imperatively, rather than of replacing another database language.
Semantic Web. The idea of the Semantic Web emerged in late 1990s as the next step in the evolution of the World Wide Web. Instead of HTML pages, the Semantic Web was to rely on machine-readable logical statements (most of which would be just database facts written in an open standard format called RDF [Lassila and Swick 1999]), making search results more relevant, giving rise to better recommendation systems, making Semantic Web services possible (for example, automatic arrangement of trips), and even enabling automated contract negotiation.
The two main research directions in this area are reasoning and machine learning; we focus here on the former. In reasoning, two approaches dominated: OWL [Grau et al. 2008], which is based on a particular strain of knowledge representation, called Description Logic [Baader et al. 2003], and rule-based reasoning, which is based on logic programming and includes Datalog. Both approaches have led to W3C recommendations (W3C calls its standards “recommendations”)—the already mentioned OWL and the Rule Interchange Format (RIF) [Boley and Kifer 2013a, Boley and Kifer 2013b], which is based on logic programming and, more specifically, on F-logic and HiLog (see Section 1.4.6). The expressive powers of OWL and of rules are quite different, which is why both approaches exist, each having its adherents. This bifurcation of efforts on the reasoning side has also led to a number of efforts to reconcile OWL with rules. First, it resulted in the development of an OWL subset, called OWL RL,22 which lies in the intersection of OWL and Datalog. Second, several approaches attempted to combine OWL and logic programming at a much more general level so as to take advantage of both paradigms [Eiter et al. 2004a, Eiter et al. 2004b, Motik et al. 2006].
Overall, building the Semantic Web turned out to be harder than originally envisioned, but the effort is proceeding apace. It resulted in Google’s Knowledge Graph, DBpedia, Wikidata, Semantic MediaWiki, and other resources. In addition, it led to a number of commercial Datalog-based systems for RDF stores such as Ontotext’s GraphDB,23 Franz’s AllegroGraph,24 and Apache’s Jena,25 to name a few. Many leading logic programming systems (for example, SWI Prolog, Ciao Prolog, and XSB) also have extensive web-related capabilities.
Datalog for Program Analysis. Program analysis is an example of an application of Datalog to an area that had largely used imperative approaches before. Figuring out program properties, such as call chains or pointer aliasing, often involves graph traversal or mutual recursion, which are directly expressed in Datalog. Lam et al. [2005] relate their experience with implementing program analyses to detect security holes in web applications. They turned to Datalog after coding their analysis algorithms in Java proved hard to get correct and efficient. They also found their Datalog implementation easier to maintain and extend. Their evaluation approach to Datalog was to translate it to relational algebra and thence to Boolean functions. Relations with program-structure and execution-context information are represented as characteristic functions (mappings of tuples to true
or false
). Each function is encoded into a compact structure, called a binary decision diagram, upon which the Boolean functions can be efficiently evaluated. The CodeQuest system [Hajiyev et al. 2006] used Datalog to query program information, aimed in particular at enforcing style rules and programming conventions that the compiler does not enforce, such as requiring all subclasses of an abstract class to reimplement a particular method. The developers report that Datalog hits a “sweet spot” between expressiveness and efficiency. While not working with enormous datasets, the space requirements can exceed main memory, with code bases containing more than one million lines of code and greater than 10,000 classes. Their approach to evaluation is to send Datalog through an optimizing compiler that emits SQL, using iteration in stored procedures or SQL’s Common Table Expressions to handle recursion. The Semmle product takes a similar approach for code inspection and analysis. However, queries are written in an SQL-like language that is translated to Datalog and thence to relational database queries [de Moor et al. 2007]. The Doop system [Smaragdakis and Bravenboer 2011] also queries programs using Datalog, such as performing points-to analysis—determining all the memory locations to which a pointer variable might refer. They report that their analyses are as precise as previous methods, but faster. Their evaluation relies on the aggressive optimization of Datalog, as well as programmer-assisted indexing of EDB relations.
The use of Datalog for program analysis illustrates several themes that emerge across recent applications that motivate some of the renewed interest in the language. One is handling of complex reasoning. For example, program analysis involves mutual recursion among different properties being reported, such as pointsto, call graph and exception analysis. Also, the algorithms are expressed in a way that they can often be readily modified via adding an additional parameter in a predicate, or a few more rules. For example, Lam et al. [2005] point out that the Datalog rules for their context-sensitive points-to analysis are almost identical to their counterparts in context-insensitive points-to analysis. In contrast, capturing the reasoning and relationships for program analysis, along with appropriate control strategies, in an imperative programming language is a labor-intensive and error-prone task. Also, while the data sizes involved are not staggering, they are large enough that manual optimization of code may not be sufficient to deal with them efficiently. Another use of Datalog illustrated by the program-analysis domain is the use of Datalog as an intermediate form. It is often fairly easy to write a translator from a domain-specific syntax to Datalog. From there, a wide variety of implementation and optimization options are available. As we have seen, there are multiple control strategies based on top-down and bottom-up methods, and representation alternatives for data and rules, such as binary decision diagrams. Although not discussed extensively in this chapter, Datalog admits a range of parallel and distributed evaluation techniques. With these approaches, Datalog performance can match or exceed that of custom analysis tools [de Moor et al. 2007, Smaragdakis and Bravenboer 2011].
Declarative Networking. Datalog has also gained popularity for reasoning about networks. The general area of declarative networking [Loo et al. 2009] has grown to encompass network protocols, services, analysis, and monitoring. It is an area with many examples of recursive reasoning over both extensional (link tables) and intensional (state-transition rules) information. Describing network algorithms in Datalog exposes the essential differences between them, generates predicates that can be reused across algorithms, and suggests new alternatives [Loo et al. 2006]. Simple syntactic extensions of Datalog, such as location variables, can capture key aspects of the domain. A range of distributed evaluation techniques are available, including asynchronous data-flow computation derived from the convergence properties of Datalog. (Bottom-up evaluation of a monotonic Datalog program will always converge to the same result regardless of the order in which rules are applied.)
The P2 facility [Loo et al. 2005] supports the expression of overlay networks, in a (cyclic) data-flow framework using asynchronous messages. The authors found that using Datalog as a basis supported succinct expression of overlay algorithms. For example, the Chord overlay algorithm can be expressed in 47 rules, with many reusable parts, in contrast to finite-automaton-based approaches that are less concise and more monolithic, and the P2 version provides acceptable performance. The Datalog-based approach called out similarities and distinctions among different algorithms. (For example, a wired and wireless protocol were found to differ only in predicate order in one rule body.) Their approach also made hybrid algorithms easy to construct. Overlay specifications in P2 are written in Overlog, a Datalog-based language with extension for location attributes in tuples and continuous queries over a changing EDB (including handling of deletion). Overlog is translatable into data-flow graphs with local Datalog evaluation engines communicating via asynchronous messages.
Loo et al. [2006] extend the P2 work, looking at general techniques for prototyping, deploying and evolving continuous recursive queries over networks. In that work, the network is both the object of study and the vehicle for evaluation. Their language NDlog (for Network Datalog) emphasizes this latter aspect through both location flags in predicates and an explicit #link
EDB predicate that lists all direct node-to-node connections in the network. Every predicate in NDlog must have a location-specifier