query, and tries to match it with a fact or rule head in P. If it matches a fact, then a new answer has been established. If it matches a rule head, the rule body becomes the new query, and the process continues, trying to match the goals in the rule body. For example, the query
does not match any facts directly, but it does match the head of the rule
if we replace P1
by dsw
. The body of the rule leaves the new query
which matches the two facts
resulting in two answers:
Matching the original query with the other adjacent
rule gives two additional answers. We will cover evaluation techniques in more detail in Section 1.5.
Three final points before we leave this overview of Datalog. First, it is common to classify the tuples in the database specified by a Datalog program into two parts. The extensional database, or EDB, is all the tuples corresponding to explicit facts in the program, for example, advised(dsw, lvk)
. The intensional database, or IDB, is all the tuples established by the rules in the program, for example, related(lvk, ks)
. While the same relation could draw tuples from both the EDB and the IDB, many authors assume each relation to be strictly extensional or intensional. This assumption is not really a limitation, since a Datalog program where it does not hold is easily rewritten into one where it does and that specifies the same database.
The second is a restriction usually placed on Datalog programs that the rules be safe, in the sense that any fact in the IDB contains only the values that appear in the EDB or the rules themselves. For example, the rule
is not safe, because the value of P2
is unconstrained and therefore P2
can be bound to any value, not necessarily the ones that appear in EDB. Generally, safety is guaranteed by a syntactic condition, such as requiring that every variable in the rule head also appear in some predicate in the rule body.
The third point is a notational convenience. The first Datalog rule could be written more succinctly as:
In this formulation, since the actual names of the variables do not matter, logical variables are consistently renamed with 1-letter variable names. Furthermore, inventing names for the variables that occur only once in a rule is an unnecessary burden and the underscore is often used to denote “anonymous” variables. Each occurrence of “_
” is treated as a new variable name, which the parser generates automatically.
1.2 The Emergence of Datalog
Since Datalog is a subset of Prolog, one could create and execute Datalog programs as soon as Prolog evaluators existed—around 1972. However, as a named concept and object of study, Datalog emerged in the mid-1980s, following work in deductive databases in the late 1970s. (See Section 1.3 for more on the naming of Datalog.) Why did Datalog emerge as a topic of interest at that point? It was because Datalog served as a “sweet spot”—or “middle ground”—for related research lines from Logic Programming, Database Systems, and Artificial Intelligence.
Why was Datalog interesting to the three communities? It was because pure Datalog was very simple and had a clean syntax and semantics. Yet it was expressive enough to serve as the basis for theoretical investigations and examination of evaluation alternatives, as well as a foundation from which extensions could be explored and a starting point for knowledge-representation systems. We summarize relevant trends in each of the three communities below, following each by more specific background.
Logic Programming
Logic programmers saw relational databases as an implementation of an important sublanguage, and worked to integrate them into Prolog systems. Since early Prolog implementations assumed all rules and facts were memory-resident, it was clear that for very large fact bases, something like relational database technology was necessary. Sometimes this enhancement took the form of a connection to a relational database management system (RDBMS). Translators were written to convert a subset of Prolog programs into SQL to be passed to an RDBMS and evaluated there. Datalog was a more powerful, more Prolog-ish, data-oriented subset of Prolog than SQL. Datalog was also more “declarative” than Prolog and many in the Logic Programming community liked that aspect. With Prolog, there was a serious gap between Logic Programming theory (programs understood as logical axioms and deduction) and Logic Programming practice (programs as evaluated by Prolog interpreters). For one thing, Prolog contained a number of non-logical primitives; for another, many perfectly logical Prolog programs would not terminate due to the weaknesses of the evaluation strategy used by the Prolog interpreters. With Datalog, that gap almost disappeared—deduction largely coincided with evaluation in terms of the produced results. Also, Datalog provided a basis on which to work out solutions for recursion with negation, some of which could be applied to more general logic languages.
Background. Logic programming grew out of resolution theorem proving proposed by Robinson [1965] and, especially, out of a particularly simplified version of it, called SLD resolution [Lloyd 1993, Kowalski 1974], which worked for special cases of logic, such as Horn clauses.5 In the early 1970s, researchers began to realize that SLD resolution combined with backtracking provided a computational model. In particular, Colmerauer and collaborators developed Prolog [Colmerauer and Roussel 1996] and Kowalski made significant contributions to the theory of logic programming, as it came to be called [Kowalski 1988]. Prolog was the starting point for languages in the Japanese Fifth-Generation Computer Systems (FGCS) project in the early 1980s [Fuchi and Furukawa 1987, Moto-oka and Stone 1984]. The FGCS project sought advances in hardware, databases, parallel computing, deduction and user interfaces to build high-performance knowledge-base systems. In the context of this project, Fuchi [1981] describes Prolog as a basis for bringing together programming languages and database query languages. D. H. D. Warren [1982b] provides interesting insights into the focus on Prolog for the FGCS. The use of Prolog to express queries dates from around the same time. For example, the Chat-80 system [Warren and Pereira 1982] analyzed natural-language questions and turned them into Prolog clauses that could be evaluated against stored predicates. (All examples there are actually in Datalog.) Warren [1981] showed that such queries were amenable to some database-style optimizations via Prolog rewriting and annotation. Prolog itself had been proposed for database query. For example, Maier [1986b] considers Prolog as a database query language and notes its advantages—avoiding the “impedance mismatch” between DBMS and programming language, expressive power, ease of transformation—but also points out its limits in terms of data definition, update, secondary storage, concurrency control and recovery. Zaniolo [1986] also notes that Prolog can be used to write complete database applications, avoiding the impedance mismatch. He further proposes extensions to Prolog for use with a data model supporting entity identity. For a recent survey of the history of logic programming, see Kowalski [2014].
Database Systems
As the relational model gained traction in the 1980s, limitations on expressiveness of the query languages became widely recognized by researchers and practitioners. In particular, fairly common applications—such as the transitive closure of a graph and bill-of-materials roll ups (i.e., aggregation of costs, weights, etc. in a partsubpart hierarchy)—could not be expressed with a single query in most relational query languages. Various approaches for enhancing expressiveness to handle such cases were proposed, such as adding