terms of asymptotic time and space complexities, Tekle and Liu [2010, 2011] have recently established the precise relationship between (1) bottom-up evaluation using Demand Transformation and Subsumptive Demand Transformation, as well as Magic Sets, and (2) top-down evaluation using variant tabling and subsumptive tabling. However, the actual performance varies widely due to constant factors, and performance of internal data structures. Faced with the ambiguity of such analyses, some Datalog systems include both bottom-up and top-down methods, though determining the best method and optimizations to apply to a particular Datalog program and dataset is far from a solved problem.
1.5.4 Evaluation Methods Using Database Systems
An obvious approach to scaling is to use a relational DBMS to manage the ground facts in a Datalog program, which are easily stored as rows in a table, where each predicate has a separate table. The DBMS might be used just for secondary-storage management and indexing capabilities, or for more general query processing and evaluation. As an example of the former, consider top-down evaluation of the query related(lvk, Y)
. If we are considering the rule
we will need to solve the sub-goal advised(A, lvk)
. If advised
is stored in a DBMS, and indexed on the second position, the database could quickly retrieve all facts having lvk
as the value in the second position. Even if we were using a very simple top-down strategy, and only considering advised(A,lvk)
-facts one at a time, bringing them into memory in a batch and caching them can save I/O time. As an example of the more general strategy, consider a rule-compilation approach. When working on a query, the evaluator can accumulate EDB goals, while trying to solve IDB goals with rules. Then, when a goal list contains only EDB goals, it can be converted to a database query. For example, if for the query related(lvk, Y)
we try to solve the goal using the rule
we obtain a goal list advised(B, lvk), related(B, Y)
. If we then, in turn, solve the related/2
goal with the previous rule, the resulting goal list is
which consists entirely of EDB goals. This goal list can be converted to the SQL query
If there are multiple ways to solve IDB goals, then there will be multiple database queries. (In the presence of recursion, the number of queries may be unbounded, however.)
Bottom-up evaluation can also make use of DBMS capabilities. The database can handle all or part of evaluation of each application of TP for a Datalog program P. A rule that mentions only EDB predicates can be translated into an SQL query. A rule that contains one or more IDB predicates needs to also access newly derived facts during right-to-left evaluation. We can create temporary tables in the database for each IDB predicate, to hold derived IDB facts, thereby having TP execute entirely within the database. Alternatively, a bottom-up evaluator might hold IDB facts in its own memory, and evaluate each rule with a nested-loop approach, issuing database queries for EDB facts, taking into account each combination of bindings determined by the predicates outside of those queries. We discuss specific instances of these approaches in the following paragraphs—first TD methods and then BU methods.
In terms of specific approaches, top-down methods tend to be more “proof” oriented, driven by derivation, whereas bottom-up methods are more “syntactic,” based on the structure of the program rules. Reiter [1977a] talks about an initial “compilation” phase where the IDB is consulted, resulting in a set of queries against the EDB. More specifically, he describes feeding the IDB to a theorem prover that flags EDB literals for later evaluation, and proposes that evaluation take place in a relational database. (Reiter notes that his particular method will not be complete for recursive queries.) The DADM [Kellogg et al. 1977] and DEDUCE 2 [Chang 1977] systems also present approaches where the IDB rules are first processed to yield (perhaps multiple) goal lists of EDB literals that can be evaluated against a relational database. Cuppens and Demolombe [1988] describe an accumulation approach to collect maximal sequences of EDB predicates for database evaluation. Grant and Minker [Grant and Minker 1981] note that the collection of EDB queries generated by such approaches will often have shared subexpressions that can be factored out for efficiency. Ceri et al. [1989] address overlap of database retrievals by caching the results returned from queries. The BERMUDA system [Ioannidis et al. 1988] takes a similar approach, collecting sequences of EDB predicates for evaluation by the database, caching results in a file, and having a “loader” component provide access to answers to a Prolog interpreter on demand.
On the bottom-up side, it has long been known that the immediate consequence operator TP for a Datalog program P can be translated into a relational-algebra expression. Ceri et al. [1986] show that a complete translation of bottom-up evaluation can be obtained by extending relational algebra with a fixpoint operator. Ceri and Tanca [1987] discuss various optimizations that can be applied to such a translation. The PRISMAlog language [Houtsma and Apers 1992] was a Datalog extension that was also translated to relational algebra extended with a general fixpoint operator, then optimized into standard relational-algebra expressions with possibly transitive closure over them. Relational algebra generally needs to be translated to SQL before sending it to a relational DBMS. Draxler [1993] provides a direct translation of a logic language into SQL, bypassing relational algebra. The NED-2 Intelligent Information System [Maier et al. 2002] introduced a specialized syntax for database queries in a logic language that are transmitted to the DBMS as SQL. Finally, van Emde Boas and van Emde Boas [1986] describe an approach to extending a relational DBMS (IBM’s Business System 12) with logic programming functionality, but it is unclear whether the project got beyond the prototype stage.
1.5.5 Query Optimization in Datalog
Optimization of Datalog queries has been studied from a theoretical perspective as well, mostly in the 1980s. Since the optimization problem is equivalent to finding another Datalog program that computes an equivalent set of facts given a set of facts for extensional predicates, but is faster to evaluate, optimization has been mainly looked at in the context of query equivalence or, more generally, query containment. Query containment means that the results of one query are always guaranteed to be a subset of the results of another query. Showing containment in both directions proves equivalence. Many Datalog optimization techniques involve simplifying the original Datalog program by removing rules or subgoals and then testing for equivalence with the original. Unfortunately, the equivalence problem for Datalog queries has been shown to be undecidable [Shmueli 1993]. Furthermore, query equivalence turned out to be undecidable for many subclasses of Datalog, e.g., even if each rule contains at most one recursive predicate [Gaifman et al. 1993]. Thus, fragments of Datalog for which the query containment problem is decidable have been of strong interest. In particular, Cosmadakis et al. [1988] have shown that, for monadic Datalog, where only predicates with one arguments are recursively defined, query containment is decidable and that if binary relations are allowed to be recursively defined then the problem is undecidable. A stronger version of query equivalence, called uniform equivalence, is the decision problem of whether two Datalog programs compute an equivalent set of facts given initial facts for both intensional and extensional predicates. Sagiv [1987] showed that uniform equivalence is decidable, and provided an algorithm for minimizing a set of Datalog rules with respect to uniform equivalence. For an even simpler subset of Datalog, with no recursion altogether, called union of conjunctive queries, many positive results regarding query optimization have been obtained, especially in the context of SQL [Chaudhuri 1998]. More recently, Vardi [2016] showed that the query containment problem for union of conjunctive queries extended with transitive closure is also decidable. For Datalog optimization techniques that involve constraints, see Section 1.4.5. Another class of equivalence results looks at whether a (syntactically) recursive Datalog program has a non-recursive