You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
174 lines
8.6 KiB
ReStructuredText
174 lines
8.6 KiB
ReStructuredText
====================================
|
|
Query execution
|
|
====================================
|
|
|
|
Usually, in a database,
|
|
how queries are executed is usually considered an implementation detail
|
|
hidden behind an abstraction barrier, which normal users need not care about.
|
|
The idea is that databases will take advantage of this abstraction barrier
|
|
by using query optimizers to choose the best query plan,
|
|
regardless of how the query was originally written.
|
|
As everyone knows, however, this abstraction barrier is leaky,
|
|
since bad query execution plans invariably occur and hurt performance.
|
|
The problem is especially severe when dealing with graphs,
|
|
since graph traversals generally use a lot more joins than non-graph queries,
|
|
and the reliability of even the best query optimizer
|
|
decreases exponentially with the number of joins.
|
|
|
|
In Cozo we take the pragmatic approach and assume that the user eventually
|
|
knows (or should know) what is the best way to query their data.
|
|
This is certainly true for those developers who spend hours
|
|
"coercing" the query optimizers to use a query plan that the user intends,
|
|
sometimes in rather convoluted ways.
|
|
In Cozo, no coercion is necessary since the query execution is completely
|
|
determined by how the query is written:
|
|
there is no stats-based query planning involved.
|
|
In our experience, this saves quite a lot of developer time,
|
|
since developers eventually learn how to write efficient queries naturally,
|
|
and after they do, they no longer have to deal with endless "query de-optimizations".
|
|
|
|
--------------------------------------
|
|
Stratification
|
|
--------------------------------------
|
|
|
|
As discussed in the chapter on queries, Cozo sees a query as a set of named rules.
|
|
Fixed rules are left as they are,
|
|
and inline rules are converted into disjunctive normal forms.
|
|
After conversion, all inline rules consist of conjunction of atoms only,
|
|
and negation only occurs for the leaf atoms.
|
|
|
|
The next step towards executing the query is *stratifying* the rules.
|
|
Stratification begins by making a graph of the named rules,
|
|
with the rules themselves as nodes,
|
|
and a link is added between two nodes when one of the rules applies the other.
|
|
This application is through atoms for inline rules, and input relations for fixed rules.
|
|
Now some of the links are labelled *stratifying*:
|
|
when an inline rule applies another rule through negation,
|
|
when an inline rule applies another inline rule that contains aggregations,
|
|
when an inline rule applies itself and it has non-semi-lattice,
|
|
when an inline rule applies another rule which is a fixed rule,
|
|
or when a fixed rule has another rule as an input relation.
|
|
The strongly connected components of this graph are then determined and tested,
|
|
and if it found that some strongly connected component contains a stratifying link,
|
|
the graph is deemed *unstratifiable*, and the execution aborts.
|
|
Otherwise, Cozo will topologically sort the strongly connected components to
|
|
determine a *stratification* of the rules:
|
|
rules within the same stratum are logically executed together,
|
|
and no two rules within the same stratum can have a stratifying link between them.
|
|
In this process,
|
|
Cozo will merge the strongly connected components into as few supernodes as possible
|
|
while still maintaining the restriction on stratifying links.
|
|
The resulting strata are then passed on to be processed in the next step.
|
|
|
|
You can see the stratum number assigned to rules by using the ``::explain`` system op.
|
|
|
|
--------------------------------------
|
|
Magic set rewrites
|
|
--------------------------------------
|
|
Within each stratum, the input rules are rewritten using a technique called magic sets.
|
|
In intuitive terms, this rewriting is to ensure that the query execution does not
|
|
waste time calculating results that are then simply discarded.
|
|
As an example, consider::
|
|
|
|
reachable[a, b] := link[a, n]
|
|
reachable[a, b] := reachable[a, c], link[c, b]
|
|
?[r] := reachable['A', r]
|
|
|
|
Without magic set rewrites, the whole ``reachable`` relation is generated first,
|
|
then most of them are thrown away, keeping only those starting from ``'A'``.
|
|
Magic set avoids this problem. How the rewrite proceeds is rather technical,
|
|
but you can see the results in the output of ``::explain``.
|
|
|
|
The rewritten query is guaranteed to yield the same relation for ``?``,
|
|
and will in general yield fewer intermediate rows.
|
|
|
|
Currently, the rewrite applies only to inline rules without aggregations.
|
|
So for the moment being, you may need to manually constrain some of your rules.
|
|
|
|
--------------------------------------
|
|
Semi-naïve evaluation
|
|
--------------------------------------
|
|
|
|
Now each stratum contains either a single fixed rule or a set of inline rules.
|
|
The single fixed rule case is easy: just run the specific implementation of the rule.
|
|
In the case of the inline rules, each of the rules is assigned an output relation.
|
|
Assuming we know how to evaluate each rule given all the relations it depends on,
|
|
the semi-naïve algorithm can now be applied to the rules to yield all output rows.
|
|
|
|
The semi-naïve algorithm is a bottom-up evaluation strategy, meaning that it tries to deduce
|
|
all facts from a set of given facts.
|
|
By contrast, top-down strategies start with stated goals and try to find proof for the goals.
|
|
Bottom-up strategies have many advantages over top-down ones when the whole output of each rule
|
|
is needed, but may waste time generating unused facts if only some of the output is kept.
|
|
Magic set rewrites are introduced to eliminate precisely this weakness.
|
|
|
|
|
|
---------------------------------------
|
|
Ordering of atoms
|
|
---------------------------------------
|
|
|
|
Now we discuss how a single definition of an inline rule is evaluated.
|
|
We know from the query chapter that the body of the rule contains atoms,
|
|
and after conversion to disjunctive normal forms, all atoms are linked by conjunction,
|
|
and each atom can only be one of the following:
|
|
|
|
* an explicit unification,
|
|
* applying a rule or a stored relation,
|
|
* an expression, which should evaluate to a boolean,
|
|
* a negation of an application.
|
|
|
|
The first two cases may introduce fresh bindings, whereas the last two cannot.
|
|
The atoms are then reordered: all atoms that introduce new bindings stay where they are,
|
|
whereas all atoms that do not introduce new bindings are moved to the earliest possible place
|
|
where all their bindings are bound. In fact,
|
|
all atoms that introduce bindings correspond to
|
|
joining with a pre-existing relation followed by projections
|
|
in relational algebra, and all atoms that do not correspond to filters.
|
|
The idea is to apply filters as early as possible
|
|
to minimize the number of rows before joining with the next relation.
|
|
|
|
This procedure is completely deterministic.
|
|
When writing the body of rules, we therefore should aim to minimize the total number of rows generated.
|
|
A strategy that works almost in all cases is to put the most restrictive atoms which generate new bindings first,
|
|
as this can make the left relation in each join small.
|
|
|
|
---------------------------------------
|
|
Relations as indices
|
|
---------------------------------------
|
|
|
|
Next, we need to understand how a single atom which generates new bindings is processed.
|
|
|
|
For the case of unification, it is simple: the right-hand side of the unification,
|
|
which is an expression with all variables bound, is simply evaluated, and the result is joined
|
|
to the current relation (as in a ``map-cat`` operation in functional languages).
|
|
|
|
For the case of the application of relations,
|
|
the first thing to understand is that all relations in Cozo are conceptually trees.
|
|
All the bindings of relations generated by inline or fixed rules,
|
|
and the keys of stored relations, act as a composite key for the tree.
|
|
The access complexity is therefore determined by whether a key component is bound.
|
|
For example, the following application::
|
|
|
|
a_rule['A', 'B', c]
|
|
|
|
with ``c`` unbound is very efficient, since this corresponds to a prefix scan in the tree with the key prefix ``['A', 'B']``,
|
|
whereas the following application::
|
|
|
|
a_rule[a, 'B', 'C']
|
|
|
|
where ``a`` is unbound is very expensive, since we must do a full relation scan.
|
|
On the other hand, if ``a`` is bound, then this is only a logarithmic-time check.
|
|
|
|
For stored relations, you need to check its schema for the order of keys to deduce the complexity.
|
|
The system op ``::explain`` may also give you some information.
|
|
|
|
---------------------------------------
|
|
Early stopping
|
|
---------------------------------------
|
|
|
|
Within each stratum, rows are generated in a streaming fashion.
|
|
For the entry rule ``?``, if ``:limit`` is specified as a query option,
|
|
a counter is used to monitor how many valid rows are already generated.
|
|
If enough rows are generated, the query stops.
|
|
Note that this only works when the entry rule is inline,
|
|
and when you are *not* specifying ``:order``. |