You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

327 lines
16 KiB
ReStructuredText

==============
Queries
==============
The Cozo database system is queried using the CozoScript language.
At its core, CozoScript is a `Datalog <https://en.wikipedia.org/wiki/Datalog>`_ dialect
supporting stratified negation and stratified recursive meet-aggregations.
The built-in native algorithms (mainly graph algorithms) further empower
CozoScript for much greater ease of use and much wider applicability.
A query consists of one or many named rules.
Each named rule conceptually represents a relation or a table with rows and columns.
The rule named ``?`` is called the entry to the query,
and its associated relation is returned as the result of the query.
Each named rule has associated with it a rule head, which names the columns of the relation,
and a rule body, which specifies the content of the relation, or how the content should be computed.
In CozoScript, relations (stored relations or relations defined by rules) abide by the *set semantics*,
meaning that even if a rule may compute a row multiple times, it will occur only once in the output.
This is in contradistinction to SQL.
There are two types of named rules in CozoScript: Horn-clause rules and algorithm applications.
You may think that there is a third type: constant rule, written as::
const_rule[a, b, c] <- [[1, 2, 3], [4, 5, 6]]
but this is actually syntax sugar for the algorithm application of the ``Constant`` algorithm.
-----------------
Horn-clause rules
-----------------
An example of a Horn-clause rule is::
hc_rule[a, e] := rule_a['constant_string', b], rule_b[b, d, a, e]
As can be seen, Horn-clause rules are distinguished by the symbol ``:=`` separating the rule head and rule body.
The rule body of a Horn-clause rule consists of multiple *atoms* joined by commas,
and is interpreted as representing the *conjunction* of these atoms.
^^^^^^^^^^^^^^
Atoms
^^^^^^^^^^^^^^
Atoms come in various flavours.
In the example above::
rule_a['constant_string', b]
is an atom representing a *rule application*: a rule named ``rule_a`` must exist in the same query
and have the correct arity (2 here).
Each row in the named rule is then *unified* with the bindings given as parameters in the square bracket:
here the first column is unified with a constant string, and unification succeeds only when the string
completely matches what is given;
the second column is unified with the *variable* ``b``,
and as the variable is fresh at this point (meaning that it first appears here),
the unification will always succeed and the variable will become *bound*:
from this point take on the value of whatever it was
unified with in the named relation.
When a bound variable is used again later, for example in ``rule_b[b, d, a, e]``, the variable ``b`` was bound
at this point, this unification will only succeed when the unified value is the same as the previously unified value.
In other words, repeated use of the same variable in named rules corresponds to inner joins in relational algebra.
Another flavour of atoms is the *stored relation*. It may be written similarly to a rule application::
:stored_relation[bind1, bind2]
with the colon in front of the stored relation name to distinguish it from rule application.
Written in this way, you must give as many bindings to the stored relation as its arity,
and the bindings proceed by argument positions, which may be cumbersome and error-prone.
So alternatively, you may use the fact that columns of a stored relation are always named and bind by name::
:stored_relation{col1: bind1, col2: bind2}
In this case, you only need to bind as many variables as you use.
If the name you want to give the binding is the same as the name of the column, you may use the shorthand notation:
``:stored_relation{col1}`` is the same as ``:stored_relation{col1: col1}``.
*Expressions* are also atoms, such as::
a > b + 1
Here ``a`` and ``b`` must be bound somewhere else in the rule, and the expression must evaluate to a boolean, and act as a *filter*: only rows where the expression evaluates to true are kept.
You can also use *unification atoms* to unify explicitly::
a = b + c + d
for such atoms, whatever appears on the left-hand side must be a single variable and is unified with the right-hand side.
This is different from the equality operator ``==``,
where both sides are merely required to be expressions.
When the left-hand side is a single *bound* variable,
it may be shown that the equality and the unification operators are semantically equivalent.
Another form of *unification atom* is the explicit multi-unification::
a in [x, y, z]
here the variable on the left-hand side of ``in`` is unified with each item on the right-hand side in turn,
which in turn implies that the right-hand side must evaluate to a list
(but may be represented by a single variable or a function call).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Head and returned relation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Atoms, as explained above, corresponds to either relations (or their projections) or filters in relational algebra.
Linked by commas, they, therefore, represent a joined relation, with named columns.
The *head* of the rule, which in the simplest case is just a list of variables,
then defines whichever columns to keep, and their order in the output relation.
Each variable in the head must be bound in the body, this is one of the *safety rules* of Datalog.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Multiple definitions and disjunction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For Horn-clause rules only, multiple rule definitions may share the same name,
with the requirement that the arity of the head in each definition must match.
The returned relation is then the *disjunction* of the multiple definitions,
which correspond to *union* in SQL.
*Intersect* in SQL can be written in CozoScript into a single rule since commas denote conjunction.
In complicated situations, you may instead write disjunctions in a single rule with the explicit ``or`` operator::
rule1[a, b] := rule2[a] or rule3[a], rule4[a, b]
For completeness, there is also an explicit ``and`` operator, but it is semantically identical to the comma, except that
it has higher operator precedence than ``or``, which in turn has higher operator precedence than the comma.
During evaluation, each rule is canonicalized into disjunction normal form
and each clause of the outmost disjunction is treated as a separate rule.
The consequence is that the safety rule may be violated
even though textually every variable in the head occurs in the body.
As an example::
rule[a, b] := rule1[a] or rule2[b]
is a violation of the safety rule since it is rewritten into two rules, each of which is missing a different binding.
^^^^^^^^^^^^^^^^
Negation
^^^^^^^^^^^^^^^^
Atoms in Horn clauses may be *negated* by putting ``not`` in front of them, as in::
not rule1[a, b]
When negating rule applications and stored relations,
at least one binding must be bound somewhere else in the rule in a non-negated context:
this is another safety rule of Datalog, and it ensures that the outputs of rules are always finite.
The unbound bindings in negated rules remain unbound: negation cannot introduce bound bindings to be used in the head.
Negated expressions act as negative filters,
which is semantically equivalent to putting ``!`` in front of the expression.
Since negation does not introduce new bindings,
unifications and multi-unifications are converted to equivalent expressions and then negated.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Recursion and stratification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The body of a Horn-clause rule may contain rule applications of itself,
and multiple Horn-clause rules may apply each other recursively.
The only exception is the entry rule ``?``, which cannot be referred to by other rules.
Self and mutual references allow recursion to be defined easily. To guard against semantically pathological cases,
recursion cannot occur in negated positions: the Russell-style rule ``r[a] := not r[a]`` is not allowed.
This requirement creates an ordering of the rules, since
negated rules must evaluate to completion before rules that apply them can start evaluation:
this is called *stratification* of the rules.
In cases where a total ordering cannot be defined since there exists a loop in the ordering
required by negation, the query is then deemed unstratifiable and Cozo will refuse to execute it.
Note that since CozoScript allows unifying fresh variables, you can still easily write programs that produce
infinite relations and hence cannot complete through recursion, but that are still accepted by the database.
One of the simplest examples is::
r[a] := a = 0
r[a] := r[b], a = b + 1
?[a] := r[a]
It is up to the user to ensure that such programs are not submitted to the database,
as it is not even in principle possible for the database to rule out such cases without wrongly rejecting valid queries.
If you accidentally submitted one, you can refer to the system ops section for how to terminate long-running queries.
Or you can give a timeout for the query when you submit.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Aggregation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
CozoScript supports aggregations, as does SQL, which provides a very useful extension to pure relational algebra.
In CozoScript, aggregations are specified for Horn-clause rules by applying aggregation operators to variables
in the rule head, as in::
?[department, count(employee)] := :personnel{department, employee}
here we have used the ``count`` operator familiar to all SQL users.
The semantics is that any variables in the head without aggregation operators are treated as *grouping variables*,
similar to what appears in a ``GROUP BY`` clause in SQL, and the aggregation is applied using the grouping variables
as keys. If you do not specify any grouping variables, then you get at most one row as the return value.
As we now understand, CozoScript follows relational algebra with set semantics.
With the introduction of aggregations, the situation is a little bit more complicated,
as aggregations are applied to the relation resulting from the body of the rule using bag semantics,
and the resulting relation of the rule, after aggregations are applied, follows set semantics.
The reason for this complication is that if aggregations are applied with set semantics, then the following query::
?[count(employee)] := :personnel{employee}
does not do what you expect: it either returns a row with a single value ``1`` if there are any matching rows,
or it returns nothing at all if the stored relation ``:personnel`` is empty.
Though semantically sound, this behaviour is not useful at all.
So for aggregations, we opt for bag semantics, and the query does what one expects.
If a rule has several definitions, they must have identical aggregations applied in the same positions,
otherwise, the query will be rejected.
The reason is that in complicated situations the semantics is ambiguous and counter-intuitive if we do allow it.
Existing database systems do not usually allow aggregations through recursion,
since in many cases, it is difficult to give useful semantics to such queries.
In CozoScript we allow aggregations for self-recursion for a limited subset of aggregation operators,
the so-called *meet aggregations*, such as the following example shows::
shortest_distance[destination, min(distance)] := route{source: 'A', destination, distance}
shortest_distance[destination, min(distance)] :=
shortest_distance[existing_node, prev_distance], # recursion
route{source: existing_node, distance: route_distance},
distance = prev_distance + route_distance
?[destination, min_distance] := shortest_distance[destination, min_distance]
this query computes the shortest distances from a node to all nodes using the ``min`` aggregation operator.
Concerning stratification, if a rule has aggregations in its head,
then any rule that contains it in an atom must be in a higher stratum,
unless that rule is the same rule (self-recursion) and all aggregations in its head are meet aggregations.
See the dedicated chapter for the aggregation operators available and more details of what "meet" aggregations mean.
----------------------------------
Algorithm application
----------------------------------
The final type of named rule, algorithm applications, is specified by the algorithm name,
take a specified number of named or stored relations as input relations, and have specific options that you provide.
The following query is a calculation of PageRank::
?[] <~ PageRank(:route[], theta: 0.5)
Algorithm applications are distinguished by the curly arrow ``<~``.
In the above example, the relation ``:route`` is the single input relation expected.
Algorithms do not care if an input relation is stored or results from a rule.
Each algorithm expects specific shapes of input relations,
for example, PageRank expects the first two columns of the relation to denote the source and destination
of links in a graph. You must consult the documentation for each algorithm to understand its API.
In algorithm applications, bindings for input relations are usually omitted, but sometimes if they are provided
they are interpreted and used in algorithm-specific ways, for example in the DFS algorithm bindings
can be used to construct an expression for testing the termination condition.
In the example, ``theta`` is a parameter of the algorithm, which is an expression evaluating to a constant.
Each algorithm expects specific parameter types; some parameters have default values and may be omitted.
Each algorithm has a determinate output arity. Usually, you omit the bindings in the rule head, as we do above,
but if you do provide bindings, the arities must match.
In terms of stratification, each algorithm application lives in its own stratum:
it is evaluated after all rules it depends on are completely evaluated,
and all rules depending on the output relation of an algorithm start evaluation only after complete evaluation
of the algorithm. In particular, unlike Horn-clause rules, there is no early termination even if the output relation
is for the entry rule.
-----------------------
Query options
-----------------------
Each query can have query options associated with it::
?[name] := :personnel{name}
:limit 10
:offset 20
In the example, ``:limit`` and ``:offset`` are query options, with familiar meanings from SQL.
All query options start with a single colon ``:``.
Queries options can appear before or after rules, or even sandwiched between rules.
Use this freedom for better readability.
Several query options deal with transactions for the database.
Those will be discussed in the chapter on stored relations and transactions.
Here we explain query options that exclusively affect the query itself.
.. module:: QueryOp
:noindex:
.. function:: :limit <N>
Limit output relation to at most ``<N>`` rows.
If possible, execution will stop as soon as this number of output rows is collected.
.. function:: :offset <N>
Skip the first ``<N>`` rows of the returned relation.
.. function:: :timeout <N>
If the query does not complete within ``<N>`` seconds, abort.
.. function:: :sort <SORT_ARG> (, <SORT_ARG>)*
Sort the output relation before applying other options or returning.
Specify ``<SORT_ARG>`` as they appear in the rule head of the entry, separated by commas.
You can optionally specify the sort direction of each argument by prefixing them with ``+`` or ``-``,
with minus denoting descending sort. As an example, ``:sort -count(employee), dept_name``
sorts by employee count descendingly first, then break ties with department name in ascending alphabetical order.
Note that your entry rule head must contain both ``dept_name`` and ``count(employee)``:
aggregations must be done in Horn-clause rules, not in output sorting. ``:order`` is an alias for ``:sort``.
.. function:: :assert none
With this option, the query returns nothing if the output relation is empty, otherwise execution aborts with an error.
Essential for transactions and triggers.
.. function:: :assert some
With this option, the query returns nothing if the output relation contains at least one row,
otherwise, execution aborts with an error.
Execution of the query stops as soon as the first row is produced if possible.
Essential for transactions and triggers.