You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

348 lines
17 KiB
ReStructuredText

==============
Queries
==============
2 years ago
The Cozo database system is queried using the CozoScript language.
At its core, CozoScript is a `Datalog <https://en.wikipedia.org/wiki/Datalog>`_ dialect
supporting stratified negation and stratified recursive meet-aggregations.
The built-in utilities and algorithms (mainly graph algorithms) further empower
2 years ago
CozoScript for much greater ease of use and much wider applicability.
2 years ago
2 years ago
A query consists of one or many named rules.
2 years ago
Each named rule conceptually represents a relation or a table with rows and columns.
2 years ago
The rule named ``?`` is called the entry to the query,
and its associated relation is returned as the result of the query.
Each named rule has associated with it a rule head, which names the columns of the relation,
and a rule body, which specifies the content of the relation, or how the content should be computed.
2 years ago
2 years ago
In CozoScript, relations (stored relations or relations defined by rules) abide by the *set semantics*,
2 years ago
meaning that even if a rule may compute a row multiple times, it will occur only once in the output.
This is in contradistinction to SQL.
There are two types of named rules in CozoScript:
*inline rules* distinguished by using ``:=`` to connect the head and the body,
and *fixed rules* distinguished by using ``<~`` to connect the head and the body.
You may think that *constant rules* with ``<-`` constitute a third type, written as::
2 years ago
const_rule[a, b, c] <- [[1, 2, 3], [4, 5, 6]]
but this is merely syntax sugar for the fixed rule of the ``Constant`` utility::
const_rule[a, b, c] <~ Constant(data: [[1, 2, 3], [4, 5, 6]])
2 years ago
-----------------
Inline rules
-----------------
An example of an inline rule is::
2 years ago
hc_rule[a, e] := rule_a['constant_string', b], rule_b[b, d, a, e]
The rule body of an inline rule consists of multiple *atoms* joined by commas,
2 years ago
and is interpreted as representing the *conjunction* of these atoms.
^^^^^^^^^^^^^^
Atoms
^^^^^^^^^^^^^^
2 years ago
Atoms come in various flavours.
2 years ago
In the example above::
rule_a['constant_string', b]
2 years ago
is an atom representing a *rule application*: a rule named ``rule_a`` must exist in the same query
2 years ago
and have the correct arity (2 here).
Each row in the named rule is then *unified* with the bindings given as parameters in the square bracket:
here the first column is unified with a constant string, and unification succeeds only when the string
2 years ago
completely matches what is given;
the second column is unified with the *variable* ``b``,
and as the variable is fresh at this point (meaning that it first appears here),
2 years ago
the unification will always succeed and the variable will become *bound*:
2 years ago
from this point take on the value of whatever it was
2 years ago
unified with in the named relation.
When a bound variable is used again later, for example in ``rule_b[b, d, a, e]``, the variable ``b`` was bound
at this point, this unification will only succeed when the unified value is the same as the previously unified value.
In other words, repeated use of the same variable in named rules corresponds to inner joins in relational algebra.
2 years ago
Another flavour of atoms is the *stored relation*. It may be written similarly to a rule application::
2 years ago
:stored_relation[bind1, bind2]
2 years ago
with the colon in front of the stored relation name to distinguish it from rule application.
Written in this way, you must give as many bindings to the stored relation as its arity,
and the bindings proceed by argument positions, which may be cumbersome and error-prone.
So alternatively, you may use the fact that columns of a stored relation are always named and bind by name::
2 years ago
:stored_relation{col1: bind1, col2: bind2}
2 years ago
In this case, you only need to bind as many variables as you use.
2 years ago
If the name you want to give the binding is the same as the name of the column, you may use the shorthand notation:
``:stored_relation{col1}`` is the same as ``:stored_relation{col1: col1}``.
2 years ago
*Expressions* are also atoms, such as::
2 years ago
a > b + 1
Here ``a`` and ``b`` must be bound somewhere else in the rule, and the expression must evaluate to a boolean,
and act as a *filter*: only rows where the expression evaluates to true are kept.
2 years ago
You can also use *unification atoms* to unify explicitly::
a = b + c + d
for such atoms,
whatever appears on the left-hand side must be a single variable and is unified with the right-hand side.
2 years ago
This is different from the equality operator ``==``,
2 years ago
where both sides are merely required to be expressions.
2 years ago
When the left-hand side is a single *bound* variable,
2 years ago
it may be shown that the equality and the unification operators are semantically equivalent.
Another form of *unification atom* is the explicit multi-unification::
a in [x, y, z]
2 years ago
here the variable on the left-hand side of ``in`` is unified with each item on the right-hand side in turn,
which in turn implies that the right-hand side must evaluate to a list
2 years ago
(but may be represented by a single variable or a function call).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Head and returned relation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Atoms, as explained above, corresponds to either relations (or their projections) or filters in relational algebra.
2 years ago
Linked by commas, they, therefore, represent a joined relation, with named columns.
2 years ago
The *head* of the rule, which in the simplest case is just a list of variables,
then defines whichever columns to keep, and their order in the output relation.
Each variable in the head must be bound in the body, this is one of the *safety rules* of Datalog.
Not all variables appearing in the body need to appear in the head.
2 years ago
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Multiple definitions and disjunction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For inline rules only, multiple rule definitions may share the same name,
2 years ago
with the requirement that the arity of the head in each definition must match.
The returned relation is then the *disjunction* of the multiple definitions,
which correspond to *union* in SQL.
2 years ago
*Intersect* in SQL can be written in CozoScript into a single rule since commas denote conjunction.
2 years ago
In complicated situations, you may instead write disjunctions in a single rule with the explicit ``or`` operator::
rule1[a, b] := rule2[a] or rule3[a], rule4[a, b]
For completeness, there is also an explicit ``and`` operator, but it is semantically identical to the comma,
except that
2 years ago
it has higher operator precedence than ``or``, which in turn has higher operator precedence than the comma.
During evaluation, each rule is canonicalized into
`disjunction normal form <https://en.wikipedia.org/wiki/Disjunctive_normal_form>`_
2 years ago
and each clause of the outmost disjunction is treated as a separate rule.
2 years ago
The consequence is that the safety rule may be violated
even though textually every variable in the head occurs in the body.
2 years ago
As an example::
rule[a, b] := rule1[a] or rule2[b]
is a violation of the safety rule since it is rewritten into two rules, each of which is missing a different binding.
^^^^^^^^^^^^^^^^
Negation
^^^^^^^^^^^^^^^^
Atoms in inline rules may be *negated* by putting ``not`` in front of them, as in::
2 years ago
not rule1[a, b]
2 years ago
When negating rule applications and stored relations,
at least one binding must be bound somewhere else in the rule in a non-negated context:
this is another safety rule of Datalog, and it ensures that the outputs of rules are always finite.
2 years ago
The unbound bindings in negated rules remain unbound: negation cannot introduce bound bindings to be used in the head.
2 years ago
Negated expressions act as negative filters,
2 years ago
which is semantically equivalent to putting ``!`` in front of the expression.
2 years ago
Since negation does not introduce new bindings,
unifications and multi-unifications are converted to equivalent expressions and then negated.
2 years ago
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Recursion and stratification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The body of an inline rule may contain rule applications of itself,
and multiple inline rules may apply each other recursively.
2 years ago
The only exception is the entry rule ``?``, which cannot be referred to by other rules.
2 years ago
Self and mutual references allow recursion to be defined easily. To guard against semantically pathological cases,
2 years ago
recursion cannot occur in negated positions: the Russell-style rule ``r[a] := not r[a]`` is not allowed.
This requirement creates an ordering of the rules, since
2 years ago
negated rules must evaluate to completion before rules that apply them can start evaluation:
2 years ago
this is called *stratification* of the rules.
2 years ago
In cases where a total ordering cannot be defined since there exists a loop in the ordering
2 years ago
required by negation, the query is then deemed unstratifiable and Cozo will refuse to execute it.
2 years ago
Note that since CozoScript allows unifying fresh variables, you can still easily write programs that produce
2 years ago
infinite relations and hence cannot complete through recursion, but that are still accepted by the database.
One of the simplest examples is::
r[a] := a = 0
r[a] := r[b], a = b + 1
?[a] := r[a]
2 years ago
It is up to the user to ensure that such programs are not submitted to the database,
2 years ago
as it is not even in principle possible for the database to rule out such cases without wrongly rejecting valid queries.
2 years ago
If you accidentally submitted one, you can refer to the system ops section for how to terminate long-running queries.
2 years ago
Or you can give a timeout for the query when you submit.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Aggregation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
CozoScript supports aggregations, as does SQL, which provides a very useful extension to pure relational algebra.
In CozoScript, aggregations are specified for inline rules by applying aggregation operators to variables
in the rule head, as in::
?[department, count(employee)] := :personnel{department, employee}
2 years ago
here we have used the ``count`` operator familiar to all SQL users.
The semantics is that any variables in the head without aggregation operators are treated as *grouping variables*,
similar to what appears in a ``GROUP BY`` clause in SQL, and the aggregation is applied using the grouping variables
as keys. If you do not specify any grouping variables, then you get at most one row as the return value.
2 years ago
As we now understand, CozoScript follows relational algebra with set semantics.
With the introduction of aggregations, the situation is a little bit more complicated,
as aggregations are applied to the relation resulting from the body of the rule using bag semantics,
2 years ago
and the resulting relation of the rule, after aggregations are applied, follows set semantics.
The reason for this complication is that if aggregations are applied with set semantics, then the following query::
?[count(employee)] := :personnel{employee}
2 years ago
does not do what you expect: it either returns a row with a single value ``1`` if there are any matching rows,
or it returns nothing at all if the stored relation ``:personnel`` is empty.
Though semantically sound, this behaviour is not useful at all.
2 years ago
So for aggregations, we opt for bag semantics, and the query does what one expects.
If a rule has several definitions, they must have identical aggregations applied in the same positions,
2 years ago
otherwise, the query will be rejected.
The reason is that in complicated situations the semantics is ambiguous and counter-intuitive if we do allow it.
2 years ago
Existing database systems do not usually allow aggregations through recursion,
since in many cases, it is difficult to give useful semantics to such queries.
In CozoScript we allow aggregations for self-recursion for a limited subset of aggregation operators,
the so-called *semi-lattice aggregations*, such as the following example shows::
shortest_distance[destination, min(distance)] := route{source: 'A', destination, distance}
2 years ago
shortest_distance[destination, min(distance)] :=
shortest_distance[existing_node, prev_distance], # recursion
route{source: existing_node, distance: route_distance},
distance = prev_distance + route_distance
?[destination, min_distance] := shortest_distance[destination, min_distance]
this query computes the shortest distances from a node to all nodes using the ``min`` aggregation operator.
2 years ago
Concerning stratification, if a rule has aggregations in its head,
then any rule that contains it in an atom must be in a higher stratum,
unless that rule is the same rule (self-recursion) and all aggregations in its head are semi-lattice aggregations.
Consult the dedicated chapter for the aggregation operators available.
----------------------------------
Fixed rules
----------------------------------
The body of a fixed rule starts with the name of the utility or algorithm being applied,
then takes a specified number of named or stored relations as its *input relations*,
followed by *options* that you provide.
The following query is a calculation of PageRank::
?[] <~ PageRank(:route[], theta: 0.5)
In the above example, the relation ``:route`` is the single input relation expected.
Input relations may be stored relations or relations resulting from rules.
Each utility/algorithm expects specific shapes of input relations,
for example, PageRank expects the first two columns of the relation to denote the source and destination
of links in a graph. You must consult the documentation for each utility/algorithm to understand its API.
In fixed rules, bindings for input relations are usually omitted, but sometimes if they are provided
they are interpreted and used in case-specific ways, for example in the DFS algorithm bindings
2 years ago
can be used to construct an expression for testing the termination condition.
In the example given above, ``theta`` is an option of the algorithm,
which is required by the API to be an expression evaluating to a constant.
Each utility/algorithm expects specific types for the options;
some options have default values and may be omitted.
Each fixed rule has a determinate output arity,
deduced from the specific utility/algorithm being applied and the options given.
Usually, you omit the bindings in the rule head, as we do above,
but if you do provide bindings, the arities must match.
In terms of stratification, each fixed rule lives in its own stratum:
2 years ago
it is evaluated after all rules it depends on are completely evaluated,
and all rules depending on the output relation of a fiex rule start evaluation only after complete evaluation
of the fixed rule.
In particular, unlike inline rules, there is no early termination even if the output relation
is for the entry rule.
-----------------------
2 years ago
Query options
-----------------------
Each query can have query options associated with it::
?[name] := :personnel{name}
:limit 10
:offset 20
2 years ago
In the example, ``:limit`` and ``:offset`` are query options, with familiar meanings from SQL.
All query options start with a single colon ``:``.
Queries options can appear before or after rules, or even sandwiched between rules.
2 years ago
Use this freedom for better readability.
2 years ago
Several query options deal with transactions for the database.
Those will be discussed in the chapter on stored relations and transactions.
Here we explain query options that exclusively affect the query itself.
.. module:: QueryOp
:noindex:
.. function:: :limit <N>
2 years ago
Limit output relation to at most ``<N>`` rows.
If possible, execution will stop as soon as this number of output rows is collected.
.. function:: :offset <N>
Skip the first ``<N>`` rows of the returned relation.
.. function:: :timeout <N>
Abort if the query does not complete within ``<N>`` seconds.
Seconds may be specified as an expression so that random timeouts are possible.
.. function:: :sleep <N>
If specified, the query will wait for ``<N>`` seconds after completion,
before committing or proceeding to the next query.
Seconds may be specified as an expression so that random timeouts are possible.
This is useful for deliberately interleaving concurrent queries to test for complex logic.
.. function:: :sort <SORT_ARG> (, <SORT_ARG>)*
2 years ago
Sort the output relation before applying other options or returning.
Specify ``<SORT_ARG>`` as they appear in the rule head of the entry, separated by commas.
You can optionally specify the sort direction of each argument by prefixing them with ``+`` or ``-``,
2 years ago
with minus denoting descending sort. As an example, ``:sort -count(employee), dept_name``
sorts by employee count descendingly first, then break ties with department name in ascending alphabetical order.
2 years ago
Note that your entry rule head must contain both ``dept_name`` and ``count(employee)``:
aggregations must be done in inline rules, not in output sorting. ``:order`` is an alias for ``:sort``.
.. function:: :assert none
With this option, the query returns nothing if the output relation is empty, otherwise execution aborts with an error.
Essential for transactions and triggers.
.. function:: :assert some
2 years ago
With this option, the query returns nothing if the output relation contains at least one row,
otherwise, execution aborts with an error.
Execution of the query stops as soon as the first row is produced if possible.
Essential for transactions and triggers.