==============
Aggregations
==============
Aggregations in Cozo can be thought of as a function that acts on a stream of values
and produces a single value (the aggregate).
There are two kinds of aggregations in Cozo, *ordinary aggregations* and *semi-lattice aggregations*.
They are implemented differently in Cozo, with semi-lattice aggregations generally faster and more powerful
(only the latter can be used recursively).
The power of semi-lattice aggregations derives from the additional properties they satisfy by forming a `semilattice <https://en.wikipedia.org/wiki/Semilattice>`_:
idempotency
    the aggregate of a single value ``a`` is ``a`` itself,
commutativity
    the aggregate of ``a`` then ``b`` is the same as the aggregate of ``b`` then ``a``,
associativity
    it is immaterial where we put the parentheses in an aggregate application.
Semi-lattice aggregations can be used as ordinary ones, but the reverse is impossible.
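For example, ``min`` is a semi-lattice aggregation. A minimal sketch, assuming a hypothetical stored relation ``*people`` with columns ``dept`` and ``age``::

    ?[dept, min(age)] := *people{dept, age}

Because ``min`` is idempotent, commutative and associative, the same operator may also be applied in recursive rules.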
------------------------------------
Semi-lattice aggregations
------------------------------------
.. module:: Aggr.SemiLattice
:noindex:
.. function:: min(x)
.. function:: choice(var)
Non-deterministically chooses one of the values of ``var`` as the aggregate.
It simply chooses the first value it meets (the order that it meets values is non-deterministic).
.. function:: choice_last(var)
Non-deterministically chooses one of the values of ``var`` as the aggregate.
It simply chooses the last value it meets.
.. function:: min_cost([data, cost])
------------------------
Ordinary aggregations
------------------------
.. function:: group_count(var)
Count the occurrence of unique values of ``var``, putting the result into a list of lists,
e.g. when applied to ``'a'``, ``'b'``, ``'c'``, ``'c'``, ``'a'``, ``'c'``, the result is ``[['a', 2], ['b', 1], ['c', 3]]``.
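A minimal sketch (the stored relation ``*posts`` and its column ``tag`` are hypothetical); the result is a single row holding the list of ``[tag, count]`` pairs::

    ?[group_count(tag)] := *posts{tag}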
.. function:: bit_xor(var)
.. function:: latest_by([data, time])
The argument should be a list of two elements and this aggregation returns the ``data`` of the maximum ``time``.
This is very similar to ``min_cost``, the differences being that maximum instead of minimum is used,
only the data itself is returned, and the aggregation is deliberately not a semi-lattice aggregation. Intended to be used in timestamped audit trails.
.. function:: choice_rand(var)
Non-deterministically chooses one of the values of ``var`` as the aggregate.
Each value the aggregation encounters has the same probability of being chosen.
.. NOTE::
This version of ``choice`` is not a semi-lattice aggregation
since it is impossible to satisfy the uniform sampling requirement while maintaining no state,
which is an implementation restriction unlikely to be lifted.
^^^^^^^^^^^^^^^^^^^^^^^^^
Statistical aggregations
^^^^^^^^^^^^^^^^^^^^^^^^^

==============================
Utilities and algorithms
==============================
Fixed rules in CozoScript apply utilities or algorithms.
.. module:: Algo
:noindex:
---------------
Utilities
---------------
.. function:: Constant(data: [...])
Returns a constant relation containing the data passed in. The constant rule ``?[] <- ...`` is
syntax sugar for ``?[] <~ Constant(data: ...)``.
:param data: A list of lists, representing the rows of the returned relation.

==============
Types
==============
--------------
Runtime types
--------------
Values in Cozo have the following *runtime types*:
* ``Null``
* ``Bool``
* ``Number``
* ``String``
* ``Bytes``
* ``Uuid``
* ``List``
Cozo sorts values according to the above order, e.g. ``null`` is smaller than ``true``, which is in turn smaller than the list ``[]``.
Within each type values are *compared* according to:
* ``false < true``;
* ``-1 == -1.0 < 0 == 0.0 < 0.5 == 0.5 < 1 == 1.0``;
* Lists are ordered lexicographically by their elements;
* Bytes are compared lexicographically;
* Strings are compared lexicographically by their UTF-8 byte representations;
* UUIDs are sorted in a way that UUIDv1 with similar timestamps are near each other.
This is to improve data locality and should be considered an implementation detail.
Relying on the order of UUIDs in your application is not recommended.
.. WARNING::
``1 == 1.0`` evaluates to ``true``, but ``1`` and ``1.0`` are distinct values,
meaning that a relation can contain both as keys according to set semantics.
This is especially confusing when using JavaScript, which converts all numbers to floats,
and Python, which does not show the difference between the two when printing.
Using floating point numbers in keys is not recommended if the rows are accessed by these keys
(instead of by iteration).
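For example, the following sketch (the stored relation ``*readings`` and its column ``val`` are hypothetical) coerces the grouping key to float, so that ``1`` and ``1.0`` fall into the same group::

    ?[k, count(val)] := *readings{val}, k = to_float(val)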
----------------
Literals
----------------
The standard notations ``null`` for the type ``Null``, ``false`` and ``true`` for the type ``Bool`` are used.
Besides the usual decimal notation for signed integers,
you can prefix a number with ``0x`` or ``-0x`` for hexadecimal representation,
with ``0o`` or ``-0o`` for octal,
or with ``0b`` or ``-0b`` for binary.
Floating point numbers include the decimal dot (may be trailing),
and may be in scientific notation.
All numbers may include underscores ``_`` in their representation for clarity.
For example, ``299_792_458`` is the speed of light in meters per second.
Strings can be typed in the same way as in JSON, using double quotes ``""``,
with the same escape rules.
You can also use single quotes ``''`` in which case the roles of double quotes and single quotes are switched.
There is also a "raw string" notation::
r___"I'm a raw string with "quotes"!"___
___"I'm a raw string"___
A raw string starts with an arbitrary number of underscores, and then a double quote.
It terminates when followed by a double quote and the same number of underscores.
Everything in between is interpreted exactly as typed, including any newlines.
By varying the number of underscores, you can represent any string without quoting.
There is no literal representation for ``Bytes`` or ``Uuid``.
Use the appropriate functions to create them.
If you are inserting data into a stored relation with a column specified to contain bytes or UUIDs,
auto-coercion will kick in and use ``decode_base64`` and ``to_uuid`` for conversion.
Lists are items enclosed between square brackets ``[]``, separated by commas.
A trailing comma is allowed after the last item.
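As an illustrative sketch, the following constant rule combines several of the literal forms described above in a single row::

    ?[] <- [[null, true, -0x2a, 299_792_458, 1.5e-3, 'single-quoted', ___"raw"___, [1, 2,]]]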

====================================
Query execution
====================================
Databases often consider how queries are executed an implementation detail
hidden behind an abstraction barrier that users need not care about,
so that they can use query optimizers to choose the best query execution plan
regardless of how the query was originally written.
This abstraction barrier is leaky, however,
since bad query execution plans invariably occur,
and users need to "reach behind the curtain" to fix performance problems,
which is a difficult and tiring task.
The problem becomes more severe the more joins a query contains,
and graph queries tend to contain a large number of joins.
So in Cozo we take the pragmatic approach and make query execution deterministic
and easy to tell from how the query was written.
The flip side is that we demand that the user
know the best way to store their data,
which is in general less demanding than coercing the query optimizer.
Then, armed with knowledge of this chapter, writing efficient queries is easy.
--------------------------------------
Disjunctive normal form
--------------------------------------
Evaluation starts by canonicalizing inline rules into
`disjunctive normal form <https://en.wikipedia.org/wiki/Disjunctive_normal_form>`_,
i.e., a disjunction of conjunctions, with any negation pushed to the innermost level.
Each clause of the outermost disjunction is then treated as a separate rule.
The consequence is that the safety rule may be violated
even though textually every variable in the head occurs in the body.
As an example::
rule[a, b] := rule1[a] or rule2[b]
is a violation of the safety rule since it is rewritten into two rules, each of which is missing a different binding.
--------------------------------------
Stratification
--------------------------------------
The next step in the processing is *stratification*.
It begins by making a graph of the named rules,
with the rules themselves as nodes,
and a link is added between two nodes when one of the rules applies the other.
This application is through atoms for inline rules, and input relations for fixed rules.
Next, some of the links are labelled *stratifying*:
* when an inline rule applies another rule through negation,
* when an inline rule applies another inline rule that contains aggregations,
* when an inline rule applies itself and it has non-semi-lattice aggregations,
* when an inline rule applies another rule which is a fixed rule,
* when a fixed rule has another rule as an input relation.
The strongly connected components of the graph of rules are then determined and tested,
and if it is found that some strongly connected component contains a stratifying link,
the graph is deemed *unstratifiable*, and the execution aborts.
Otherwise, Cozo will topologically sort the strongly connected components to
determine the strata of the rules:
rules within the same stratum are logically executed together,
and no two rules within the same stratum can have a stratifying link between them.
You can see the stratum number assigned to rules by using the ``::explain`` system op.
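For instance, in the following sketch (the stored relation ``*events`` is hypothetical), the negation forces ``flagged`` into a stratum that is evaluated to completion before ``safe``::

    flagged[id] := *events{id, kind: 'bad'}
    safe[id] := *events{id}, not flagged[id]
    ?[id] := safe[id]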
--------------------------------------
Magic set rewrites
--------------------------------------
Within each stratum, the input rules are rewritten using the technique of *magic sets*.
This rewriting ensures that the query execution does not
waste time calculating results that are later simply discarded.
As an example, consider::
reachable[a, b] := link[a, b]
reachable[a, b] := reachable[a, c], link[c, b]
?[r] := reachable['A', r]
Without magic set rewrites, the whole ``reachable`` relation is generated first,
then most of its rows are thrown away, keeping only those starting from ``'A'``.
Magic set rewriting avoids this problem.
You can see the result of the rewriting using ``::explain``.
The rewritten query is guaranteed to yield the same relation for ``?``,
and will in general yield fewer intermediate rows.
The rewrite currently only applies to inline rules without aggregations.
--------------------------------------
Semi-naïve evaluation
--------------------------------------
Now each stratum contains either a single fixed rule or a set of inline rules.
A stratum containing a single fixed rule is executed by running the rule's specific implementation.
For the inline rules, each of them is assigned an output relation.
Assuming we know how to evaluate each rule given all the relations it depends on,
the semi-naïve algorithm can now be applied to the rules to yield all output rows.
The semi-naïve algorithm is a bottom-up evaluation strategy, meaning that it tries to deduce
all facts from a set of given facts.
.. NOTE::
By contrast, top-down strategies start with stated goals and try to find proof for the goals.
Bottom-up strategies have many advantages over top-down ones when the whole output of each rule
is needed, but may waste time generating unused facts if only some of the output is kept.
Magic set rewrites are introduced to eliminate precisely this weakness.
---------------------------------------
Ordering of atoms
---------------------------------------
The compiler reorders the atoms in the body of the inline rules, and then
the atoms are evaluated.
After conversion to disjunctive normal forms,
each atom can only be one of the following:
* an explicit unification,
* applying a rule or a stored relation,
* an expression acting as a filter,
* a negation of an application.
The first two cases may introduce fresh bindings, whereas the last two cannot.
The reordering makes all atoms that introduce new bindings stay where they are,
whereas all atoms that do not introduce new bindings are moved to the earliest possible place
where all their bindings are bound.
All atoms that introduce bindings correspond to
joining with a pre-existing relation followed by projections
in relational algebra, and all atoms that do not correspond to filters.
By applying filters as early as possible,
we minimize the number of rows before joining them with the next relation.
This procedure is completely deterministic.
When writing the body of rules, we should aim to minimize the total number of rows generated.
A strategy that works almost in all cases is to put the most restrictive atoms which generate new bindings first.
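As a sketch (the stored relation ``*lives_in`` is hypothetical), the first atom below is the most restrictive one, so the left relation of the join stays small::

    ?[neighbour] := *lives_in{person: 'alice', city},
                    *lives_in{person: neighbour, city},
                    neighbour != 'alice'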
---------------------------------------
Evaluating atoms
---------------------------------------
We now explain how a single atom which generates new bindings is processed.
For unifications, the right-hand side, an expression with all variables bound,
is simply evaluated, and the result is joined
to the current relation (as in a ``map-cat`` operation in functional languages).
Rules or stored relations are conceptually trees, with composite keys sorted lexicographically.
The complexity of their applications in atoms
is therefore determined by whether the bound variables and constants in the application bindings form a *key prefix*.
For example, the following application::
a_rule['A', 'B', c]
with ``c`` unbound, is very efficient, since this corresponds to a prefix scan in the tree with the key prefix ``['A', 'B']``,
whereas the following application::
a_rule[a, 'B', 'C']
where ``a`` is unbound, is very expensive, since we must do a full scan.
On the other hand, if ``a`` is bound, then this is only a logarithmic-time existence check.
For stored relations, you need to check their schema for the order of keys to deduce the complexity.
The system op ``::explain`` may also give you some information.
---------------------------------------
Early stopping
---------------------------------------
Within each stratum, rows are generated in a streaming fashion.
For the entry rule ``?``, if ``:limit`` is specified as a query option,
a counter is used to monitor how many valid rows are already generated.
If enough rows are generated, the query stops.
This only works when the entry rule is inline
and you do not specify ``:order``.
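A sketch (``*big_rel`` is hypothetical and assumed to have arity 1) where evaluation stops as soon as ten rows are produced::

    ?[x] := *big_rel[x]

    :limit 10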

=========
Functions
=========
Functions can be used to build expressions.
Internally, all function arguments are partially evaluated before binding variables to input tuples. For example, the regular expression in ``regex_matches(var, '[a-zA-Z]+')`` will only be compiled once during the execution of the query, instead of being repeatedly compiled for every input tuple.
In the following, all functions except those having names starting with ``rand_`` are deterministic.
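For example (the stored relation ``*docs`` and its column ``content`` are hypothetical), thanks to partial evaluation the regular expression below is compiled only once for the whole query::

    ?[content] := *docs{content}, regex_matches(content, '[a-zA-Z]+')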
------------------------
Equality and Comparisons
------------------------
.. function:: eq(x, y)
Equality comparison. The operator form is ``x == y``. The two arguments of the equality can be of different types, in which case the result is ``false``.
.. function:: neq(x, y)
Equivalent to ``x <= y``
.. NOTE::
The four comparison operators can only compare values of the same runtime type. Integers and floats are of the same type ``Number``.
.. function:: max(x, ...)
------------------------
Mathematics
------------------------
.. function:: sin(x)
The sine trigonometric function.
.. function:: cos(x)
The cosine trigonometric function.
.. function:: tan(x)
The tangent trigonometric function.
.. function:: asin(x)
.. function:: haversine(a_lat, a_lon, b_lat, b_lon)
Computes with the `haversine formula <https://en.wikipedia.org/wiki/Haversine_formula>`_
the angle measured in radians between two points ``a`` and ``b`` on a sphere
specified by their latitudes and longitudes. The inputs are in radians.
You probably want the next function when you are dealing with maps,
since most maps measure angles in degrees instead of radians.
.. function:: haversine_deg_input(a_lat, a_lon, b_lat, b_lon)
Same as the previous function, but the inputs are in degrees instead of radians. The return value is still in radians. If you want the approximate distance measured on the surface of the earth instead of the angle between two points, multiply the result by the radius of the earth, which is about ``6371`` kilometres, ``3959`` miles, or ``3440`` nautical miles.
Same as the previous function, but the inputs are in degrees instead of radians.
The return value is still in radians.
If you want the approximate distance measured on the surface of the earth instead of the angle between two points,
multiply the result by the radius of the earth,
which is about ``6371`` kilometres, ``3959`` miles, or ``3440`` nautical miles.
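For instance, a sketch computing the approximate distance in kilometres between Paris and London (coordinates rounded), yielding roughly ``344``::

    ?[dist_km] := dist_km = 6371 * haversine_deg_input(48.86, 2.35, 51.51, -0.13)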
.. NOTE::
The haversine formula, when applied to the surface of the earth, which is not a perfect sphere, can result in an error of less than one percent.
------------------------
String functions
------------------------
Can also be applied to a list or a byte array.
.. WARNING::
``length(str)`` does not return the number of bytes of the string representation.
Also, what is returned depends on the normalization of the string.
So if such details are important, apply ``unicode_normalize`` before ``length``.
.. function:: concat(x, ...)
.. function:: starts_with(x, y)
Tests if ``x`` starts with ``y``.
.. TIP::
``starts_with(var, str)`` is preferred over equivalent (e.g. regex) conditions,
since the compiler may more easily compile the clause into a range scan.
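For instance, a sketch assuming a hypothetical stored relation ``*people`` keyed by ``name``, which the compiler may turn into a range scan::

    ?[name] := *people{name}, starts_with(name, 'Jo')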
.. function:: ends_with(x, y)
.. function:: unicode_normalize(str, norm)
Converts ``str`` to the `normalization <https://en.wikipedia.org/wiki/Unicode_equivalence>`_ specified by ``norm``.
The valid values of ``norm`` are ``'nfc'``, ``'nfd'``, ``'nfkc'`` and ``'nfkd'``.
.. function:: chars(str)
Combines the strings in ``list`` into a big string. In a sense, it is the inverse function of ``chars``.
.. WARNING::
If you want substring slices, indexing strings, etc., first convert the string to a list with ``chars``,
do the manipulation on the list, and then recombine with ``from_substring``.
--------------------------
List functions
--------------------------
.. function:: get(l, n)
Returns the element at index ``n`` in the list ``l``. Raises an error if the access is out of bounds. Indices start with 0.
.. function:: maybe_get(l, n)
Returns the element at index ``n`` in the list ``l``. Returns ``null`` if the access is out of bounds. Indices start with 0.
.. function:: length(list)
.. function:: slice(l, start, end)
Returns the slice of list between the index ``start`` (inclusive) and ``end`` (exclusive).
Negative numbers may be used, and are interpreted as counting from the end of the list.
E.g. ``slice([1, 2, 3, 4], 1, 3) == [2, 3]``, ``slice([1, 2, 3, 4], 1, -1) == [2, 3]``.
.. function:: concat(x, ...)
--------------------------
Binary functions
--------------------------

.. function:: encode_base64(b)
Encodes the byte array ``b`` into the `Base64 <https://en.wikipedia.org/wiki/Base64>`_-encoded string.
.. NOTE::
``encode_base64`` is automatically applied when output to JSON since JSON cannot represent bytes natively.
.. function:: decode_base64(str)

==============
Queries
==============
CozoScript, a `Datalog <https://en.wikipedia.org/wiki/Datalog>`_ dialect, is the query language of Cozo.
A CozoScript query consists of one or many named rules.
Each named rule represents a *relation*, i.e. a collection of data divided into rows and columns.
The rule named ``?`` is the *entry* to the query,
and the relation it represents is the result of the query.
Each named rule has a rule head, which corresponds to the columns of the relation,
and a rule body, which specifies the content of the relation, or how the content should be computed.
Relations in Cozo (stored or otherwise) abide by the *set semantics*.
Thus even if a rule computes a row multiple times,
the resulting relation only contains a single copy.
There are two types of named rules in CozoScript:
* *Inline rules*, distinguished by using ``:=`` to connect the head and the body.
The logic used to compute the resulting relation is defined *inline*.
* *Fixed rules*, distinguished by using ``<~`` to connect the head and the body.
The logic used to compute the resulting relation is *fixed* according to which algorithm or utility is requested.
The *constant rules* which use ``<-`` to connect the head and the body are syntax sugar. For example::
const_rule[a, b, c] <- [[1, 2, 3], [4, 5, 6]]
is identical to::
const_rule[a, b, c] <~ Constant(data: [[1, 2, 3], [4, 5, 6]])
^^^^^^^^^^^^^^^^
Atoms
^^^^^^^^^^^^^^^^

The simplest atoms are applications of rules, written with the bindings in square brackets::

    rule_a['constant_string', b]

Each row in the named rule is then *unified* with the bindings given as parameters:
here the first column is unified with a constant string, and unification succeeds only when the string
completely matches what is given;
the second column is unified with the *variable* ``b``,
and as the variable is fresh at this point (because this is its first appearance),
the unification will always succeed. For subsequent atoms, the variable becomes *bound*:
it takes on the value of whatever it was
unified with in the named relation.
When a bound variable is used again later, for example in ``rule_b[b, d, a, e]``, where the variable ``b`` is already bound,
the unification will only succeed when the unified value is the same as the previously unified value.
In other words, repeated use of the same variable in named rules corresponds to inner joins in relational algebra.
Atoms representing applications of *stored relations* are written as::
*stored_relation[bind1, bind2]
with the asterisk before the name.
Written in this way using square brackets, as many bindings as the arity of the stored relation must be given.
You can also bind columns by name::
*stored_relation{col1: bind1, col2: bind2}
In this form, any number of columns may be omitted.
If the name you want to give the binding is the same as the name of the column, you can write instead
``*stored_relation{col1}``, which is the same as ``*stored_relation{col1: col1}``.
*Expressions* are also atoms, such as::
a > b + 1
``a`` and ``b`` must be bound somewhere else in the rule. Expression atoms must evaluate to booleans,
and act as *filters*. Only rows where the expression atom evaluates to ``true`` are kept.
*Unification atoms* unify explicitly::
a = b + c + d
Whatever appears on the left-hand side must be a single variable and is unified with the result of the right-hand side.
.. NOTE::
This is different from the equality operator ``==``,
where the left-hand side is a completely bound expression.
When the left-hand side is a single *bound* variable,
the equality and the unification operators are equivalent.
*Unification atoms* can also unify with multiple values in a list::
a in [x, y, z]
here the variable on the left-hand side of ``in`` is unified with each item of the right-hand side in turn;
the right-hand side may be given by any expression, for example a single variable or a function call.
If the right-hand side does not evaluate to a list, an error is raised.
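A minimal sketch; the result contains the three rows ``1``, ``2`` and ``3``::

    ?[a] := a in [1, 2, 3]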
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Head
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As explained above, atoms correspond to either relations, projections or filters in relational algebra.
Linked by commas, they therefore represent a joined relation, with columns either constants or variables.
The *head* of the rule, which in the simplest case is just a list of variables,
then defines the columns to keep in the output relation and their order.
Each variable in the head must be bound in the body (the *safety rule*).
Not all variables appearing in the body need to appear in the head.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Multiple definitions and disjunction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For inline rules only, multiple rule definitions may share the same name,
with the requirement that the arity of the head in each definition must match.
The returned relation is then formed by the *disjunction* of the multiple definitions (a *union* of rows).
You may also use the explicit disjunction operator ``or`` in a single rule definition::
rule1[a, b] := rule2[a] or rule3[a], rule4[a, b]
There is also an ``and`` operator, semantically identical to the comma ``,``,
except that it has higher operator precedence than ``or`` (the comma has the lowest precedence).
^^^^^^^^^^^^^^^^
Negation
^^^^^^^^^^^^^^^^
Atoms in inline rules may be *negated* by putting ``not`` in front of them::
not rule1[a, b]
When negating rule applications and stored relations,
at least one binding must be bound somewhere else in the rule in a non-negated context (another *safety rule*).
The unbound bindings in negated rules remain unbound: negation cannot introduce new bindings to be used in the head.
Negated expressions act as negative filters,
which is semantically equivalent to putting ``!`` in front of the expression.
Explicit unification cannot be negated unless the left-hand side is bound,
in which case it is treated as an expression atom and then negated.
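For instance, in the following sketch (both stored relations are hypothetical), ``id`` is bound by the non-negated atom, so the negation acts purely as a filter::

    ?[id] := *users{id}, not *banned{id}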
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Recursion and stratification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The body of an inline rule may contain rule applications of itself,
and multiple inline rules may apply each other recursively.
The only exception is the entry rule ``?``, which cannot be referred to by other rules including itself.
Recursion cannot occur in negated positions (*safety rule*): ``r[a] := not r[a]`` is not allowed.
.. WARNING::
As CozoScript allows explicit unification,
queries that produce infinite relations may be accepted by the compiler.
One of the simplest examples is::
r[a] := a = 0
r[a] := r[b], a = b + 1
?[a] := r[a]
It is not even in principle possible for Cozo to rule out all infinite queries without wrongly rejecting valid ones.
If you accidentally submit one, refer to the system ops chapter for how to terminate queries.
Alternatively, you can give a timeout for the query when you submit.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Aggregation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In CozoScript, aggregations are specified for inline rules by applying *aggregation operators* to variables
in the rule head::
?[department, count(employee)] := *personnel{department, employee}
here we have used the familiar ``count`` operator.
Any variables in the head without aggregation operators are treated as *grouping variables*,
and aggregation is applied using them as keys.
If you do not specify any grouping variables, then the resulting relation contains at most one row.
Aggregation operators are applied to the rows computed by the body of the rule using bag semantics.
The reason for using bag semantics is that if aggregations were applied with set semantics, the following query::
?[count(employee)] := *personnel{employee}
does not do what you expect: it either returns a row with a single value ``1`` if there are any matching rows,
or it returns nothing at all if the stored relation is empty.
If a rule has several definitions, they must have identical aggregations applied in the same positions.
Existing database systems do not usually allow aggregations through recursion,
since in many cases, it is difficult to give useful semantics to such queries.
Cozo allows aggregations for self-recursion for a limited subset of aggregation operators,
the so-called *semi-lattice aggregations*::
shortest_distance[destination, min(distance)] := *route{source: 'A', destination, distance}
shortest_distance[destination, min(distance)] :=
shortest_distance[existing_node, prev_distance], # recursion
*route{source: existing_node, destination, distance: route_distance},
distance = prev_distance + route_distance
?[destination, min_distance] :=
shortest_distance[destination, min_distance]
Here the self-recursion of ``shortest_distance`` contains the ``min`` aggregation.
Consult the dedicated chapter for the aggregation operators available.
----------------------------------
Fixed rules
----------------------------------
The body of a fixed rule starts with the name of the utility or algorithm being applied,
then takes a specified number of named or stored relations as its *input relations*,
followed by *options* that you provide.
For example::
?[] <~ PageRank(*route[], theta: 0.5)
Input relations may be stored relations or relations resulting from rules.
Each utility/algorithm expects specific shapes for their input relations.
You must consult the documentation for each utility/algorithm to understand its API.
In fixed rules, bindings for input relations are usually omitted, but sometimes if they are provided
they are interpreted and used in algorithm-specific ways;
in the DFS algorithm, for example, bindings can be used to construct an expression for testing the termination condition.
In the example above, ``theta`` is an option of the algorithm,
which is required by the API to be an expression evaluating to a constant.
Each utility/algorithm expects specific types for the options;
some options have default values and may be omitted.
Each fixed rule has a determinate output arity.
Thus, the bindings in the rule head can be omitted,
but if they are provided, you must abide by the arity.
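For instance, a sketch naming the two output columns of the earlier PageRank example (an output arity of two is assumed here)::

    ?[node, rank] <~ PageRank(*route[])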
-----------------------
Query options
-----------------------
Each query can have options associated with it::
?[name] := *personnel{name}
:limit 10
:offset 20
In the example, ``:limit`` and ``:offset`` are query options with familiar meanings.
All query options start with a single colon ``:``.
Query options can appear before or after rules, or even sandwiched between rules.
Use this freedom for better readability.
Several query options deal with transactions for the database.
Those will be discussed in the chapter on stored relations and transactions.
The rest of the query options are explained in the following.
.. module:: QueryOp
:noindex:
.. function:: :sleep <N>
If specified, the query will wait for ``<N>`` seconds after completion,
before committing or proceeding to the next query.
Seconds may be specified as an expression so that random timeouts are possible.
Useful for deliberately interleaving concurrent queries to test complex logic.
.. function:: :sort <SORT_ARG> (, <SORT_ARG>)*
Sort the output relation. If ``:limit`` or ``:offset`` are specified, they are applied after ``:sort``.
Specify ``<SORT_ARG>`` as they appear in the rule head of the entry, separated by commas.
You can optionally specify the sort direction of each argument by prefixing them with ``+`` or ``-``,
with minus denoting descending order, e.g. ``:sort -count(employee), dept_name``
sorts by employee count in reverse order first,
then breaks ties with department name in ascending alphabetical order.
.. WARNING::
Aggregations must be done in inline rules, not in output sorting. In the above example,
the entry rule head must contain ``count(employee)``; ``employee`` alone is not acceptable.
.. function:: :order <SORT_ARG> (, <SORT_ARG>)*
Alias for ``:sort``.
.. function:: :assert none
The query returns nothing if the output relation is empty, otherwise execution aborts with an error.
Useful for transactions and triggers.
.. function:: :assert some
The query returns nothing if the output relation contains at least one row,
otherwise, execution aborts with an error.
Execution of the query stops as soon as the first row is produced if possible.
Useful for transactions and triggers.
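For instance, a sketch (the stored relation ``*users`` and the parameter ``$id`` are hypothetical) that aborts the enclosing transaction if the given ``id`` is already taken::

    ?[id] := *users{id: $id}

    :assert none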

===============
Getting started
===============
Welcome to the Cozo Manual. The latest version of this manual can be read at https://cozodb.github.io/current/manual.
Alternatively, you can download a PDF version for offline viewing at https://cozodb.github.io/current/manual.pdf.
This manual touches upon all features currently implemented in Cozo,
though the coverage of some topics may be sketchy at this stage.
This manual assumes that you already know the basics of the Cozo database.
------------------------
Downloading Cozo
------------------------
Cozo is distributed as a single executable.
Precompiled binaries can be downloaded from the `release page <https://github.com/cozodb/cozo/releases>`_,
currently available for Linux (Intel x64), Mac (Intel x64 and Apple ARM) and Windows (Intel x64).
For Windows users,
we recommend running Cozo under `WSL <https://learn.microsoft.com/en-us/windows/wsl/install>`_ if possible,
especially if your workload is heavy, as the Windows version runs more slowly.
---------------
Starting Cozo
@ -38,7 +34,7 @@ If ``<PATH_TO_DATA_DIRECTORY>`` does not exist, it will be created.
Cozo will then start a web server and bind to address ``127.0.0.1`` and port ``9070``.
These two can be customized: run the executable with the ``-h`` option to learn how.
To stop Cozo, press ``CTRL-C`` in the terminal, or send ``SIGTERM`` to the process with e.g. ``kill``.
-----------------------
The query API
@ -46,7 +42,7 @@ The query API
Queries are run by sending HTTP POST requests to the server.
By default, the API endpoint is ``http://127.0.0.1:9070/text-query``.
A JSON body is expected::
{
"script": "<COZOSCRIPT QUERY STRING>",
@ -54,34 +50,26 @@ The structure of the expected JSON payload is::
}
``params`` should be an object of named parameters.
For example, if ``params`` is ``{"num": 1}``,
then ``$num`` can be used anywhere in your query string where an expression is expected.
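For instance, the full payload for such a parametrized query might look like this
(the script itself is illustrative)::

{
    "script": "?[x] := x = $num",
    "params": {"num": 1}
}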
Always use ``params`` instead of concatenating strings when you need parametrized queries.
.. WARNING::
Cozo is designed to run in a trusted environment and be used by trusted clients.
It does not come with elaborate authentication and security features.
If you must access Cozo remotely,
you are responsible for setting up firewalls, encryption and proxies yourself.
As a guard against users accidentally exposing sensitive data,
if you bind Cozo to non-loopback addresses,
Cozo will generate a token string and require all queries
from non-loopback addresses to provide the token string
in the HTTP header field ``x-cozo-auth``.
The warning printed when you start Cozo with a non-default binding will tell you
where to find the token string.
This "security measure" is not considered sufficient for any purpose
and is only intended as a last defence against carelessness.
--------------------------------------------------
Running queries
@ -91,27 +79,34 @@ Running queries
Making HTTP requests
^^^^^^^^^^^^^^^^^^^^^^^^^^
As Cozo has an HTTP-based API,
it is accessible from any language capable of making web requests.
The structure of the API is also deliberately kept minimal so that no dedicated clients are necessary.
As an example, the following runs a system op with the ``curl`` command line tool::
curl -X POST localhost:9070/text-query \
-H 'content-type: application/json' \
-d '{"script": "::running", "params": {}}'
The responses are JSON when queries are successful,
or text descriptions when errors occur,
so a language only needs to be able to process JSON to use Cozo.
^^^^^^^^^^^^^^^^^^^^^^^^^
JupyterLab
^^^^^^^^^^^^^^^^^^^^^^^^^
Cozo has special support for running queries in `JupyterLab <https://jupyterlab.readthedocs.io/en/stable/>`_,
a web-based notebook interface
in the Python ecosystem, heavily used by data scientists.
First, install JupyterLab by following its instructions.
Then install the ``pycozo`` library::
pip install "pycozo[pandas]"
Open the JupyterLab web interface, start a Python 3 kernel,
and in a cell run the following `magic command <https://ipython.readthedocs.io/en/stable/interactive/magics.html>`_::
%load_ext pycozo.ipyext_direct
@ -119,35 +114,33 @@ and in a cell run the following `magic command <https://ipython.readthedocs.io/e
If you need to connect to Cozo using a non-default address or port,
or you require an authentication string, you need to run the following magic commands as well::
%cozo_host http://<ADDRESS>:<PORT>
%cozo_auth <AUTH_STRING>
Now, when you execute cells in the notebook,
the content will be sent to Cozo and interpreted as CozoScript.
Returned relations will be formatted as `Pandas dataframes <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_.
The extension ``pycozo.ipyext_direct`` used above sets up the notebook in the Direct Cozo mode,
where cells are by default interpreted as CozoScript.
Python code can be run by starting the first line of a cell with ``%%py``.
The Indirect Cozo mode can be started by::
%load_ext pycozo.ipyext
In this mode, only cells with the first line ``%%cozo`` are interpreted as CozoScript.
Other cells are interpreted in the normal way (python code).
The Indirect mode is useful if you need post-processing and visualizations.
When a query executes successfully,
the resulting Pandas dataframe is bound to the Python variable ``_``.
A few other magic commands are available:
* ``%cozo_run_file <PATH_TO_FILE>`` runs a local file as CozoScript.
* ``%cozo_run_string <VARIABLE>`` runs the string content of the named variable as CozoScript.
* ``%cozo_set <KEY> <VALUE>`` sets a parameter with the name ``<KEY>`` to the expression ``<VALUE>``.
The updated parameters will be used by subsequent queries.
* ``%cozo_set_params <PARAM_MAP>`` replaces all parameters with the given expression,
which must evaluate to a dictionary with string keys.
* ``%cozo_clear`` clears all set parameters.
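For example (a sketch; ``num`` is an arbitrary parameter name), after running::

%cozo_set num 42

a subsequent CozoScript cell can refer to the parameter as usual::

?[x] := x = $num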
@ -157,13 +150,8 @@ There are a few other useful magic commands:
The Makeshift JavaScript Console
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you are reluctant to install Python and Jupyter, you may consider the Makeshift JavaScript Console.
To get started, you need a browser on your local machine.
We recommend `Firefox <https://www.mozilla.org/en-US/firefox/new/>`_, `Chrome <https://www.google.com/chrome/>`_,
or any Chromium-based browser for best display.
@ -177,52 +165,46 @@ and switch to the "Console" tab. Now you can execute CozoScript by running::
await run("<COZOSCRIPT>")
The returned relations will be formatted as tables.
If you need to pass in parameters, provide a second parameter with a JavaScript object::
await run("<COZOSCRIPT>", <PARAMS>)
If you need to set an auth string, modify the global variable ``COZO_AUTH``.
----------------------------
Building Cozo from source
----------------------------
If for some reason the binary distribution does not work for you,
you can build Cozo from source.
You need to install the `Rust toolchain <https://www.rust-lang.org/tools/install>`_ on your system.
You also need a C++17 compiler.
Clone the Cozo git repo::
git clone https://github.com/cozodb/cozo.git --recursive
You need to pass the ``--recursive`` flag so that submodules are also cloned. Next, run in the root of the cloned repo::
cargo build --release
Wait for potentially a long time, and you will find the compiled binary in ``target/release``.
You can run ``cargo build --release -F jemalloc`` instead
to indicate that you want to compile and use jemalloc as the memory allocator for the RocksDB storage backend,
which can make a difference in performance depending on your workload.
--------------------------------
Embedding Cozo
--------------------------------
Here "embedded" means running in the same process as your program.
As ``cozoserver`` is just a very thin wrapper around the Cozo rust library,
you can use the library directly in your program.
You can run Cozo in the same process as your main program.
For Rust programs, as ``cozoserver`` is just a very thin wrapper around the Cozo rust library,
you can use the library directly.
For languages other than Rust, you will need to provide custom bindings,
though for `Python <https://pyo3.rs/>`_ and `NodeJS <https://neon-bindings.com/>`_ this is trivial.
Note that Cozo, with its underlying RocksDB storage, will always use multiple threads, embedded or not.
@ -2,15 +2,15 @@
Stored relations and transactions
====================================
In Cozo, data are stored in *stored relations* on disk.
---------------------------
Stored relations
---------------------------
To query stored relations,
use the ``*relation[...]`` or ``*relation{...}`` atoms in inline or fixed rules,
as explained in the last chapter.
To manipulate stored relations, use one of the following query options:
.. module:: QueryOp
@ -18,39 +18,39 @@ To manipulate stored relations, use one of the following query options:
.. function:: :create <NAME> <SPEC>
Create a stored relation with the given name and spec.
No stored relation with the same name can exist beforehand.
If a query is specified, data from the resulting relation is put into the newly created stored relation.
This is the only query option for stored relations for which the query may be omitted.
.. function:: :replace <NAME> <SPEC>
Similar to ``:create``, except that if the named stored relation exists beforehand,
it is completely replaced. The schema of the replaced relation need not match the new one.
You cannot omit the query for ``:replace``.
If there are any triggers associated, they will be preserved. Note that this may cause errors
if ``:replace`` changes the schema.
.. function:: :put <NAME> <SPEC>
Put rows from the resulting relation into the named stored relation.
If keys from the data exist beforehand, the corresponding rows are replaced with new ones.
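For instance, a minimal sketch (assuming a relation created with ``:create rel {a => b}``)::

?[a, b] <- [[1, 'one'], [2, 'two']]

:put rel {a => b}

Rows with keys ``1`` and ``2`` are inserted, or replaced if they already exist.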
.. function:: :ensure <NAME> <SPEC>
Ensure that rows specified by the output relation and spec exist in the database,
and that no other process has written to these rows when the enclosing transaction commits.
Useful for ensuring read-write consistency.
.. function:: :rm <NAME> <SPEC>
Remove rows from the named stored relation. Only keys should be specified in ``<SPEC>``.
Removing a non-existent key is not an error and does nothing.
.. function:: :ensure_not <NAME> <SPEC>
Ensure that rows specified by the output relation and spec do not exist in the database
and that no other process has written to these rows when the enclosing transaction commits.
Useful for ensuring read-write consistency.
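For instance, a minimal sketch (the relation ``users``, keyed by ``email``, is hypothetical)
that lets the transaction commit only if the given email is not yet stored::

?[email] <- [[$email]]

:ensure_not users {email}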
You can rename and remove stored relations with the system ops ``::relation rename`` and ``::relation remove``,
@ -60,10 +60,10 @@ described in the system op chapter.
Create and replace
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The format of ``<SPEC>`` is identical for all four ops, but the semantics is a bit different.
We first describe the format and semantics for ``:create`` and ``:replace``.
A spec, or a specification for columns, is enclosed in curly braces ``{}`` and separated by commas::
?[address, company_name, department_name, head_count] <- $input_data
@ -75,21 +75,18 @@ A spec is a specification for columns, enclosed in curly braces ``{}`` and separ
address: String,
}
Columns before the symbol ``=>`` form the *keys* (actually a composite key) for the stored relation,
and those after it form the *values*.
Each key corresponds to a single value.
If all columns are keys, the symbol ``=>`` may be omitted.
The order of columns matters.
Rows are stored in lexicographically sorted order in trees according to their keys.
In the above example, we explicitly specified the types for all columns.
Type specification is described in its own chapter.
In case of type mismatch,
the system will first try to coerce the values given, and if that fails, the query is aborted with an error.
You can omit types for columns, in which case their types default to ``Any?``,
i.e. all values are acceptable.
For example, the above query with all types omitted is::
?[address, company_name, department_name, head_count] <- $input_data
@ -108,10 +105,11 @@ You can also explicitly specify the correspondence::
address: String = b
}
You *must* use explicit correspondence if the entry head contains aggregation,
since names such as ``count(c)`` are not valid column names.
The ``address`` field above shows how to specify both a type and a correspondence.
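A minimal sketch of the aggregation case (the relation and column names are illustrative;
the correspondence syntax follows the ``address`` example above)::

?[dept, count(emp)] := *staff{dept, emp}

:create dept_size {
    dept: String
    =>
    size: Int = count(emp)
}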
Instead of specifying bindings, you can specify an expression that generates default values by using ``default``::
?[a, b] <- $input_data
@ -123,7 +121,7 @@ Instead of specifying bindings, you can specify an expression to generate values
address default ''
}
The expression is evaluated anew for each row, so if you specified a UUID-generating function,
you will get a different UUID for each row.
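A minimal sketch of this (assuming one of the random-UUID functions, e.g. ``rand_uuid_v1``,
from the function reference; the relation is illustrative)::

?[name] <- [['alice'], ['bob']]

:create person {
    id default rand_uuid_v1()
    =>
    name: String
}

Each inserted row then receives a distinct generated ``id``.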
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -132,10 +130,8 @@ Put, remove, ensure and ensure-not
For ``:put``, ``:rm``, ``:ensure`` and ``:ensure_not``,
you do not need to specify all existing columns in the spec if the omitted columns have a default generator,
in which case the generator will be used to generate a value,
or if the type of the column is nullable, in which case the value defaults to ``null``.
For these operations, specifying default values does not have any effect and will not replace existing ones.
For ``:put`` and ``:ensure``, the spec needs to contain enough bindings to generate all keys and values.
For ``:rm`` and ``:ensure_not``, it only needs to generate all keys.
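For instance, a minimal sketch (again assuming ``rel {a => b}``): removal needs only the keys,
even though the relation has a value column::

?[a] <- [[1], [3]]

:rm rel {a}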
@ -151,71 +147,88 @@ by wrapping each query in curly braces ``{}``.
Each query can have its own query options.
Execution proceeds for each query serially, and aborts at the first error encountered.
The returned relation is that of the last query.
The ``:assert (some|none)``, ``:ensure`` and ``:ensure_not`` query options allow you to express complicated constraints
that must be satisfied for your transaction to commit.
This example uses three queries to put and remove rows atomically
(either all succeed or all fail), and ensure that at the end of the transaction
an untouched row exists::
{
?[a, b] <- [[1, 'one'], [3, 'three']]
:put rel {a => b}
}
{
?[a] <- [[2]]
:rm rel {a}
}
{
?[a, b] <- [[4, 'four']]
:ensure rel {a => b}
}
When a transaction starts, a snapshot is used,
so that only already committed data,
or data written within the same transaction, are visible to queries.
At the end of the transaction, changes are only committed if there are no conflicts
and no errors are raised.
If any mutations activate triggers, those triggers execute in the same transaction.
------------------------------------------------------
Triggers and indices
------------------------------------------------------
Cozo does not have traditional indices on stored relations.
Instead, you define regular stored relations that are used as indices.
At query time, you explicitly query the index instead of the original stored relation.
You synchronize your indices and the original by ensuring that any mutations you do on the database
write the correct data to the "canonical" relation and its indices in the same transaction.
As doing this by hand for every mutation leads to lots of repetitions
and is error-prone,
Cozo supports *triggers* to do it automatically for you.
You attach triggers to a stored relation by running the system op ``::set_triggers``::
::set_triggers <REL_NAME>
on put { <QUERY> } # you can specify as many triggers as you need
on rm { <QUERY> }
on replace { <QUERY> }
``<QUERY>`` can be any valid query.
The ``on put`` triggers will run when new data is inserted or upserted,
which can be activated by ``:put``, ``:create`` and ``:replace`` query options.
The implicitly defined rules ``_new[]`` and ``_old[]`` can be used in the triggers, and
contain the added rows and the replaced rows respectively.
The ``on rm`` triggers will run when data is deleted, which can be activated by a ``:rm`` query option.
The implicitly defined rules ``_new[]`` and ``_old[]`` can be used in the triggers,
and contain the keys specified for deletion (even if no rows with those keys actually exist)
and the rows actually deleted (with both keys and non-keys), respectively.
The ``on replace`` triggers will be activated by a ``:replace`` query option.
They are run before any ``on put`` triggers.
All triggers for a relation must be specified together, in the same ``::set_triggers`` system op.
If used again, all the triggers associated with the stored relation are replaced.
To remove all triggers from a stored relation, use ``::set_triggers <REL_NAME>`` followed by nothing.
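For instance, to replace whatever triggers the hypothetical relation ``rel`` currently has
with none at all, run the op bare::

::set_triggers rel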
As an example of using triggers to maintain an index, suppose we have the following relation::
:create rel {a => b}
We often want to query ``*rel[a, b]`` with ``b`` bound but ``a`` unbound. This will cause a full scan,
which can be expensive. So we need an index::
:create rel.rev {b, a}
In the general case, we cannot assume a functional dependency ``b => a``, so in the index both fields appear as keys.
To manage the index automatically::
::set_triggers rel
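
# the trigger bodies are elided in this hunk; a minimal sketch might be:
on put {
    ?[a, b] := _new[a, b]
    :put rel.rev {b, a}
}
on rm {
    ?[a, b] := _old[a, b]
    :rm rel.rev {b, a}
}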
@ -233,14 +246,14 @@ To manage the index automatically, simply do::
With the index set up, you can use ``*rel.rev{..}`` in place of ``*rel{..}`` in your queries.
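For instance (a sketch; ``$b_val`` is an illustrative parameter), to find all ``a`` with a given ``b``::

?[a] := *rel.rev{b: $b_val, a}

This avoids the full scan that ``*rel{a, b}`` with only ``b`` bound would require.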
Indices in Cozo are manual, but extremely flexible, since you need not conform to any predetermined patterns
in your use of ``_old[]`` and ``_new[]``.
For simple queries, the need to explicitly elect to use an index can seem cumbersome,
but for complex ones, the deterministic evaluation entailed can be a huge blessing.
Triggers can be creatively used for other purposes as well.
.. WARNING::
Loops in your triggers can cause non-termination.
A loop occurs when a relation has triggers which affect other relations,
which in turn have other triggers that ultimately affect the starting relation.
@ -12,21 +12,21 @@ In the following, we explain what each system op does, and the arguments they ex
Explain
--------------
.. function:: ::explain { <QUERY> }
A single query is enclosed in curly braces. Query options are allowed but ignored.
The query is not executed, but its query plan is returned instead.
Currently, there is no specification for the return format,
but if you are familiar with the semi-naïve evaluation of stratified Datalog programs
subject to magic-set rewrites, you can decipher the result.
----------------------------------
Ops for stored relations
----------------------------------
.. function:: ::relations
List all stored relations in the database.
.. function:: ::columns <REL_NAME>
@ -44,9 +44,9 @@ Ops on stored relations
Display triggers associated with the stored relation ``<REL_NAME>``.
.. function:: ::set_triggers <REL_NAME> ...
Set triggers for the stored relation ``<REL_NAME>``. This is explained in more detail in the transaction chapter.
.. function:: ::access_level <REL_NAME> <ACCESS_LEVEL>
@ -57,7 +57,8 @@ Ops on stored relations
* ``read_only`` additionally disallows any mutations and setting triggers,
* ``hidden`` additionally disallows any data access (metadata access via ``::relations``, etc., are still allowed).
The access level functionality is meant to protect data from programmer mistakes,
not from attacks by malicious parties.
------------------------------------
Monitor and kill
@ -65,7 +66,7 @@ Monitor and kill
.. function:: ::running
Display running queries and their IDs.
.. function:: ::kill <ID>
@ -77,5 +78,5 @@ Maintenance
.. function:: ::compact
Instructs Cozo to run a compaction job.
Compaction makes the database smaller on disk and faster for read queries.