# Air-data acrobatics

## Hello, world!

Let's start exploring the Cozo database by following the "hello world" tradition:

In [5]:
?[a, b, c] <- [['hello', 'world', 'Cozo!']]

a,b,c
hello,world,Cozo!


Let's break that down. This query consists of two parts: the part before `<-` is called its _head_, and the part after is called its _body_. The symbol `<-` itself denotes that this is a _constant rule_, or a declaration of _facts_.

The head has the special name `?`, indicating the _entry_ of the query, which has three _arguments_ `a`, `b`, and `c`.

The body consists of a list of lists (in this case a list of a single inner list). Each inner list represents a _tuple_, which is similar to a row in a relational database. The length of the inner list must match the number of arguments of the head, and each argument is then _bound_ to the corresponding value in the inner list by position.

Of course more than one inner list is allowed:

In [6]:
?[a, b, c] <- [['hello', 'world', 'Cozo!'],
               ['hello', 'world', 'database!']]

a,b,c
hello,world,Cozo!
hello,world,database!


Let's try the following:

In [7]:
?[a] <- [['hello'], ['world'], ['Cozo!']]

a
Cozo!
hello
world


Now we have three inner lists of length 1 each. The returned results are also _sorted_: all relations in Cozo are sorted lexicographically by position.

Cozo operates on _set semantics_ instead of _bag semantics_: observe

In [8]:
?[a] <- [['hello'], ['world'], ['Cozo!'], ['hello'], ['world'], ['Cozo.']]

a
Cozo!
Cozo.
hello
world


`'hello'` and `'world'` both appear only once in the result, even though they appear twice each in the input. Set semantics automatically de-duplicates based on the whole tuple.
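As a rough mental model (a Python sketch, not how Cozo is implemented), a relation behaves like a set of tuples that is sorted on output. Python compares strings by code point, which is only an assumption standing in for Cozo's own ordering; for these ASCII values it gives the same order as the result above.

```python
# Rows as given in the query above, duplicates included.
rows = [("hello",), ("world",), ("Cozo!",), ("hello",), ("world",), ("Cozo.",)]

# set() drops duplicate tuples; sorted() orders the survivors lexicographically.
relation = sorted(set(rows))

for (a,) in relation:
    print(a)
```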

## Values and expressions

The list of lists in the body of the rules certainly looks familiar to anyone who has used languages such as JavaScript or Python. In fact, with the exception of the map `{}`, valid JSON values represent valid Cozo values.

As sorting is important in Cozo, study the following example, which demonstrates how different values are sorted:

In [18]:
?[a] <- [[true],
         [false], 
         [null],
         ["A"], 
         ['apple'], // single or double quotes are both OK 
         ["Apple juice"], 
         [['apple', 1, [2, 3]]],  // this row consists of a list consisting of heterogeneous items!
         [1.0], 
         [1_234_567], // you can separate digits with underscores for clarity
         [3.14159], 
         [-8e-99]]

a
""
false
true
-8e-99
1
3.14159
1234567
A
Apple juice
apple


Notice how comments are entered, just like in JavaScript. `/* ... */` also works.

In the playground, literal strings appear in black, numbers in blue, and reddish entries represent values that should be parsed as JSON.

Even though the kind of rule we have been using is called the _constant rule_, you can in fact compute in it:

In [76]:
?[i, a] <- [[1, 1 + 2], 
            [2, 3 * 4], 
            [3, 5 / 6], 
            [4, exp(7)], 
            [5, uppercase('number ') ++ to_string(10)],  // string concatenation
            [6, to_float('PI')]]

i,a
1,3
2,12
3,0.8333333333333334
4,1096.6331584284585
5,NUMBER 10
6,3.141592653589793


For clarity, we have used the index `i` to force the results to show in this order.

For the full list of functions you can use in expressions, consult the Manual.

There is one thing we need to make clear at this point. In CozoScript, only `true` is true, and only `false` is false. This is not a tautology: every other value, including `null`, produces an error when put in a position requiring a truthy value. In this sense, `null` in CozoScript is only a _marker_. It has no inherent logical semantics associated with it, unlike `NULL` in SQL, `null` and `undefined` in JavaScript, and `None` in Python. An example:

In [67]:
?[a] <- [[!null]]

This query is rejected with an error. In this case you really need to write

In [68]:
?[a] <- [[!is_null(null)]]

a
False


This may seem a nuisance in trivial cases, but will save you a lot of hair in hairy situations. Believe me.
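To make the contrast with Python concrete, here is a small sketch of a strict negation that accepts only real booleans. The helper name `strict_not` is hypothetical and exists only for this illustration; it models the behaviour described above and is not part of Cozo.

```python
def strict_not(value):
    """Negate a value, but only if it is exactly True or False."""
    if value is True or value is False:
        return not value
    # Anything else, including None, is an error rather than "falsy".
    raise TypeError(f"expected a boolean, got {value!r}")


# Python's own `not` happily treats None as false:
print(not None)            # Python's permissive behaviour
print(strict_not(False))   # the strict version still works on booleans
```

Calling `strict_not(None)` raises `TypeError`, which mirrors the way `!null` is refused instead of being silently coerced.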

## Horn-clause rules

Usually constant rules are used to define ad-hoc facts useful for subsequent queries:

In [78]:
?[loving, loved] := loves[loving, loved] // Yes, this is the 'subsequent query'. In a logical sense. 
                                         // The order of rules has no significance whatsoever.

loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

loving,loved
alice,eve
bob,alice
charlie,eve
david,george
eve,alice
eve,bob
eve,charlie
george,george


The constant rule is now named `loves`, denoting a rather complicated relationship network (aren't 'relationship' and 'network' synonyms?). It reads like "Alice loves Eve, Bob loves Alice", "nobody loves David, David loves George, but George only loves himself", and so on. Note that for constant rules we can actually omit the arguments (but if explicitly given, the arity must match the actual data).

The entry `?` is now a _Horn-clause rule_, signified by the symbol `:=`. Its body has a single _application_ of the rule we have just defined, with _bindings_ `loving` and `loved` for the arguments. These bindings are then carried to the output via the arguments of the entry rule.

Here both bindings to the rule application of `loves` are initially _unbound_, in which case all tuples of `loves` are returned. To _bind_ an argument simply pass a constant in:

In [80]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

?[loved_by_eve] := loves['e' ++ 'v' ++ 'e', loved_by_eve] // Eve loves dramatic entrance

loved_by_eve
alice
bob
charlie


Every argument position can be bound:

In [27]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

?[loves_eve] := loves[loves_eve, 'eve']

loves_eve
alice
charlie


Multiple clauses can appear in the body, in which case they are implicitly conjoined, meaning that all clauses must bind for a result to return:

In [39]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

?[loved_by_b_e] := loves['eve', loved_by_b_e], loves['bob', loved_by_b_e]

loved_by_b_e
alice


We see that Alice is loved by both Bob and Eve. The variable `loved_by_b_e` appears in both clauses, in which case they are _unified_, meaning that they must bind to the _same_ value for a tuple to return.
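One way to picture unification is as a join condition. The following Python sketch (a model of the semantics only, not of Cozo's execution strategy) pairs every tuple of `loves` with every other tuple and keeps the combinations where the shared variable takes the same value in both clauses.

```python
loves = {
    ("alice", "eve"), ("bob", "alice"), ("eve", "alice"), ("eve", "bob"),
    ("eve", "charlie"), ("charlie", "eve"), ("david", "george"),
    ("george", "george"),
}

# Each `for` corresponds to one clause; the final equality is the unification
# of the variable that appears in both clauses.
loved_by_b_e = sorted({
    loved_e
    for loving_e, loved_e in loves
    for loving_b, loved_b in loves
    if loving_e == "eve" and loving_b == "bob" and loved_e == loved_b
})

print(loved_by_b_e)
```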

Disjunction, meaning that _any_ clause with a successful binding potentially contributes to the results, must be specified explicitly:

In [75]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

?[loved_by_b_e] := loves['eve', loved_by_b_e] or loves['bob', loved_by_b_e], 
                   loved_by_b_e != 'bob', 
                   loved_by_b_e != 'eve'

loved_by_b_e
alice
charlie


As we can see, disjunctive clauses are connected by `or`. It binds more strongly than the implicit conjunction `,`.
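The precedence matters for the result. The Python sketch below models both possible readings of the query with plain sets (the helper names `loved_by` and `passes_filters` are made up for this illustration); only the first reading reproduces the output shown above.

```python
loves = {
    ("alice", "eve"), ("bob", "alice"), ("eve", "alice"), ("eve", "bob"),
    ("eve", "charlie"), ("charlie", "eve"), ("david", "george"),
    ("george", "george"),
}


def loved_by(person):
    """Everyone that `person` loves."""
    return {loved for loving, loved in loves if loving == person}


def passes_filters(name):
    """The two inequality filters from the query."""
    return name != "bob" and name != "eve"


# Reading 1: `or` binds tighter, so the filters apply to the whole disjunction.
tight_or = sorted(
    x for x in loved_by("eve") | loved_by("bob") if passes_filters(x)
)

# Reading 2 (not what happens): the filters attach only to the second branch.
loose_or = sorted(
    loved_by("eve") | {x for x in loved_by("bob") if passes_filters(x)}
)

print(tight_or)
print(loose_or)
```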

Horn clause rules (and Horn clause rules only) may have multiple definitions _having equivalent heads_. The above query is identical in every way to the following:

In [44]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

?[loved_by_b_e] := loves['eve', loved_by_b_e], loved_by_b_e != 'bob', loved_by_b_e != 'eve'
?[loved_by_b_e] := loves['bob', loved_by_b_e], loved_by_b_e != 'bob', loved_by_b_e != 'eve'

loved_by_b_e
alice
charlie


If a Horn clause rule is not the entry, even the _names_ given to the arguments can differ. The bodies are not required to be of the same form, as long as they produce compatible outputs.

Besides rule applications, _filters_ can also appear in the body:

In [33]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

?[person, loved] := loves[person, loved], !ends_with(person, 'e')

person,loved
bob,alice
david,george


In this case only people whose names do not end in `'e'` are considered for the loving position.

By the way, if you are not interested in who the person in the loving position is, you can just omit it in the arguments to the entry:

In [34]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

?[loved] := loves[person, loved], !ends_with(person, 'e')

loved
alice
george


... but every argument in the head of any Horn-clause rule must appear in the body, of course:

In [35]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

?[the_alien, loved] := loves[person, loved], !ends_with(person, 'e')

## Negation

The next query finds those who are loved by Eve, but not by Bob:

In [60]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

?[loved_by_e_not_b] := loves['eve', loved_by_e_not_b], not loves['bob', loved_by_e_not_b]

loved_by_e_not_b
bob
charlie


Here we are using the `not` keyword to _negate_ the rule application `loves`. This negation is at the level of Horn-clauses, which is not the same as the level of expressions. In fact, there are two sets of related but inequivalent operators:

* For Horn clauses: `,` (conjunction), `or` (disjunction), `not` (negation)
* For boolean expressions: `&&` (conjunction), `||` (disjunction), `!` (negation)

Hopefully you are already familiar with the boolean set of operators. If you use them in the wrong way, the query compiler will yell at you. And you will comply.

Negation has to abide by the _safety rule_. Let's violate it:

In [64]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

?[not_loved_by_b] := not loves['bob', not_loved_by_b]

Oh no! The query compiler rejects our perfectly reasonable query trying to determine those poor souls not loved by Bob!

But is our query really reasonable? For example, should the query return a tuple containing 'gold', since according to facts at hand, Bob clearly has no interest in 'gold'? So should our query return every possible string except a select few? Do you want your computer to handle such a query?

Now you understand what the help message above is trying to tell you.

To make our query really reasonable, we have to explicitly give our query a _closed world_ in which to operate the negation:

In [65]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]
            
the_population[p] := loves[p, _a]
the_population[p] := loves[_a, p]

?[not_loved_by_b] := the_population[not_loved_by_b], not loves['bob', not_loved_by_b]

not_loved_by_b
bob
charlie
david
eve
george


Now the query understands that we are asking our question _within_ the people in the love network. It then proceeds without complaints.

Let's state the **safety rule for negation**: _at least one_ argument of the rule application must be bound elsewhere (otherwise the clause will produce an infinity of candidate tuples), and _all arguments_ to negated clauses are _not_ considered bound, _unless_ they also appear elsewhere in a positive context.
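In set terms, a safe negation is a set difference against a finite set of candidates. Here is a Python sketch of the closed-world query above, as a model of the semantics only: `the_population` supplies the finite candidates, and the negated clause removes some of them.

```python
loves = {
    ("alice", "eve"), ("bob", "alice"), ("eve", "alice"), ("eve", "bob"),
    ("eve", "charlie"), ("charlie", "eve"), ("david", "george"),
    ("george", "george"),
}

# Everyone who appears in either position of `loves`: the closed world.
the_population = {person for pair in loves for person in pair}

# The positive facts that the negation refers to.
loved_by_bob = {loved for loving, loved in loves if loving == "bob"}

# Negation is only meaningful relative to the finite population.
not_loved_by_b = sorted(the_population - loved_by_bob)

print(not_loved_by_b)
```

Without `the_population` there is nothing finite to subtract from, which is the situation the safety rule forbids.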

If you can't wrap your head around the rule yet, don't worry. Just write your query. Return here and reread this section when you encounter some error messages similar to the above.

## Unification

We have seen that variables with repeated appearance in rule applications and predicates are implicitly unified. You can also _explicitly_ unify a variable with the unify operator `<-`:

In [46]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

?[loves_eve] := eve <- 'eve', loves[loves_eve, eve]

loves_eve
alice
charlie


By the way, the _order_ in which clauses appear in a Horn-clause rule can never affect the result in any way (provided your queries do not contain random functions):

In [47]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

?[loves_eve] := loves[loves_eve, eve], eve <- 'eve'

loves_eve
alice
charlie


... but the performance might vary, sometimes greatly. This is an advanced topic that we will come back to in a later session. For trivial examples like ours it doesn't matter. In your own explorations, just try to put more 'restrictive' rules first (meaning that they filter out a greater number of tuples), and you will be fine most of the time.

There is also the spread-unify operator `<- ..`, which unifies the left hand side with values in a list one at a time:

In [50]:
?[u] := u <- ..['a', 'b', 'c']

u
a
b
c


Another example: this is the "Cartesian product":

In [52]:
?[u, v] := u <- ..['a', 'b', 'c'], v <- ..['x', 'y']

u,v
a,x
a,y
b,x
b,y
c,x
c,y


You may notice that paired with functions extracting elements from lists, we don't actually need constant rules anymore. But constant rules are more explicit when you really have _facts_ as inputs.
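For readers coming from Python, the two spread-unify clauses in the Cartesian-product query above behave much like `itertools.product` over the two lists. This is an analogy only, with the result collected into a sorted set to mimic the relation semantics.

```python
from itertools import product

us = ["a", "b", "c"]
vs = ["x", "y"]

# Every combination of one element from each list, de-duplicated and sorted.
pairs = sorted(set(product(us, vs)))

for u, v in pairs:
    print(u, v)
```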

## Recursion

Now we come to the "poster boy" query of classical Datalog: let's find out all the people loved by Alice, or loved by someone loved by Alice, or loved by someone loved by someone loved by Alice, _ad infinitum_:

In [56]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

alice_love_chain[person] := loves['alice', person]
alice_love_chain[person] := alice_love_chain[in_person], loves[in_person, person]

?[chained] := alice_love_chain[chained]

chained
alice
bob
charlie
eve


Someone "chained" is either loved by Alice directly, or loved by someone already in the chain. The query as written reads very naturally. This is why this "transitive closure" type of query is the poster-boy query of classical Datalog. 
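If you prefer to see the computation spelled out, the following Python sketch evaluates the same two rules by hand. It illustrates the idea and makes no claim about Cozo's actual evaluation algorithm; the function name `love_chain` is made up for this example.

```python
loves = {
    ("alice", "eve"), ("bob", "alice"), ("eve", "alice"), ("eve", "bob"),
    ("eve", "charlie"), ("charlie", "eve"), ("david", "george"),
    ("george", "george"),
}


def love_chain(loves, start):
    """Everyone reachable from `start` by following `loves` one or more times."""
    chain = set()
    # First rule: people loved by `start` directly.
    frontier = {loved for loving, loved in loves if loving == start}
    while frontier:
        chain |= frontier
        # Second rule: people loved by someone newly added to the chain,
        # minus those we already know about.
        frontier = {loved for loving, loved in loves if loving in frontier} - chain
    return chain


chained = sorted(love_chain(loves, "alice"))
print(chained)
```

The loop stops as soon as a round adds nobody new, which is why the cycle between Alice and Eve does not cause an infinite loop.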

Writing the same thing in SQL requires a recursive CTE, and those CTEs escalate pretty quickly. On the other hand, if well written, Datalog queries can weather very demanding situations and remain readable.

Recursive queries are essential for working with graphs (networks). So they had better be easy to write _and_ read in a database claiming to be optimized for graphs.

We've talked about the safety rule for negation above. You may suspect that something similar is at play here. Let's retry the above query, but omit the starting condition `alice_love_chain[person] := loves['alice', person]`:

In [66]:
loves[] <- [['alice', 'eve'],
            ['bob', 'alice'],
            ['eve', 'alice'],
            ['eve', 'bob'],
            ['eve', 'charlie'],
            ['charlie', 'eve'],
            ['david', 'george'],
            ['george', 'george']]

alice_love_chain[person] := alice_love_chain[in_person], loves[in_person, person]

?[chained] := alice_love_chain[chained]

chained


Are you surprised that the compiler did not complain? Are you surprised that it returned no results? This is the _closed-world assumption_ hinted at above at play again. If there is no way to _deduce_ a fact from the given facts, _then_ the fact itself is false.
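The empty result falls out naturally if you start from nothing and only add what the rules let you deduce. A Python sketch of that process, with the base rule switchable on and off (the function names are invented for this illustration):

```python
loves = {
    ("alice", "eve"), ("bob", "alice"), ("eve", "alice"), ("eve", "bob"),
    ("eve", "charlie"), ("charlie", "eve"), ("david", "george"),
    ("george", "george"),
}


def consequences(chain, with_base_rule):
    """Facts deducible in one step from the facts currently in `chain`."""
    derived = {loved for loving, loved in loves if loving in chain}
    if with_base_rule:
        derived |= {loved for loving, loved in loves if loving == "alice"}
    return derived


def least_fixed_point(with_base_rule):
    """Start from no facts and repeat the deduction step until nothing changes."""
    chain = set()
    while True:
        new_chain = chain | consequences(chain, with_base_rule)
        if new_chain == chain:
            return chain
        chain = new_chain


print(sorted(least_fixed_point(with_base_rule=True)))
print(sorted(least_fixed_point(with_base_rule=False)))
```

With the base rule removed, the very first step derives nothing from the empty set, so the process stops immediately with no facts.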

This so-called "least fixed point" semantics is the semantics of Datalog queries. This semantics is actually subtly different from SQL, due to the existence of `UNKNOWN` in SQL, usually manifesting as `NULL`. In other words, SQL operates on [ternary logic](https://en.wikipedia.org/wiki/Three-valued_logic) whereas Datalog stays boolean all the way (under the protection of the closed-world assumption).

Still, there are _rules_ with respect to recursion. [Bertrand Russell](https://en.wikipedia.org/wiki/Russell%27s_paradox) would rush to write:

In [74]:
world[a] := a <- ..[1, 2]

p[a] := world[a], not q[a]
q[a] := world[a], not p[a]

?[a] := p[a]

The above query does not violate the safety rule of negation (because he put a `world` in front of each negation), but the compiler still rejects it. Don't worry about the unworldly incantation the error makes. Instead, think for a moment what the result _could_ be.

You can verify that the result could be the single tuple `[1]` with the assignment `p[a] <- [[1]]` and `q[a] <- [[2]]`, _or_ the single tuple `[2]` with the assignment `p[a] <- [[2]]` and `q[a] <- [[1]]`. The problem is, these answers contradict each other, and neither can be deduced _constructively_. So under the least fixed point semantics, this program has no _meaning_, and the compiler rejects it.
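You can also check this by brute force. The Python sketch below reads the two rules as the equations "p is the world minus q" and "q is the world minus p", which is an interpretation chosen for this illustration, and tries every possible pair of subsets. It finds the two assignments discussed above, plus two more in which one relation takes everything and the other is empty.

```python
from itertools import combinations

world = frozenset({1, 2})


def subsets(items):
    """All subsets of `items`, as frozensets."""
    ordered = sorted(items)
    return [
        frozenset(combo)
        for size in range(len(ordered) + 1)
        for combo in combinations(ordered, size)
    ]


# Keep every (p, q) that satisfies both rules at the same time.
consistent = [
    (p, q)
    for p in subsets(world)
    for q in subsets(world)
    if p == world - q and q == world - p
]

for p, q in consistent:
    print(sorted(p), sorted(q))
```

Several mutually incompatible assignments satisfy the rules, and nothing in the program singles one of them out.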

Again, don't worry if you can't exactly follow what is going on. Just trust that the compiler is trying to prevent your computer from imploding. Real applications don't tend to produce these kinds of contrived, paradoxical queries anyway.

## Conclusion

That's it! You have learned the basics of Datalog in the dialect CozoScript!

If you want to play more without going further for the moment, it is recommended that you skim through the list of functions in the Manual. Those functions allow you to do much more acrobatics with pure Datalog.

"We've seen data, but where is the BASE of dataBASE?", you ask, not content with being merely an air-datarist.

I'm glad you asked. Let's go to our base now!