# The Cozo database tutorial

This tutorial will teach you the basics of using the Cozo database with the query language CozoScript.
There are no database-specific prerequisites, 
though it would be helpful if you already know some other databases, 
especially SQL databases.

## Setup

The best way to learn from this tutorial is to run the queries as they are introduced.
For this, you need to install the Cozo database on your local machine.

Cozo is distributed as a single gzipped/zipped binary. 
Go to https://github.com/cozodb/cozo/releases and download the latest release binary for your operating system to a local directory.

If your operating system is Linux/Mac, 
open a terminal command prompt, 
`cd` into the directory of your download,
and run

```bash
gunzip cozoserver-*.gz
mv cozoserver-* cozoserver
chmod +x cozoserver
```

If you are on Windows instead, open a PowerShell window and run

```powershell
Expand-Archive -Path .\cozoserver-*.zip -DestinationPath .
```

To run the server, you need to specify a directory to store persistent data on your file system. 
In the following, we will use a directory called `tutorial-data` in the same directory as the binary executable.
In the terminal, run

```bash
./cozoserver ./tutorial-data
```

The same command should work in PowerShell as well.

If you see something like `Database web API running at ...` displayed in your terminal, 
then the server has started successfully. 
Keep the server running while you follow the tutorial.
When you are done, press `CTRL-C` in the terminal to stop the server.
You can restart the server by running the same command again.

More options when starting the server are available. Run

```bash
./cozoserver -h
```

for more details.

## A place to run queries

Cozo exposes an HTTP API, so theoretically you can follow along using tools like `curl`. 
If you are interested, consult the [manual](https://cozodb.github.io/current/manual/setup.html#the-query-api) for the request format the API expects.
For a better user experience, we suggest using one of the two options described in the subsections below instead.

### Option 1: the JupyterLab notebook

This option provides the best user experience but also requires you to install quite a lot of things, 
though you may already have them installed on your computer if you use the Python data science stack.
First, you will need Python installed. 
Then install JupyterLab by following the instructions at https://jupyter.org/install.
Next, run the following to install a Jupyter extension to help query Cozo:

```bash
pip install pycozo pandas
```

While you are at it, go to the source of this tutorial at https://github.com/cozodb/cozo/blob/main/tutorial/tutorial.ipynb, 
right-click on `Raw` and save the tutorial document to your disk.

Then run JupyterLab, open the saved tutorial document, and follow along.

We need to enable the extension in the notebook. Run

In [1]:
%load_ext pycozo.ipyext_direct

Then the "hello world" query:

In [159]:
?[] <- [['hello', 'world', 'Cozo!']]

Unnamed: 0,0,1,2
0,hello,world,Cozo!


If you get the same words back formatted in a table, congratulations! 
You can skip to the next section where we start learning CozoScript proper.
If you want to know more about what the `pycozo` extension did and more tricks that you can do with the extension, read the [manual](https://cozodb.github.io/current/manual/setup.html#jupyterlab).

### Option 2: the JavaScript console in your browser

If you have never used Python before, the first option may be overwhelming. 
Or perhaps you just want to try Cozo out quickly to decide whether it has anything interesting for you.
Whatever your reason for not wanting to install the whole Python toolchain, we have you covered.

Your local machine at least has a modern web browser, like a recent version of Firefox, Chrome, or Edge, right?
Good. 

Use your browser to navigate to http://127.0.0.1:9070 (the address shown in your terminal when you run `cozoserver`).
You should be greeted by a page saying that the server is running.
Now open the developer tools of your browser by right-clicking the page and selecting "Inspect" from the menu 
(if you cannot find it, you may need to fiddle with your browser settings to enable the developer tools).
Switch to the "Console" tab of the developer tools if it is not already open. 

If you see a message in which the "Cozo Makeshift Javascript Console" welcomes you, you are ready.
Run the "hello world" query by typing the following into the console and pressing enter:

```javascript
await run(`?[] <- [['hello', 'world', 'Cozo!']]`)
```

If you see the three words echoed back in a table, you have succeeded. When following the tutorial, you have to place each query between the backticks of the above command to run it in the JavaScript console.

## Your first relations

Cozo is a relational database. The "hello world" query

In [3]:
?[] <- [['hello', 'world', 'Cozo!']]

Unnamed: 0,0,1,2
0,hello,world,Cozo!


as you might have guessed, simply passes an ad hoc relation, here represented by a list of lists, to the database and asks the database to return the relation to you.

You can pass more rows, or a different number of columns, to further corroborate your guess:

In [4]:
?[] <- [[1, 2, 3], ['a', 'b', 'c']]

Unnamed: 0,0,1,2
0,1,2,3
1,a,b,c


This example shows how to enter literals for numbers, strings, booleans and `null`:

In [5]:
?[] <- [[1.5, 2.5, 3, 4, 5.5], 
        ['aA', 'bB', 'cC', 'dD', 'eE'], 
        [true, false, null, -1.4e-2, "A string with double quotes"]]

Unnamed: 0,0,1,2,3,4
0,True,False,,-0.014000,A string with double quotes
1,1.500000,2.500000,3,4,5.500000
2,aA,bB,cC,dD,eE


The literal representations are similar to those in JavaScript. 
In particular, strings in double quotes are guaranteed to be interpreted in the same way as in JSON.

You may be surprised by the order of the returned rows in the last example: the returned order is not the same as the input order.
This is because in Cozo relations are stored (either in memory or on disk) as trees, and trees are always sorted.

Another consequence of trees is that you cannot have duplicate rows:

In [6]:
?[] <- [[1], [2], [1], [2], [1]]

Unnamed: 0,0
0,1
1,2


We say that relations in Cozo follow _set semantics_ where de-duplication is automatic. 
By contrast, SQL usually follows _bag semantics_ (some databases do this by secretly having a unique internal key for every row; in Cozo you must do this explicitly if you need to simulate duplicate rows).
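
If you are more at home in a general-purpose language, the sorting and de-duplication described here can be pictured with a small Python sketch. It only models the observable behaviour and says nothing about how Cozo implements its trees; the `row_id` column is a hypothetical name introduced for illustration.

```python
# Rows as entered in the query, duplicates included.
rows = [[1], [2], [1], [2], [1]]

# Set semantics: a relation is a set of tuples, returned in sorted order.
relation = sorted(set(tuple(row) for row in rows))
print(relation)  # [(1,), (2,)]

# To keep "duplicates", make every row distinct with an explicit key column
# (the name row_id is an assumption made for this sketch).
with_ids = sorted(set((row_id, *row) for row_id, row in enumerate(rows)))
print(len(with_ids))  # 5
```

The second half mirrors what you would do in Cozo if you really needed duplicate rows: add a column whose only job is to tell the rows apart.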

Why does Cozo break tradition and go with set semantics?
Set semantics is much more convenient when recursion between relations is involved,
and Cozo is designed to deal with very complicated recursions.

## Expressions

The next example shows the use of various expressions and comments:

In [11]:
?[] <- [[
            1 + 2, # addition
            3 / 4, # division
            5 == 6, # equality
            7 > 8, # greater
            true || false, # or
            false && true, # and
            lowercase('HELLO'), # function
            rand_float(), # function taking no argument
            union([1, 2, 3], [3, 4, 5], [5, 6, 7]), # variadic function
        ]]

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,3,0.75,False,False,True,False,hello,0.65635,"[1, 2, 3, 4, 5, 6, 7]"


Notice in the last column the use of list literals within expressions. 
See [here](https://cozodb.github.io/current/manual/functions.html) for the full list of functions.
The syntax is deliberately made almost identical to that of C-like languages.

## Rules and relations

Previous examples all start with `?[] <-`. This denotes a _rule_ named `?`. It is a _constant rule_: when evaluated, it just echoes the list of lists back as a relation.

Rules can have other names, but the rule named `?` is special in that its evaluation determines the return relation of the query.

Before we go beyond constant rules, note that we can give _bindings_ in the _head_ of rules:

In [17]:
?[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]

Unnamed: 0,first,second,third
0,1,2,3
1,a,b,c


If you give bindings, the number of bindings must match the actual data; otherwise, you will get an error:

In [19]:
?[first, second] <- [[1, 2, 3], ['a', 'b', 'c']]

parser::fixed_rule_head_arity_mismatch

  × Fixed rule head arity mismatch
   ╭────
 1 │ ?[first, second] <- [[1, 2, 3], ['a', 'b', 'c']]
   · ─────────────────────────────────────────────────
   ╰────
  help: Expected arity: 3, number of arguments given: 2


Now let's define rules that use other rules:

In [21]:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[a, b, c] := rule[a, b, c]

Unnamed: 0,a,b,c
0,1,2,3
1,a,b,c


This first defines a constant rule named `rule`. The `?` rule is now an _inline rule_, denoted by the connecting symbol `:=`. In its body it _applies_ the constant rule by giving the name of the rule followed by three _fresh bindings_, which are the _variables_ `a`, `b` and `c`.

With inline rules, you can manipulate the order of the columns, or what columns are returned:

In [25]:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b] := rule[a, b, c]

Unnamed: 0,c,b
0,3,2
1,c,b


The body of an inline rule, that is, everything to the right of the connecting symbol `:=`, consists of _atoms_. 
The previous example has a single rule application atom as the body. Multiple atoms are connected by commas:

In [29]:
?[c, b] := rule[a, b, c], is_num(a)
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]

Unnamed: 0,c,b
0,3,2


Here the second atom is an _expression_ `is_num(a)`. 
Only rows for which the expression evaluates to `true` are returned, so expression atoms act as filters. 
By the way, we see that the order in which the rules are given is immaterial.
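
If it helps, you can think of a rule application followed by an expression atom as a comprehension with a filter. The following Python sketch is only an analogy: the helper `is_num` is a rough stand-in written for this example, not Cozo's own function.

```python
def is_num(value):
    """Rough stand-in for CozoScript's is_num: true for ints and floats only."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

# The constant rule `rule`, modelled as a set of 3-tuples.
rule = {(1, 2, 3), ("a", "b", "c")}

# ?[c, b] := rule[a, b, c], is_num(a)
result = {(c, b) for (a, b, c) in rule if is_num(a)}
print(result)  # {(3, 2)}
```

The pattern `(a, b, c)` plays the role of the fresh bindings, the `if` clause is the filter, and the tuple `(c, b)` is the rule head.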

You can also bind constants to rule applications directly:

In [30]:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b] := rule['a', b, c]

Unnamed: 0,c,b
0,c,b


You introduce additional bindings with the _unification operator_ `=`:

In [31]:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b, d] := rule[a, b, c], is_num(a), d = a + b + 2*c

Unnamed: 0,c,b,d
0,3,2,9


Having multiple rule applications in the body generates every combination of the bindings:

In [34]:
r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]

?[l1, l2] := r1[a, l1], 
             r2[b, l2]

Unnamed: 0,l1,l2
0,a,B
1,a,C
2,b,B
3,b,C


This corresponds to a Cartesian join in relational algebra. 
Notice the bindings in the rule applications are all distinct.
If bindings are reused, then we get the effect of _unification_,
which corresponds to joins in relational algebra:

In [35]:
r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]

?[l1, l2] := r1[a, l1], 
             r2[a, l2] # reused `a`

Unnamed: 0,l1,l2
0,b,B
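
The difference between distinct and reused bindings can be sketched in Python with nested comprehensions. This is an illustration of the semantics only; Cozo is free to evaluate joins in a far more efficient way.

```python
r1 = {(1, "a"), (2, "b")}
r2 = {(2, "B"), (3, "C")}

# Distinct bindings `a` and `b`: every combination of rows (Cartesian product).
cartesian = {(l1, l2) for (a, l1) in r1 for (b, l2) in r2}

# Reused binding `a`: only combinations whose first columns agree (a join).
joined = {(l1, l2) for (a, l1) in r1 for (b, l2) in r2 if a == b}

print(sorted(cartesian))  # [('a', 'B'), ('a', 'C'), ('b', 'B'), ('b', 'C')]
print(sorted(joined))     # [('b', 'B')]
```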


The explicit unification `=` unifies with a single value. There is another kind of unification that unifies values within a list. Observe:

In [69]:
?[x, y] := x in [1, 2, 3], y in ['x', 'y']

Unnamed: 0,x,y
0,1,x
1,1,y
2,2,x
3,2,y
4,3,x
5,3,y


For the head of inline rules, you do not need to use all variables that appear in the body. But whatever you use must appear in the body. This is called the _safety rule_:

In [36]:
r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]

?[l1, l2, x] := r1[a, l1], 
                r2[a, l2]

eval::unbound_symb_in_head

  × Symbol 'x' in rule head is unbound
   ╭─[3:1]
 3 │ 
 4 │ ?[l1, l2, x] := r1[a, l1], 
   ·           ─
 5 │                 r2[a, l2]
   ╰────
  help: Note that symbols occurring only in negated positions are not considered bound


## Stored relations

The constructs that we have introduced already cover most of what relational algebra can do. That's some economy of syntax! 
However, as a database, we need to know how to store data persistently. In Cozo, persistent relations are called _stored relations_.

There is no ceremony required at all when you want to store data in Cozo:

In [37]:
r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]

?[l1, l2] := r1[a, l1], 
             r2[a, l2]
    
:create stored {l1, l2}

Unnamed: 0,status
0,OK


The query itself is identical to the one which we have run before, except we have added the `:create` query option, instructing the system to store the result in a stored relation named `stored`, containing the columns `l1` and `l2`.

By the way, if you just want to create the relation without adding any data, you can omit the queries. No need to have an empty `?` query.

You can verify that you now have the required stored relation in your system by running a _system op_:

In [38]:
::relations

Unnamed: 0,name,arity,n_keys,n_non_keys,n_put_triggers,n_rm_triggers,n_replace_triggers
0,stored,2,2,0,0,0,0


You can also investigate the columns of the stored relation:

In [40]:
::relation columns stored 

Unnamed: 0,column,is_key,index,type,has_default
0,l1,True,0,Any?,False
1,l2,True,1,Any?,False


Stored relations can be used in a similar way to relations defined via inline rules or fixed rules. The only difference is that you prefix the relation name with a colon:

In [41]:
?[a, b] := :stored[a, b]

Unnamed: 0,a,b
0,b,B


Unlike relations defined inline, the columns of stored relations have fixed names. You can use this to your advantage by selectively referring to columns by name.
This is especially useful if you have a lot of columns:

In [42]:
?[a, b] := :stored{l2: b, l1: a}

Unnamed: 0,a,b
0,b,B


If you are fine with using the name of the column as the binding, a shorthand is available:

In [44]:
?[l2] := :stored{l2}

Unnamed: 0,l2
0,B


Inserting more data into a stored relation is easy with the `:put` query option:

In [45]:
?[l1, l2] <- [['e', 'E']]
    
:put stored {l1, l2}

Unnamed: 0,status
0,OK


In [46]:
?[l1, l2] := :stored[l1, l2]

Unnamed: 0,l1,l2
0,b,B
1,e,E


To remove rows, use the `:rm` query option:

In [47]:
?[l1, l2] <- [['e', 'E']]
    
:rm stored {l1, l2}

Unnamed: 0,status
0,OK


In [48]:
?[l1, l2] := :stored[l1, l2]

Unnamed: 0,l1,l2
0,b,B


You can get rid of a stored relation with the following:

In [49]:
::relation remove stored

Unnamed: 0,status
0,OK


In [50]:
::relations

Unnamed: 0,name,arity,n_keys,n_non_keys,n_put_triggers,n_rm_triggers,n_replace_triggers


As we have mentioned, every relation in Cozo is a tree. Stored relations are no exception.
So far, our trees store all the data in their keys.
You can instruct Cozo to only treat some of the data as keys, thereby indicating a _functional dependency_:

In [51]:
?[a, b, c] <- [[1, 'a', 'A'],
               [2, 'b', 'B'],
               [3, 'c', 'C'],
               [4, 'd', 'D']]

:create fd {a, b => c}

Unnamed: 0,status
0,OK


In [53]:
?[a, b, c] := :fd[a, b, c]

Unnamed: 0,a,b,c
0,1,a,A
1,2,b,B
2,3,c,C
3,4,d,D


Now if you insert another row with an existing key, that row will be updated:

In [55]:
?[a, b, c] <- [[3, 'c', 'CCCCCCC']]

:put fd {a, b => c}

Unnamed: 0,status
0,OK


In [58]:
?[a, b, c] := :fd[a, b, c]

Unnamed: 0,a,b,c
0,1,a,A
1,2,b,B
2,3,c,CCCCCCC
3,4,d,D
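
One way to picture a relation declared as `{a, b => c}` is as a dictionary from key columns to non-key columns. The Python sketch below is an analogy under that assumption; the helper `put` is a name made up for this example and only mimics the update-on-existing-key behaviour shown above.

```python
# Key columns (a, b) map to the non-key column c.
fd = {}

def put(a, b, c):
    """Insert the row, overwriting any row that has the same key (a, b)."""
    fd[(a, b)] = c

for row in [(1, "a", "A"), (2, "b", "B"), (3, "c", "C"), (4, "d", "D")]:
    put(*row)

put(3, "c", "CCCCCCC")  # existing key: the row is updated, not duplicated

print(len(fd))       # 4
print(fd[(3, "c")])  # CCCCCCC
```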


You can easily check whether a column is in a key position by looking at the `is_key` column in the following:

In [59]:
::relation columns fd

Unnamed: 0,column,is_key,index,type,has_default
0,a,True,0,Any?,False
1,b,True,1,Any?,False
2,c,False,2,Any?,False


You may have noticed that columns also have types and default values associated with them, and stored relations can have triggers. These are discussed in the [manual](https://cozodb.github.io/current/manual/stored.html).
We won't overload you with all the complexities in this tutorial.

Before continuing, let's remove the stored relation we introduced:

In [60]:
::relation remove fd

Unnamed: 0,status
0,OK


## Graphs

Now let's consider a graph, stored as a relation. Let's first make it a stored relation:

In [61]:
?[loving, loved] <- [['alice', 'eve'],
                     ['bob', 'alice'],
                     ['eve', 'alice'],
                     ['eve', 'bob'],
                     ['eve', 'charlie'],
                     ['charlie', 'eve'],
                     ['david', 'george'],
                     ['george', 'george']]

:replace love {loving, loved}

Unnamed: 0,status
0,OK


The graph we have created reads like "Alice loves Eve, Bob loves Alice", "nobody loves David, David loves George, but George only loves himself", and so on. 
Here we used `:replace` instead of `:create`. The difference is that if `love` already exists, it will be wiped and replaced with the new data given.

With the graph available, we can investigate competing interests:

In [62]:
?[loved_by_b_e] := :love['eve', loved_by_b_e], :love['bob', loved_by_b_e]

Unnamed: 0,loved_by_b_e
0,alice


So far we have only seen bodies consisting of a _conjunction_ of atoms. Disjunction is also available, by using the `or` keyword:

In [63]:
?[loved_by_b_e] := :love['eve', loved_by_b_e] or :love['bob', loved_by_b_e], 
                   loved_by_b_e != 'bob', 
                   loved_by_b_e != 'eve'

Unnamed: 0,loved_by_b_e
0,alice
1,charlie


Another way to write the same query is to have multiple definitions of the same rule, with different bodies:

In [64]:
?[loved_by_b_e] := :love['eve', loved_by_b_e], 
                   loved_by_b_e != 'bob', 
                   loved_by_b_e != 'eve'
?[loved_by_b_e] := :love['bob', loved_by_b_e], 
                   loved_by_b_e != 'bob', 
                   loved_by_b_e != 'eve'

Unnamed: 0,loved_by_b_e
0,alice
1,charlie


The first way of writing the query (using `or`) is just syntax sugar for the second way. When you have multiple definitions of the same inline rule, the rule heads must be compatible. Fixed rules cannot have multiple definitions.
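
Seen from the outside, several definitions of the same rule behave like a set union of their bodies. Here is a Python sketch of that idea, with the `love` graph copied from the query above; the helper `loved_by` is a hypothetical name used only in this example.

```python
love = {
    ("alice", "eve"), ("bob", "alice"), ("eve", "alice"), ("eve", "bob"),
    ("eve", "charlie"), ("charlie", "eve"), ("david", "george"),
    ("george", "george"),
}

def loved_by(person):
    """One rule body: people loved by `person`, excluding bob and eve."""
    return {loved for (loving, loved) in love
            if loving == person and loved not in ("bob", "eve")}

# Two definitions of the same rule: the result is the union of both bodies.
loved_by_b_e = loved_by("eve") | loved_by("bob")
print(sorted(loved_by_b_e))  # ['alice', 'charlie']
```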

## Negation

The next example demonstrates filters using negated expressions, which should already be familiar now:

In [65]:
?[loved] := :love[person, loved], !ends_with(person, 'e')

Unnamed: 0,loved
0,alice
1,george


Rule applications can also be negated, not with the `!` operator, but with the `not` keyword instead:

In [66]:
?[loved_by_e_not_b] := :love['eve', loved_by_e_not_b], not :love['bob', loved_by_e_not_b]

Unnamed: 0,loved_by_e_not_b
0,bob
1,charlie


You can say that there are two sets of logical operations in Cozo, one set that acts on the level of expressions, and another set that acts on the level of atoms:

* For atoms: `,` or `and` (conjunction), `or` (disjunction), `not` (negation)
* For expressions: `&&` (conjunction), `||` (disjunction), `!` (negation)

The difference between `,` and `and` is operator precedence: `and` has higher precedence than `or`, whereas `,` has lower precedence than `or`.

Negation of atoms has to abide by the _safety rule_. Let's violate it:

In [67]:
?[not_loved_by_b] := not :love['bob', not_loved_by_b]

eval::unbound_symb_in_head

  × Symbol 'not_loved_by_b' in rule head is unbound
   ╭────
 1 │ ?[not_loved_by_b] := not :love['bob', not_loved_by_b]
   ·   ──────────────
   ╰────
  help: Note that symbols occurring only in negated positions are not considered bound


Why is this query not allowed? Well, what can it possibly return?
For example, should the query return 'gold', since according to the facts at hand, 
Bob has no interest in 'gold'? 
So should our query return every possible string except a select few? 
That's not reasonable.

To make our query reasonable, we have to explicitly give our query a _closed world_ in which to operate the negation:

In [68]:
the_population[p] := :love[p, _a]
the_population[p] := :love[_a, p]

?[not_loved_by_b] := the_population[not_loved_by_b], not :love['bob', not_loved_by_b]

Unnamed: 0,not_loved_by_b
0,bob
1,charlie
2,david
3,eve
4,george
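
In set terms, negation inside a closed world is just a set difference. The following Python sketch reproduces the query above on the same `love` graph; it is an analogy, not a description of how Cozo evaluates negation.

```python
love = {
    ("alice", "eve"), ("bob", "alice"), ("eve", "alice"), ("eve", "bob"),
    ("eve", "charlie"), ("charlie", "eve"), ("david", "george"),
    ("george", "george"),
}

# the_population: everyone who appears on either side of a `love` row.
the_population = {person for pair in love for person in pair}

loved_by_bob = {loved for (loving, loved) in love if loving == "bob"}

# Negation is only meaningful relative to a finite, explicitly given set.
not_loved_by_b = the_population - loved_by_bob
print(sorted(not_loved_by_b))  # ['bob', 'charlie', 'david', 'eve', 'george']
```

Without `the_population` there would be nothing to subtract from, which is exactly why the unsafe version of the query is rejected.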


## Recursion

Inline rules can refer to other rules by applying them. Inline rules can have multiple definitions. If you combine these two, you get recursion:

In [70]:
alice_love_chain[person] := :love['alice', person]
alice_love_chain[person] := alice_love_chain[in_person], :love[in_person, person]

?[chained] := alice_love_chain[chained]

Unnamed: 0,chained
0,alice
1,bob
2,charlie
3,eve


Someone "chained" is either loved by Alice directly or loved by someone already in the chain. The query as written reads very naturally.

You may object that you only need to be able to refer to other rules by applying them to have recursion, and multiple definitions are not required. Technically, true, but the resulting queries are not useful. Observe:

In [71]:
alice_love_chain[person] := alice_love_chain[in_person], :love[in_person, person]

?[chained] := alice_love_chain[chained]

Unnamed: 0,chained


This is the _closed-world assumption_. If there is no way to _deduce_ a fact from the given facts, _then_ the fact itself is false. You need multiple definitions to "bootstrap" the query.
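
To see why the base case matters, it may help to evaluate the two versions of `alice_love_chain` by hand. The Python sketch below uses a naive fixed-point loop: keep applying the rules until nothing new can be deduced. This is a textbook strategy chosen for clarity and is not claimed to be the algorithm Cozo uses; the function `love_chain` and its `with_base_case` flag are hypothetical names.

```python
love = {
    ("alice", "eve"), ("bob", "alice"), ("eve", "alice"), ("eve", "bob"),
    ("eve", "charlie"), ("charlie", "eve"), ("david", "george"),
    ("george", "george"),
}

def love_chain(start, with_base_case=True):
    """Naive fixed-point evaluation of the alice_love_chain rules."""
    chain = set()
    while True:
        derived = set()
        if with_base_case:
            # alice_love_chain[person] := :love['alice', person]
            derived |= {loved for (loving, loved) in love if loving == start}
        # alice_love_chain[person] := alice_love_chain[in_person],
        #                             :love[in_person, person]
        derived |= {loved for (loving, loved) in love if loving in chain}
        if derived <= chain:  # nothing new can be deduced: fixed point reached
            return chain
        chain |= derived

print(sorted(love_chain("alice")))                # ['alice', 'bob', 'charlie', 'eve']
print(love_chain("alice", with_base_case=False))  # set()
```

With only the recursive definition, the loop starts from an empty set and the recursive step has nothing to extend, so the result stays empty.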

You can do crazy things with recursion and negation. Fortunately, Cozo will try to stop you when you want to run something unreasonable:

In [72]:
world[a] := a in [1, 2]

p[a] := world[a], not q[a]
q[a] := world[a], not p[a]

?[a] := p[a]

eval::unstratifiable

  × Query is unstratifiable
  help: The rule 'q' is in the strongly connected component ["p", "q"],
        and is involved in at least one forbidden dependency
        (negation, non-meet aggregation, or algorithm-application).


Never mind the error message. If you consider the query as an equation to be solved, then `p[a] <- [[1]]` and `q[a] <- [[2]]` is a solution. But there is no way to _deduce_ this solution constructively. Furthermore, `q[a] <- [[1]]` and `p[a] <- [[2]]` is also a solution which is incompatible with the first.

## Aggregation

For computing statistics, _aggregations_ are useful. In Cozo, aggregations are applied in the head of inline rules:

In [73]:
?[person, count(loved_by)] := :love[loved_by, person]

Unnamed: 0,person,count(loved_by)
0,alice,2
1,bob,1
2,charlie,1
3,eve,2
4,george,2


The usual `sum`, `mean`, etc. are all available. Having aggregations apply in the head of the rule instead of in the body is powerful, as we will see later in the extended examples.

Here is the [full list](https://cozodb.github.io/current/manual/aggregations.html) of aggregations.
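
An aggregation in the rule head groups rows by the non-aggregated head variables. For the `count` query above, a Python analogy (again on the same `love` graph, and only as an illustration of the semantics) could look like this:

```python
from collections import Counter

love = {
    ("alice", "eve"), ("bob", "alice"), ("eve", "alice"), ("eve", "bob"),
    ("eve", "charlie"), ("charlie", "eve"), ("david", "george"),
    ("george", "george"),
}

# ?[person, count(loved_by)] := :love[loved_by, person]
# Group by `person` and count the rows in each group.
counts = Counter(person for (loved_by, person) in love)

summary = sorted(counts.items())
print(summary)
# [('alice', 2), ('bob', 1), ('charlie', 1), ('eve', 2), ('george', 2)]
```

Note that David does not appear at all: nobody loves him, so there is no row to put in his group.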

## Query options

We already know how to use query options to manipulate stored relations. There are also query options for controlling what is returned. For example:

In [74]:
?[loving, loved] := :love{ loving, loved }

Unnamed: 0,loving,loved
0,alice,eve
1,bob,alice
2,charlie,eve
3,david,george
4,eve,alice
5,eve,bob
6,eve,charlie
7,george,george


returns all rows. If we only want one row:

In [75]:
?[loving, loved] := :love{ loving, loved }

:limit 1

Unnamed: 0,loving,loved
0,alice,eve


To sort by `loved` in descending order, then by `loving` in ascending order, and skip the first row:

In [83]:
?[loving, loved] := :love{ loving, loved }

:order -loved, loving
:offset 1

Unnamed: 0,loving,loved
0,george,george
1,alice,eve
2,charlie,eve
3,eve,charlie
4,eve,bob
5,bob,alice
6,eve,alice


Putting `-` in front of a variable in the `:order` clause denotes descending order. Nothing or `+` denotes ascending order.
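
The combined effect of `:order -loved, loving` and `:offset 1` can be reproduced in Python with two stable sorts and a slice. This sketch only mirrors the result of the query above and makes no claim about how Cozo sorts internally.

```python
love = [
    ("alice", "eve"), ("bob", "alice"), ("eve", "alice"), ("eve", "bob"),
    ("eve", "charlie"), ("charlie", "eve"), ("david", "george"),
    ("george", "george"),
]

# Sort by the least significant key first, then by the most significant one.
# Python's sort is stable, so ties on `loved` keep their `loving` order.
rows = sorted(love, key=lambda row: row[0])                # loving, ascending
rows = sorted(rows, key=lambda row: row[1], reverse=True)  # loved, descending

page = rows[1:]  # :offset 1 skips the first row
for loving, loved in page:
    print(loving, loved)
```

A `:limit n` option would correspond to taking a further slice, for example `rows[1:1 + n]`.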

There are many more query options, as explained [here](https://cozodb.github.io/current/manual/queries.html#query-options).

## Fixed rules

You may be wondering why we are calling rules defined with `:=` _inline_ rules. 
Well, the logic that defines how the output relation is computed is given _inline_, as a series of atoms.

By contrast, rules defined using `<-` are called _constant_ rules, which are special cases of _fixed rules_:
rules whose logic is defined in fixed implementations hidden from the user.

The `<-` syntax is syntax sugar. The full syntax is:

In [79]:
?[] <~ Constant(data: [['hello', 'world', 'Cozo!']])

Unnamed: 0,0,1,2
0,hello,world,Cozo!


Here we are using the fixed rule `Constant`, which takes one _option_ named `data`. Note the curly tail of the arrow.

Fixed rules take in some input relations, and by applying custom logic, produce their output relation. The `Constant` fixed rule takes in zero input relations.

As an example of a less trivial fixed rule, let's say we want to find out who is most popular in the `love` graph. How do we define popularity? 
One way is to say that the higher [PageRank](https://en.wikipedia.org/wiki/PageRank) a person has, the more popular. Calculating PageRank using inline rules
is very awkward (but doable). Fortunately, one of the fixed rules is an optimized PageRank implementation, so let's just use it:

In [81]:
?[person, page_rank] <~ PageRank(:love[])

:order -page_rank

Unnamed: 0,person,page_rank
0,alice,1.191497
1,eve,1.191497
2,george,1.064742
3,bob,0.921087
4,charlie,0.921087
5,david,0.574623


Here the input relation is a stored relation. Input relations are distinguished from options by not having any names preceding them.

Each fixed rule is different, and you must read their [documentation](https://cozodb.github.io/current/manual/algorithms.html) to learn how to correctly use them.

In [101]:
::relation remove love

Unnamed: 0,status
0,OK


## Extended example: the air routes dataset

Now that you have a basic understanding of the various constructs of Cozo, let's deal with a less trivial dataset.

The data we are going to use, and many examples that we will present, are adapted from the book [Practical Gremlin](https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html), which teaches the Gremlin graph query language, a very different, imperative take on graphs (Datalog, by contrast, is declarative).

First, let's import the data into our database. We will use fixed rules to do that. We start by defining the `airport` relation:

In [90]:
res[idx, label, typ, code, icao, desc, region, runways, longest, elev, country, city, lat, lon] <~
    CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any', 'Any', 'Int?', 'Float?', 'Float?', 'Any', 'Any', 'Float?', 'Float?'],
              url: 'https://github.com/cozodb/cozo/raw/main/tests/air-routes-latest-nodes.csv', 
              # url: 'file://./tests/air-routes-latest-nodes.csv', 
              has_headers: true)

?[code, icao, desc, region, runways, longest, elev, country, city, lat, lon] :=
    res[idx, label, typ, code, icao, desc, region, runways, longest, elev, country, city, lat, lon],
    label == 'airport'

:replace airport {
    code: String 
    => 
    icao: String, 
    desc: String, 
    region: String, 
    runways: Int, 
    longest: Float, 
    elev: Float, 
    country: String, 
    city: String, 
    lat: Float, 
    lon: Float
}

Unnamed: 0,status
0,OK


The `CsvReader` utility downloads a CSV file from the internet and attempts to parse its content into a relation.
When we stored the relation, we specified types for the columns. The `code` column acts as a primary key for the `airport` stored relation.

If your Internet connection is slow, it might help if you download the CSV file manually to your disk and load the local file. 
The line commented out shows how to do it. The relative path is relative to the directory in which you run the `cozoserver` executable.
As the same file will be downloaded multiple times below, you may also want to download it just once to the local disk if your connection is metered.

Next is `country`:

In [94]:
res[idx, label, typ, code, icao, desc] <~
    CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any'],
              url: 'https://github.com/cozodb/cozo/raw/main/tests/air-routes-latest-nodes.csv', 
              # url: 'file://./tests/air-routes-latest-nodes.csv', 
              has_headers: true)
?[code, desc] :=
    res[idx, label, typ, code, icao, desc],
    label == 'country'

:replace country {
    code: String
    =>
    desc: String
}

Unnamed: 0,status
0,OK


`continent`:

In [95]:
res[idx, label, typ, code, icao, desc] <~
    CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any'],
              url: 'https://github.com/cozodb/cozo/raw/main/tests/air-routes-latest-nodes.csv', 
              # url: 'file://./tests/air-routes-latest-nodes.csv', 
              has_headers: true)
?[idx, code, desc] :=
    res[idx, label, typ, code, icao, desc],
    label == 'continent'

:replace continent {
    code: String
    =>
    desc: String
}

Unnamed: 0,status
0,OK


We need to make a translation table for the indices the original data use:

In [96]:
res[idx, label, typ, code] <~
    CsvReader(types: ['Int', 'Any', 'Any', 'Any'],
              url: 'https://github.com/cozodb/cozo/raw/main/tests/air-routes-latest-nodes.csv', 
              # url: 'file://./tests/air-routes-latest-nodes.csv', 
              has_headers: true)
?[idx, code] :=
    res[idx, label, typ, code]

:replace idx2code { idx => code }

Unnamed: 0,status
0,OK


The `contain` relation contains information on the geographical inclusion of entities:

In [98]:
res[] <~
    CsvReader(types: ['Int', 'Int', 'Int', 'String'],
              url: 'https://github.com/cozodb/cozo/raw/main/tests/air-routes-latest-edges.csv', 
              # url: 'file://./tests/air-routes-latest-edges.csv', 
              has_headers: true)
?[entity, contained] :=
    res[idx, fr_i, to_i, typ],
    typ == 'contains',
    :idx2code[fr_i, entity],
    :idx2code[to_i, contained]


:replace contain { entity: String, contained: String }

Unnamed: 0,status
0,OK


Finally, the `route`s between the airports. This relation is much larger than the rest and contains about 60k rows, which may take a few seconds to download and process:

In [99]:
res[] <~
    CsvReader(types: ['Int', 'Int', 'Int', 'String', 'Float?'],
              url: 'https://github.com/cozodb/cozo/raw/main/tests/air-routes-latest-edges.csv', 
              # url: 'file://./tests/air-routes-latest-edges.csv', 
              has_headers: true)
?[fr, to, dist] :=
    res[idx, fr_i, to_i, typ, dist],
    typ == 'route',
    :idx2code[fr_i, fr],
    :idx2code[to_i, to]

:replace route { fr: String, to: String => dist: Float }

Unnamed: 0,status
0,OK
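
The role of `idx2code` in the last two imports can be pictured as a dictionary lookup applied to both ends of every edge. In the Python sketch below, the index values and the placeholder code `XX` are made up for illustration; only the AAA to FAC distance is taken from the dataset as shown later in this tutorial.

```python
# Hypothetical translation table: the real CSV files use their own indices.
idx2code = {1: "AAA", 2: "FAC", 3: "XX"}

# Edge rows shaped like the CSV columns: (idx, fr_i, to_i, typ, dist).
edges = [
    (10, 1, 2, "route", 48.0),
    (11, 3, 1, "contains", None),  # not a route, filtered out below
]

# ?[fr, to, dist] := res[idx, fr_i, to_i, typ, dist], typ == 'route',
#                    :idx2code[fr_i, fr], :idx2code[to_i, to]
route = {
    (idx2code[fr_i], idx2code[to_i]): dist
    for (_idx, fr_i, to_i, typ, dist) in edges
    if typ == "route"
}
print(route)  # {('AAA', 'FAC'): 48.0}
```

Using `(fr, to)` as the dictionary key mirrors the declaration `{ fr: String, to: String => dist: Float }`, where `dist` is the only non-key column.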


We no longer need the `idx2code` relation:

In [103]:
::relation remove idx2code

Unnamed: 0,status
0,OK


Now let's verify all the relations we want are there:

In [104]:
::relations

Unnamed: 0,name,arity,n_keys,n_non_keys,n_put_triggers,n_rm_triggers,n_replace_triggers
0,airport,11,1,10,0,0,0
1,contain,2,2,0,0,0,0
2,continent,2,1,1,0,0,0
3,country,2,1,1,0,0,0
4,route,3,2,1,0,0,0


Now let's just look at some data. Start with airports:

In [106]:
?[code, city, desc, region, runways, lat, lon] := :airport{code, city, desc, region, runways, lat, lon}
    
:limit 5

Unnamed: 0,code,city,desc,region,runways,lat,lon
0,AAA,Anaa,Anaa Airport,PF-U-A,1,-17.3526,-145.509995
1,AAE,Annabah,Annaba Airport,DZ-36,2,36.822201,7.80917
2,AAL,Aalborg,Aalborg Airport,DK-81,2,57.092759,9.849243
3,AAN,Al Ain,Al Ain International Airport,AE-AZ,1,24.2617,55.6092
4,AAQ,Anapa,Anapa Airport,RU-KDA,1,45.002102,37.347301


Airports with the most runways:

In [107]:
?[code, city, desc, region, runways, lat, lon] := :airport{code, city, desc, region, runways, lat, lon}

:order -runways
:limit 10

Unnamed: 0,code,city,desc,region,runways,lat,lon
0,DFW,Dallas,Dallas/Fort Worth International Airport,US-TX,7,32.896801,-97.038002
1,ORD,Chicago,Chicago O'Hare International Airport,US-IL,7,41.9786,-87.9048
2,AMS,Amsterdam,Amsterdam Airport Schiphol,NL-NH,6,52.308601,4.76389
3,BOS,Boston,Boston Logan,US-MA,6,42.3643,-71.005203
4,DEN,Denver,Denver International Airport,US-CO,6,39.861698,-104.672997
5,DTW,Detroit,"Detroit Metropolitan, Wayne County",US-MI,6,42.212399,-83.353401
6,ATL,Atlanta,Hartsfield - Jackson Atlanta International Airport,US-GA,5,33.6367,-84.428101
7,GIS,Gisborne,Gisborne Airport,NZ-GIS,5,-38.6633,177.977997
8,HLZ,Hamilton,Hamilton International Airport,NZ-WKO,5,-37.866699,175.332001
9,IAH,Houston,George Bush Intercontinental,US-TX,5,29.9844,-95.3414


How many airports are there in total?

In [108]:
?[count(code)] := :airport{code}

Unnamed: 0,count(code)
0,3504


Let's get a distribution of the initials of the airport codes:

In [109]:
?[count(initial), initial] := :airport{code}, initial = first(chars(code))

:order initial

Unnamed: 0,count(initial),initial
0,212,A
1,235,B
2,214,C
3,116,D
4,95,E
5,76,F
6,135,G
7,129,H
8,112,I
9,80,J


More useful are the statistics of runways:

In [110]:
?[count(r), count_unique(r), sum(r), min(r), max(r), mean(r), std_dev(r)] := 
    :airport{runways: r}

Unnamed: 0,count(r),count_unique(r),sum(r),min(r),max(r),mean(r),std_dev(r)
0,3504,7,4980.0,1,7,1.421233,0.743083


Using `country`, we can find countries with no airports:

In [128]:
?[desc] := :country{code, desc}, not :airport{country: code}

Unnamed: 0,desc
0,Andorra
1,Liechtenstein
2,Monaco
3,Pitcairn
4,San Marino


The `route` relation by itself is rather boring:

In [116]:
?[fr, to, dist] := :route{fr, to, dist}

:limit 10

Unnamed: 0,fr,to,dist
0,AAA,FAC,48.0
1,AAA,MKP,133.0
2,AAA,PPT,270.0
3,AAA,RAR,968.0
4,AAE,ALG,254.0
5,AAE,CDG,882.0
6,AAE,IST,1161.0
7,AAE,LYS,631.0
8,AAE,MRS,477.0
9,AAE,ORN,477.0


It just records the starting and ending airports of each route, together with the distance. This relation only becomes useful when used as a graph.

Airports with no routes:

In [130]:
?[code, desc] := :airport{code, desc}, not :route{fr: code}, not :route{to: code}

Unnamed: 0,code,desc
0,AFW,Fort Worth Alliance Airport
1,APA,Centennial Airport
2,APK,Apataki Airport
3,BID,Block Island State Airport
4,BVS,Breves Airport
5,BWU,Sydney Bankstown Airport
6,CRC,Santa Ana Airport
7,CVT,Coventry Airport
8,EKA,Murray Field
9,GYZ,Gruyere Airport


Airports with the most outgoing routes:

In [133]:
route_count[fr, count(fr)] := :route{fr}
?[code, n] := route_count[code, n]

:sort -n
:limit 5

Unnamed: 0,code,n
0,FRA,310
1,IST,309
2,CDG,293
3,AMS,283
4,MUC,270
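
The rule `route_count` groups routes by their starting airport and counts the rows in each group. A minimal Python sketch of the same grouping is shown below; the route list and the helper name `top_departure_airports` are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical (fr, to) pairs, not the tutorial dataset.
routes = [("FRA", "LHR"), ("FRA", "CDG"), ("LHR", "FRA"),
          ("FRA", "AMS"), ("LHR", "AMS")]


def top_departure_airports(routes, limit):
    """Count routes per starting airport and return the largest counts first."""
    counts = Counter(fr for fr, _to in routes)
    # Sort by descending count; break ties by airport code for a stable result.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


print(top_departure_airports(routes, 5))
```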


How many routes are there from the European Union to the US?

In [134]:
routes[unique(r)] := :contain['EU', fr],
                     :route{fr, to},
                     :airport{code: to, country: 'US'},
                     r = [fr, to]
?[n] := routes[rs], n = length(rs)

Unnamed: 0,n
0,435


How many airports are there in the US with routes from the EU?

In [135]:
?[count_unique(to)] := :contain['EU', fr],
                       :route{fr, to},
                       :airport{code: to, country: 'US'}


Unnamed: 0,count_unique(to)
0,45


How many routes are there for each airport in London, UK?

In [136]:
?[code, count(code)] := :airport{code, city: 'London', region: 'GB-ENG'}, :route{fr: code}

Unnamed: 0,code,count(code)
0,LCY,51
1,LGW,232
2,LHR,221
3,LTN,130
4,STN,211


We need to specify the region because there is another city called London outside the UK.

How many airports are reachable from London, UK in two hops?

In [137]:
lon_uk_airports[code] := :airport{code, city: 'London', region: 'GB-ENG'}
one_hop[to] := lon_uk_airports[fr], :route{fr, to}, not lon_uk_airports[to];
?[count_unique(a3)] := one_hop[a2], :route{fr: a2, to: a3}, not lon_uk_airports[a3];

Unnamed: 0,count_unique(a3)
0,2353
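
The query above chains two joins on `route` and excludes the London airports themselves at each step. The following Python sketch mirrors that logic with sets on a made-up graph; the function name `two_hop_count` and the sample data are assumptions for illustration only.

```python
def two_hop_count(routes, start_airports):
    """Count distinct airports reachable in exactly two hops from the start set,
    excluding the start airports themselves at each step."""
    start = set(start_airports)
    # First hop: leave the start set.
    one_hop = {to for fr, to in routes if fr in start and to not in start}
    # Second hop: continue from any first-hop airport.
    two_hop = {to for fr, to in routes if fr in one_hop and to not in start}
    return len(two_hop)


# Hypothetical (fr, to) pairs; "L1" and "L2" stand in for the London airports.
routes = [("L1", "A"), ("L1", "L2"), ("L2", "B"), ("A", "C"),
          ("A", "L1"), ("B", "C"), ("B", "A"), ("C", "D")]
print(two_hop_count(routes, ["L1", "L2"]))
```

Note that, as in the query, an airport that is reachable in one hop can also appear in the two-hop result.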


Which cities directly reachable from LGW are the furthest away?

In [140]:
?[city, dist] := :route{fr: 'LGW', to, dist},
                 :airport{code: to, city}
:order -dist
:limit 10

Unnamed: 0,city,dist
0,Buenos Aires,6908.0
1,Singapore,6751.0
2,Langkawi,6299.0
3,Duong Dong,6264.0
4,Taipei,6080.0
5,Port Louis,6053.0
6,Rayong,6008.0
7,Cape Town,5987.0
8,Hong Kong,5982.0
9,Shanghai,5745.0


What airports are within 0.1 degrees of the Greenwich meridian?

In [144]:
?[code, desc, lon, lat] := :airport{lon, lat, code, desc}, lon > -0.1, lon < 0.1

Unnamed: 0,code,desc,lon,lat
0,CDT,Castellon De La Plana Airport,0.026111,39.999199
1,LCY,London City Airport,0.055278,51.505278
2,LDE,Tarbes-Lourdes-Pyrénées Airport,-0.006439,43.178699
3,LEH,Le Havre Octeville Airport,0.088056,49.533901


Airports in a box drawn around London Heathrow, UK:

In [147]:
h_box[lon, lat] := :airport{code: 'LHR', lon, lat}
?[code, desc] := h_box[lhr_lon, lhr_lat], :airport{code, lon, lat, desc},
                 abs(lhr_lon - lon) < 1, abs(lhr_lat - lat) < 1

Unnamed: 0,code,desc
0,LCY,London City Airport
1,LGW,London Gatwick
2,LHR,London Heathrow
3,LTN,London Luton Airport
4,SOU,Southampton Airport
5,STN,London Stansted Airport


For some spherical geometry: what is the angle subtended at the center of the earth by SFO and NRT?

In [153]:
?[deg_diff] := :airport{code: 'SFO', lat: a_lat, lon: a_lon},
               :airport{code: 'NRT', lat: b_lat, lon: b_lon},
               deg_diff = rad_to_deg(haversine_deg_input(a_lat, a_lon, b_lat, b_lon))

Unnamed: 0,deg_diff
0,73.992112
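
The query converts the result of `haversine_deg_input` with `rad_to_deg`, which suggests that the function takes degrees as input and returns an angle in radians. For readers who want to see the underlying formula, here is a minimal Python sketch of the standard haversine computation of a central angle. The function name `haversine_angle_deg` is made up for this example, and it is not claimed to match Cozo's implementation digit for digit.

```python
import math


def haversine_angle_deg(lat1, lon1, lat2, lon2):
    """Central angle in degrees between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to [0, 1] to guard against floating-point rounding.
    h = min(1.0, max(0.0, h))
    return math.degrees(2 * math.asin(math.sqrt(h)))


# Two points on the equator, a quarter of the way around the globe apart.
print(haversine_angle_deg(0.0, 0.0, 0.0, 90.0))
```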


We mentioned before that aggregations in Cozo are powerful, more so than in traditional SQL databases. The power comes from the fact that aggregations can be used in recursions (some restrictions apply).

Let's say we want to find the distance of the _shortest route_ between two airports. One way to calculate it is to enumerate all the routes between the two airports, and then apply the `min` aggregation to the results. This cannot be implemented as stated, since the routes may contain cycles and hence there can be an infinite number of routes between two airports.

Instead, let's think recursively. If we already have all the shortest routes between all nodes, can we derive an _equation_ satisfied by the shortest route? Yes: the shortest distance between `a` and `b` is either the distance of a direct route, or the sum of the shortest distance from `a` to some intermediate airport `c` and the distance of a direct route from `c` to `b`. We apply our `min` aggregation to this recursive set instead.

Let's write it out and try to find the shortest route between the airports `LHR` and `YPO`:

In [120]:
shortest[b, min(dist)] := :route{fr: 'LHR', to: b, dist} 
                          # Start with the airport 'LHR', retrieve a direct route from 'LHR' to b

shortest[b, min(dist)] := shortest[c, d1], # Start with an existing shortest route from 'LHR' to c
                          :route{fr: c, to: b, dist: d2},  # Retrieve a direct route from c to b
                          dist = d1 + d2 # Add the distances

?[dist] := shortest['YPO', dist] # Extract the answer for 'YPO'. 
                                 # We chose it since it is the hardest airport to get to from 'LHR'.

Unnamed: 0,dist
0,4147.0
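
To see why the recursion with `min` terminates even though the graph has cycles, it can help to spell out the fixed-point iteration by hand. The Python sketch below is an illustration on a made-up graph, not a description of how Cozo evaluates the rules internally; the function name `shortest_from` and the sample routes are assumptions. It assumes all distances are positive.

```python
def shortest_from(start, routes):
    """routes: iterable of (fr, to, dist) with positive distances.
    Returns a dict mapping each reachable airport to its minimal distance."""
    inf = float("inf")
    shortest = {}
    # Base rule: direct routes from the starting airport.
    for fr, to, dist in routes:
        if fr == start and dist < shortest.get(to, inf):
            shortest[to] = dist
    # Recursive rule: extend a known shortest distance by one direct route,
    # keeping only the minimum, until nothing improves any more.
    changed = True
    while changed:
        changed = False
        for fr, to, d2 in routes:
            if fr in shortest:
                candidate = shortest[fr] + d2
                if candidate < shortest.get(to, inf):
                    shortest[to] = candidate
                    changed = True
    return shortest


# Hypothetical routes containing the cycle A -> B -> C -> A.
routes = [("A", "B", 1.0), ("B", "C", 2.0), ("C", "A", 1.0),
          ("A", "C", 5.0), ("C", "D", 1.0)]
print(shortest_from("A", routes))
```

Going around the cycle again can only produce longer distances, so the `min` values stop changing after finitely many rounds. As in the rules above, the starting airport itself gets an entry only through a round trip.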


It works. Since path-finding is such a common operation on graphs, Cozo has several fixed rules for that:

In [123]:
starting[] <- [['LHR']]
goal[] <- [['YPO']]
?[starting, goal, distance, path] <~ ShortestPathDijkstra(:route[], starting[], goal[])

Unnamed: 0,starting,goal,distance,path
0,LHR,YPO,4147.0,"['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"


Not only is it more efficient, but we also get a path for the shortest route.
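
For intuition about what a Dijkstra-style search computes, here is a minimal Python sketch that returns both the distance and the path. It is a textbook version on a made-up graph, not Cozo's implementation; the function name `dijkstra` and the sample routes are assumptions for this example, and distances are assumed to be non-negative.

```python
import heapq


def dijkstra(routes, start, goal):
    """Return (distance, path) for the shortest route, or None if unreachable."""
    graph = {}
    for fr, to, dist in routes:
        graph.setdefault(fr, []).append((to, dist))
    inf = float("inf")
    best = {start: 0.0}   # best known distance per airport
    prev = {}             # predecessor on the best known route
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            # Walk the predecessor links back to the start.
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return d, path[::-1]
        if d > best.get(node, inf):
            continue  # stale heap entry
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < best.get(nxt, inf):
                best[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    return None


# Hypothetical routes, not the tutorial dataset.
routes = [("A", "B", 1.0), ("B", "C", 2.0), ("C", "A", 1.0),
          ("A", "C", 5.0), ("C", "D", 1.0)]
print(dijkstra(routes, "A", "D"))
```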

If a single shortest path is not enough, the following calculates the ten shortest paths:

In [125]:
starting[] <- [['LHR']]
goal[] <- [['YPO']]
?[starting, goal, distance, path] <~ KShortestPathYen(:route[], starting[], goal[], k: 10)

Unnamed: 0,starting,goal,distance,path
0,LHR,YPO,4147.0,"['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
1,LHR,YPO,4150.0,"['LHR', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
2,LHR,YPO,4164.0,"['LHR', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
3,LHR,YPO,4167.0,"['LHR', 'DUB', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
4,LHR,YPO,4187.0,"['LHR', 'MAN', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
5,LHR,YPO,4202.0,"['LHR', 'IOM', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
6,LHR,YPO,4204.0,"['LHR', 'MAN', 'DUB', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
7,LHR,YPO,4209.0,"['LHR', 'YUL', 'YMT', 'YNS', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
8,LHR,YPO,4211.0,"['LHR', 'MAN', 'IOM', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
9,LHR,YPO,4212.0,"['LHR', 'DUB', 'YUL', 'YMT', 'YNS', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"


On the other hand, if efficiency is really important to you, you can use the A* algorithm with a really good heuristic function:

In [127]:
code_lat_lon[code, lat, lon] := :airport{code, lat, lon}
starting[code, lat, lon] := code = 'LHR', :airport{code, lat, lon};
goal[code, lat, lon] := code = 'YPO', :airport{code, lat, lon};
?[] <~ ShortestPathAStar(:route[], code_lat_lon[node, lat1, lon1], starting[], goal[goal, lat2, lon2], heuristic: haversine_deg_input(lat1, lon1, lat2, lon2) * 3963);

Unnamed: 0,0,1,2,3
0,LHR,YPO,4147.0,"['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"


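There's a lot more setup required in this case: we need to retrieve the latitudes and longitudes of airports and do processing on them first.
The number `3963` above is the radius of the earth in miles; multiplying the angle returned by `haversine_deg_input` by it gives the great-circle distance in miles, which serves as the heuristic.
See the [documentation of `ShortestPathAStar`](https://cozodb.github.io/current/manual/algorithms.html#Algo.ShortestPathAStar) for what is going on.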
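
To illustrate the role of the heuristic, here is a minimal Python sketch of A* that uses the great-circle distance to the goal as its estimate of the remaining distance. It is a textbook version, not Cozo's implementation. The names `a_star` and `great_circle_miles`, the coordinates, and the routes are all made up for this example. The sketch assumes every route is at least as long as the great-circle distance between its endpoints, so that the heuristic never overestimates.

```python
import heapq
import math

EARTH_RADIUS_MILES = 3963  # the same constant as in the query above


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in miles; inputs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    h = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return EARTH_RADIUS_MILES * 2 * math.asin(math.sqrt(min(1.0, h)))


def a_star(routes, coords, start, goal):
    """Return (distance, path) for the shortest route, or None if unreachable."""
    graph = {}
    for fr, to, dist in routes:
        graph.setdefault(fr, []).append((to, dist))
    goal_lat, goal_lon = coords[goal]

    def estimate(node):
        lat, lon = coords[node]
        return great_circle_miles(lat, lon, goal_lat, goal_lon)

    inf = float("inf")
    best = {start: 0.0}
    prev = {}
    # Heap entries: (distance so far + estimate, distance so far, airport).
    heap = [(estimate(start), 0.0, start)]
    while heap:
        _f, d, node = heapq.heappop(heap)
        if node == goal:
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return d, path[::-1]
        if d > best.get(node, inf):
            continue  # stale heap entry
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < best.get(nxt, inf):
                best[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd + estimate(nxt), nd, nxt))
    return None


# Hypothetical airports on the equator, 10 degrees of longitude apart
# (about 692 miles), with route lengths slightly above that.
coords = {"A": (0.0, 0.0), "B": (0.0, 10.0), "C": (0.0, 20.0), "D": (0.0, 30.0)}
routes = [("A", "B", 700.0), ("B", "C", 700.0), ("A", "C", 1500.0),
          ("C", "D", 700.0), ("A", "D", 2500.0)]
print(a_star(routes, coords, "A", "D"))
```

The better the estimate approximates the true remaining distance without exceeding it, the fewer airports the search has to expand before reaching the goal.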

The most important airports, by PageRank:

In [158]:
rank[code, score] <~ PageRank(:route[a, b])
?[code, desc, score] := rank[code, score], :airport{code, desc}

:limit 10;
:order -score

Unnamed: 0,code,desc,score
0,FRA,Frankfurt am Main,1.265292
1,IST,Istanbul International Airport,1.260846
2,CDG,Paris Charles de Gaulle,1.251049
3,AMS,Amsterdam Airport Schiphol,1.243261
4,MUC,Munich International Airport,1.230537
5,ORD,Chicago O'Hare International Airport,1.220283
6,DFW,Dallas/Fort Worth International Airport,1.208827
7,DXB,Dubai International Airport,1.20843
8,PEK,Beijing Capital International Airport,1.208074
9,ATL,Hartsfield - Jackson Atlanta International Airport,1.199858
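
As background on what a PageRank score measures, here is a minimal Python sketch of the textbook power iteration. It is not Cozo's implementation: the damping factor, the iteration count, the treatment of airports without outgoing routes, and the normalization (scores here sum to 1) are assumptions of this sketch, so the numbers are not comparable to the output above. The function name `pagerank` and the sample edges are made up.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration; returns a dict node -> score."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {v: [] for v in nodes}
    for a, b in edges:
        out[a].append(b)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            if out[v]:
                # Split this node's score evenly among its outgoing edges.
                share = damping * rank[v] / len(out[v])
                for w in out[v]:
                    new[w] += share
            else:
                # A node without outgoing edges spreads its score evenly.
                share = damping * rank[v] / n
                for w in nodes:
                    new[w] += share
        rank = new
    return rank


# Hypothetical star-shaped network: every route goes through "HUB".
edges = [("A", "HUB"), ("B", "HUB"), ("C", "HUB"),
         ("HUB", "A"), ("HUB", "B"), ("HUB", "C")]
scores = pagerank(edges)
print(max(scores, key=scores.get))
```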


The following example takes a long time to run: it calculates the betweenness centrality, and algorithms for that have high computational complexity.

In [157]:
centrality[code, score] <~ BetweennessCentrality(:route[a, b])
?[code, desc, score] := centrality[code, score], :airport{code, desc}

:limit 10;
:order -score

Unnamed: 0,code,desc,score
0,ANC,Anchorage Ted Stevens,1074869.260952
1,KEF,"Reykjavik, Keflavik International Airport",928449.975037
2,HEL,Helsinki Ventaa,581588.490562
3,PEK,Beijing Capital International Airport,532020.4253
4,DEL,Indira Gandhi International Airport,472979.963291
5,IST,Istanbul International Airport,457882.076744
6,PKC,Yelizovo Airport,408571.027619
7,MSP,Minneapolis-St.Paul International Airport,396433.049206
8,LAX,Los Angeles International Airport,393310.114286
9,DEN,Denver International Airport,374339.835975


These are the airports that, if disconnected from the network, would cause the most disruption.

That's it for the tutorial. Continue with the [Manual](https://cozodb.github.io/current/manual/index.html) if you want more details.