# The distillation

In [1]:
%reload_ext pycozo.ipyext_direct
%cozo_auth tutorial *******

Welcome back! You already know how to use simple Datalog queries and stored relations in Cozo, and you have learned the intricacies of schema-based triple stores. Today we are going to learn about aggregations and algorithms.

Before we start, we need to get some data into the database so that we have something to play with. Instead of the sesame-seed-sized inline data we used the last few times, today we are moving up to peanut-sized data. The data we are going to use, and many of the examples we will present, are adapted from the book [Practical Gremlin](https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html), which teaches the Gremlin graph query language, a very different, imperative take on graphs (Datalog, by contrast, is declarative). It is always a good idea to explore different options for your problem and to decide for yourself which tool is best for you.

We start by defining the schema we need:

In [2]:
:schema

:put country {
    code: string unique,
    desc: string
}

:put continent {
    code: string unique,
    desc: string
}

:put airport {
    iata: string unique,
    icao: string index,
    city: string index,
    desc: string,
    region: string index,
    country: ref,
    runways: int,
    longest: int,
    altitude: int,
    lat: float,
    lon: float
}

:put route {
    src: ref,
    dst: ref,
    distance: int
}

:put geo {
    contains: ref many,
}

Unnamed: 0,attr_id,op
0,10000011,assert
1,10000012,assert
2,10000013,assert
3,10000014,assert
4,10000015,assert
5,10000016,assert
6,10000017,assert
7,10000018,assert
8,10000019,assert
9,10000020,assert


We intend the entities to be countries, continents, airports and routes. The attribute `geo.contains` denotes geographical inclusion. In our case, the `src` and `dst` of a `route` are always airport entities. Airports are uniquely identified by their `iata` code, and contain a slew of other attributes including latitudes and longitudes.

Now download the data, look over it to see what it contains, put it somewhere on your hard drive (we recommend next to the `cozoserver` executable so that the following script works verbatim) and run:

In [3]:
:db execute '../tests/air-routes-data.json'

Unnamed: 0,asserts,retracts
0,197646,0


The execution should not take too long. When it's done, we are set.

> Though peanut-sized by today's standards, the data still contains over 61k lines of JSON objects, some of which are quite long (yes, each line in the tx script is a valid JSON object), and it seems that the Python libraries we used to write the extension can't quite handle it. If you use the IPython magic `%%cozo_run_file` to run it, your Python process will likely hang.

## Exploratory data analysis

The data is new to us. First we need to see what it looks like. Let's start with airports.

In [4]:
?[iata, city, desc, region, runways, lat, lon] := 
    [a airport.iata iata],
    [a airport.city city],
    [a airport.desc desc],
    [a airport.region region],
    [a airport.runways runways],
    [a airport.lat lat],
    [a airport.lon lon]
    
:limit 5

Unnamed: 0,iata,city,desc,region,runways,lat,lon
0,ANC,Anchorage,Anchorage Ted Stevens,US-AK,3,61.1744,-149.996002
1,ATL,Atlanta,Hartsfield - Jackson Atlanta International Airport,US-GA,5,33.6367,-84.428101
2,AUS,Austin,Austin Bergstrom International Airport,US-TX,2,30.1945,-97.669899
3,BNA,Nashville,Nashville International Airport,US-TN,4,36.1245,-86.6782
4,BOS,Boston,Boston Logan,US-MA,6,42.3643,-71.005203


The only notable thing about this query is that we used the `:limit` option to limit the number of output rows. If we had left it out, thousands of rows would be returned and your browser might not like it. The `:offset` option is also available:

In [5]:
?[iata, city, desc, region, runways, lat, lon] := 
    [a airport.iata iata],
    [a airport.city city],
    [a airport.desc desc],
    [a airport.region region],
    [a airport.runways runways],
    [a airport.lat lat],
    [a airport.lon lon]

:offset 3
:limit 2

Unnamed: 0,iata,city,desc,region,runways,lat,lon
0,BNA,Nashville,Nashville International Airport,US-TN,4,36.1245,-86.6782
1,BOS,Boston,Boston Logan,US-MA,6,42.3643,-71.005203


There is a subtle point here: when you specify `:limit`, the database is constrained to return only that many rows to you. But _which_ rows it gives you is not specified (for performance reasons). In our case, even though the first returned IATA is ANC, that doesn't mean the smallest IATA is ANC (the output is sorted, yes, but only among the rows themselves). In fact, the query didn't even look at all the rows, since it can already satisfy what you ask it for by looking only at five rows!

If you want "global" sorting for your results before applying `:limit`, you have to ask for it and the database will be forced to look at all the data:

In [6]:
?[iata, city, desc, region, runways, lat, lon] := 
    [a airport.iata iata],
    [a airport.city city],
    [a airport.desc desc],
    [a airport.region region],
    [a airport.runways runways],
    [a airport.lat lat],
    [a airport.lon lon]
    
:limit 5
:order iata

Unnamed: 0,iata,city,desc,region,runways,lat,lon
0,AAA,Anaa,Anaa Airport,PF-U-A,1,-17.3526,-145.509995
1,AAE,Annabah,Annaba Airport,DZ-36,2,36.822201,7.80917
2,AAL,Aalborg,Aalborg Airport,DK-81,2,57.092759,9.849243
3,AAN,Al Ain,Al Ain International Airport,AE-AZ,1,24.2617,55.6092
4,AAQ,Anapa,Anapa Airport,RU-KDA,1,45.002102,37.347301


You can also sort in descending order (by prefixing the column name with a minus sign), or sort by multiple columns:

In [7]:
?[iata, city, desc, region, runways, lat, lon] := 
    [a airport.iata iata],
    [a airport.city city],
    [a airport.desc desc],
    [a airport.region region],
    [a airport.runways runways],
    [a airport.lat lat],
    [a airport.lon lon]
    
:limit 10
:order -runways, -city

Unnamed: 0,iata,city,desc,region,runways,lat,lon
0,DFW,Dallas,Dallas/Fort Worth International Airport,US-TX,7,32.896801,-97.038002
1,ORD,Chicago,Chicago O'Hare International Airport,US-IL,7,41.9786,-87.9048
2,DTW,Detroit,"Detroit Metropolitan, Wayne County",US-MI,6,42.212399,-83.353401
3,DEN,Denver,Denver International Airport,US-CO,6,39.861698,-104.672997
4,BOS,Boston,Boston Logan,US-MA,6,42.3643,-71.005203
5,AMS,Amsterdam,Amsterdam Airport Schiphol,NL-NH,6,52.308601,4.76389
6,UFA,Ufa,Ufa International Airport,RU-BA,5,54.557499,55.874401
7,YYZ,Toronto,Toronto Pearson International Airport,CA-ON,5,43.6772,-79.6306
8,TRG,Tauranga,Tauranga Airport,NZ-BOP,5,-37.671902,176.195999
9,SNN,Shannon,Shannon Airport,IE-CE,5,52.702,-8.92482


The above query finds the ten airports with the most runways. Airports with the same number of runways are sorted by city in reverse alphabetical order.
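The difference between limiting with and without ordering can be sketched outside the database. The following Python snippet is only an analogy, not how Cozo is implemented: the rows, the `scan` helper and both function names are hypothetical. It shows that a bare limit can stop after reading a few rows, while ordering first forces a look at all of them.

```python
from itertools import islice

# Hypothetical rows (iata, city, runways), in arbitrary storage order.
ROWS = [
    ("ORD", "Chicago", 7),
    ("AAA", "Anaa", 1),
    ("DFW", "Dallas", 7),
    ("BOS", "Boston", 6),
    ("AMS", "Amsterdam", 6),
]

def scan(rows, seen):
    """Yield rows lazily, recording every row that is actually read."""
    for row in rows:
        seen.append(row)
        yield row

def limit_only(rows, limit, offset=0):
    """Analogue of :offset/:limit without :order; stops reading early."""
    seen = []
    out = list(islice(scan(rows, seen), offset, offset + limit))
    return out, len(seen)

def order_then_limit(rows, limit):
    """Analogue of ':order -runways, -city' followed by ':limit'."""
    # Sorting needs every row, so nothing can be skipped here.
    ordered = sorted(rows, key=lambda r: (r[2], r[1]), reverse=True)
    return ordered[:limit]

rows, rows_read = limit_only(ROWS, 2)
print(rows, rows_read)
print(order_then_limit(ROWS, 3))
```

In this sketch `limit_only` reads only as many rows as it needs, which is why its result says nothing about the smallest or largest values in the whole data set.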

Of course, the first question when we have new data is "how many rows?". We delayed answering this question since it requires aggregation. (Technically you can do it without aggregation, since the query language we learned in the first tutorial is already Turing complete. But you would get back lots of irrelevant stuff together with the count if you did it that way. Turing machines are not efficient.) Here is how to count:

In [8]:
?[count(a)] := [a airport.iata iata]
:order count(a)

Unnamed: 0,count(a)
0,3504


The body of the rule is simple: we asked for all triples with the unique attribute `airport.iata`. But the aggregation `count` is applied to the _head_ of the rule instead of within the rule body.

We can mix aggregated head symbols with non-aggregates:

In [10]:
?[count(initial), initial] := [ct airport.iata iata], initial = first(chars(iata))

:order initial

Unnamed: 0,count(initial),initial
0,212,A
1,235,B
2,214,C
3,116,D
4,95,E
5,76,F
6,135,G
7,129,H
8,112,I
9,80,J


This gives you the number of airports for each initial. Any non-aggregated symbols in the head act as grouping variables (similar to `group by` in SQL).
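As a rough analogy in Python (not Cozo's implementation; the sample codes and the function name are made up for illustration), the non-aggregated head symbol plays the role of a dictionary key, and the aggregate is accumulated per key:

```python
# Hypothetical sample of IATA codes; the real data set has 3504 airports.
IATAS = ["ANC", "ATL", "AUS", "BNA", "BOS", "DFW"]

def count_by_initial(codes):
    """Group by the initial (the grouping variable) and count each group."""
    groups = {}
    for code in codes:
        initial = code[0]  # corresponds to first(chars(iata))
        groups[initial] = groups.get(initial, 0) + 1
    # Sort by the grouping key, like ':order initial'.
    return sorted(groups.items())

print(count_by_initial(IATAS))
```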

Another caveat lies here. Usually you can break a rule body into smaller parts by introducing other rules. But if we naively try to "refactor" the above query, we get nonsensical results:

In [11]:
initials[i] := [_ airport.iata iata], i = first(chars(iata))
?[count(initial), initial] := initials[initial]

:order initial

Unnamed: 0,count(initial),initial
0,1,A
1,1,B
2,1,C
3,1,D
4,1,E
5,1,F
6,1,G
7,1,H
8,1,I
9,1,J


What's happening? Remember that Cozo Datalog operates with set semantics instead of bag semantics. So in the first rule, the results are already de-duplicated. But for aggregations like `count`, counting must be done with bag semantics. In fact, if the first rule can _disambiguate_ the duplicates, you get the old results:

In [12]:
initials[i, iata] := [_ airport.iata iata], i = first(chars(iata))
?[count(initial), initial] := initials[initial, _]

:order initial

Unnamed: 0,count(initial),initial
0,212,A
1,235,B
2,214,C
3,116,D
4,95,E
5,76,F
6,135,G
7,129,H
8,112,I
9,80,J
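The effect of set semantics on counting can be imitated with Python sets. This is a sketch with hypothetical sample data and function names, not a description of Cozo internals: the first function collapses duplicates before counting, the second keeps an extra column so that each airport stays distinct.

```python
from collections import Counter

# Hypothetical sample of IATA codes, for illustration only.
CODES = ["ANC", "ATL", "AUS", "BNA", "BOS"]

def count_after_dedup(codes):
    """Mimic initials[i]: a set of initials, so duplicates vanish before counting."""
    initials = {code[0] for code in codes}
    return sorted(Counter(initials).items())

def count_with_witness(codes):
    """Mimic initials[i, iata]: the extra column keeps one tuple per airport."""
    initials = {(code[0], code) for code in codes}
    return sorted(Counter(i for i, _ in initials).items())

print(count_after_dedup(CODES))
print(count_with_witness(CODES))
```

The first call counts every initial exactly once, the second recovers the real per-initial counts.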


There are many aggregate functions in Cozo, and most of them should be quite familiar to anyone fluent in SQL. For example, the following calculates some statistics for runways:

In [13]:
?[count(r), count_unique(r), sum(r), min(r), max(r), mean(r), std_dev(r)] := 
    [a airport.runways r]

Unnamed: 0,count(r),count_unique(r),sum(r),min(r),max(r),mean(r),std_dev(r)
0,3504,7,4980.0,1,7,1.421233,0.743083


## Recursive aggregations

Much of the power of Datalog comes from its recursive rules. But with aggregations, recursion can be disallowed even without negation:

In [14]:
what[sum(r)] := [a airport.runways r]
what[sum(r)] := what[r]
?[r] := what[r]

eval::unstratifiable

  × Query is unstratifiable
  help: The rule 'what' is in the strongly connected component ["what"],
        and is involved in at least one forbidden dependency
        (negation, non-meet aggregation, or algorithm-application).


The compiler is right to reject the query since there is no meaningful interpretation for it. But sometimes there is. Let's see an example.

We want to find the distance of the _shortest route_ between two airports. One way to calculate it is to enumerate all the routes between the two airports, and then apply the `min` aggregation to the results. This cannot be implemented as stated, since the routes may contain cycles and hence there can be an infinite number of routes between two airports.

Instead, let's think recursively. If we already have the shortest distances between all nodes, can we derive an _equation_ satisfied by the shortest distance? Yes. The shortest distance from `a` to `b` is either the distance of a direct route from `a` to `b`, or the sum of the shortest distance from `a` to some `c` and the distance of a direct route from `c` to `b`, whichever is smaller. We apply our `min` aggregation to this recursive set instead. Let's write it out and try to find the shortest route between `LHR` and `YPO`:

In [None]:
shortest[b, min(dist)] := [a airport.iata 'LHR'], # Start with the airport 'LHR'
                          [r route.src a], [r route.dst b], [r route.distance dist] # Retrieve a direct route from 'LHR' to b

shortest[b, min(dist)] := shortest[c, d1], # Start with an existing shortest route from 'LHR' to c
                          [r route.src c], [r route.dst b], [r route.distance d2],  # Retrieve a direct route from c to b
                          dist <- d1 + d2 # Add the distances

?[dist] := [a airport.iata 'YPO'], shortest[a, dist] # Extract the answer for 'YPO'. 
                                                     # We chose it since it is the hardest airport to get to from 'LHR'.

Unnamed: 0,dist
0,4147


The surprise is that the compiler actually accepts this program and gives the correct answer for it! So there must be a fundamental difference between `min` and aggregations such as `sum` and `count`.

What is it then? We actually gave a hint above when we discussed the importance of applying bag instead of set semantics for `count`. For `min`, it doesn't matter which semantics you apply. The final result is the same either way.

Mathematically, we say that `min` is a _meet operation_, satisfying associativity, commutativity and idempotency. In Cozo, recursion through meet aggregations is allowed since the least fixed-point semantics can be extended to meet operations (if a rule contains several aggregations, all of them must be meet operations for it to be eligible for recursion).
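To make the fixed-point idea concrete, here is a minimal Python sketch of the two `shortest` rules evaluated by naive iteration. It is an illustration under stated assumptions rather than Cozo's actual evaluation strategy: the function name and the toy graph are hypothetical, and distances are assumed to be non-negative so that the loop terminates.

```python
def shortest_from(start, routes):
    """Iterate the two 'shortest' rules until nothing changes.

    routes is a list of (src, dst, distance) with non-negative distances.
    """
    shortest = {}
    # First rule: direct routes leaving the start airport.
    for src, dst, dist in routes:
        if src == start:
            shortest[dst] = min(dist, shortest.get(dst, dist))
    # Second rule: extend a known shortest distance by one direct route.
    changed = True
    while changed:
        changed = False
        for src, dst, d2 in routes:
            if src in shortest:
                candidate = shortest[src] + d2
                # min is idempotent: re-deriving the same value changes nothing,
                # so the iteration settles even though the graph has cycles.
                if candidate < shortest.get(dst, float("inf")):
                    shortest[dst] = candidate
                    changed = True
    return shortest

# Hypothetical toy graph containing the cycle A -> B -> A.
ROUTES = [("A", "B", 5), ("A", "C", 1), ("C", "B", 2), ("B", "A", 5), ("B", "D", 4)]
print(shortest_from("A", ROUTES))
```

Replacing `min` with a sum in the same loop would never settle on a graph with cycles, since every pass would add the same contributions again. That is the intuition for why only meet aggregations are eligible for recursion.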

By the way, there are much better and much faster ways to look for shortest routes. We will learn these later. The point of this example is that recursive aggregation is a very general construct that is enormously powerful. Tricky problems that in other databases require pulling all the data to the client and processing them in a general programming language can usually be solved by apt applications of recursive aggregations.

## Algorithms

Cozo's version of Datalog is already Turing-complete, yet we need aggregations for things like counting to be practically feasible and useful. In the same vein, any conceivable algorithm can be implemented with what we already have, but the implementation may be too complicated and inefficient to be of practical use.

Cozo claims to be a graph-focused database. There are common operations we want to do on graphs that are just awkward to do with Datalog (or any general-purpose query language, such as SQL). The code for the shortest path example we gave above is actually not too bad. For algorithms like PageRank it can get much worse.

In Cozo we take a pragmatic approach and introduce _algorithms_. They can be thought of as black-box rules that take in existing relations and produce a new relation according to their specification. For the shortest path, the appropriate algorithm to use is Dijkstra's algorithm:

In [16]:
paths[fr, to, dist] := [r route.src fr_a], [r route.dst to_a], [r route.distance dist], [fr_a airport.iata fr], [to_a airport.iata to]
starting[] <- [['LHR']]
goal[] <- [['YPO']]
?[starting, goal, distance, path] <~ ShortestPathDijkstra(paths[], starting[], goal[])

Unnamed: 0,starting,goal,distance,path
0,LHR,YPO,4147.0,"['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"


Algorithm application is indicated by the `<~` symbol separating the rule head and rule body. As with constant rules, rule head bindings can be omitted. The algorithm is then called like a function, except that it takes relations as arguments. Above we have used three relations we defined inline. For stored relations, the notation is `:stored_relation`.
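For readers who have not seen Dijkstra's algorithm before, the following is a minimal textbook sketch in Python using a binary heap. It is not the code behind `ShortestPathDijkstra`; the function name, the edge-list format and the toy graph are assumptions made for this illustration, and edge weights are assumed to be non-negative.

```python
import heapq

def dijkstra(edges, start, goal):
    """Return (distance, path) from start to goal, or None if unreachable.

    edges is a list of (src, dst, distance) with non-negative distances.
    """
    graph = {}
    for src, dst, dist in edges:
        graph.setdefault(src, []).append((dst, dist))
    best = {start: 0}   # best known distance per node
    prev = {}           # predecessor on the best known path
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > best.get(node, float("inf")):
            continue  # stale heap entry, a shorter distance was found already
        if node == goal:
            path = [node]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < best.get(nxt, float("inf")):
                best[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    return None

# Hypothetical toy graph, not taken from the air-routes data.
EDGES = [("A", "B", 5), ("A", "C", 1), ("C", "B", 2), ("B", "A", 5), ("B", "D", 4)]
print(dijkstra(EDGES, "A", "D"))
```

The three arguments mirror the shape of the query above: an edge relation, a starting node and a goal node.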

Some algorithms take additional arguments. The following example tackles the same problem, but returns the ten shortest paths instead of only the shortest one:

In [17]:
paths[fr, to, dist] := [r route.src fr_a], [r route.dst to_a], [r route.distance dist], [fr_a airport.iata fr], [to_a airport.iata to]
starting[] <- [['LHR']]
goal[] <- [['YPO']]
?[starting, goal, distance, path] <~ KShortestPathYen(paths[], starting[], goal[], k: 10)

Unnamed: 0,starting,goal,distance,path
0,LHR,YPO,4147.0,"['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
1,LHR,YPO,4150.0,"['LHR', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
2,LHR,YPO,4164.0,"['LHR', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
3,LHR,YPO,4167.0,"['LHR', 'DUB', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
4,LHR,YPO,4187.0,"['LHR', 'MAN', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
5,LHR,YPO,4202.0,"['LHR', 'IOM', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
6,LHR,YPO,4204.0,"['LHR', 'MAN', 'DUB', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
7,LHR,YPO,4209.0,"['LHR', 'YUL', 'YMT', 'YNS', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
8,LHR,YPO,4211.0,"['LHR', 'MAN', 'IOM', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"
9,LHR,YPO,4212.0,"['LHR', 'DUB', 'YUL', 'YMT', 'YNS', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']"


As we can see, in addition to relations as arguments, an algorithm can also take _parameters_, `k` in this case.
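To get a feel for what a "k shortest paths" query computes, here is a small Python sketch. Note that it is deliberately _not_ Yen's algorithm: it is a simpler best-first enumeration of loop-free paths, which returns paths in the same order of increasing distance but can be exponentially slower on large graphs. The function name and the toy graph are hypothetical, and distances are assumed to be non-negative.

```python
import heapq

def k_shortest_simple_paths(edges, start, goal, k):
    """Return up to k loop-free paths from start to goal as (distance, path),
    in order of non-decreasing distance.

    Best-first enumeration of partial paths; NOT Yen's algorithm.
    """
    graph = {}
    for src, dst, dist in edges:
        graph.setdefault(src, []).append((dst, dist))
    found = []
    heap = [(0, [start])]  # (distance so far, partial path)
    while heap and len(found) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == goal:
            found.append((cost, path))
            continue
        for nxt, w in graph.get(node, []):
            if nxt not in path:  # keep paths loop-free
                heapq.heappush(heap, (cost + w, path + [nxt]))
    return found

# Hypothetical toy graph, not taken from the air-routes data.
EDGES = [("A", "B", 5), ("A", "C", 1), ("C", "B", 2),
         ("B", "A", 5), ("B", "D", 4), ("C", "D", 10)]
for dist, path in k_shortest_simple_paths(EDGES, "A", "D", k=2):
    print(dist, path)
```

Here `k` plays the same role as the `k` parameter in the query above: it bounds how many paths are returned, and fewer are returned when fewer exist.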