{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The distillation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext pycozo.ipyext_direct\n", "%cozo_auth tutorial *******" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Welcome back! You already know how to use simple Datalog queries and stored relations in Cozo, and you have learned the intricacies of schema-based triple stores. Today we are going to learn about aggregations and algorithms.\n", "\n", "Before we start, we need to get some data into the database so that we can play with them. Instead of sesame-seed-sized inline data we used the last few times, today we are moving towards peanut-sized data. The data we are going to use, and many examples that we will present, are adapted from the book [Practical Gremlin](https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html), which teaches the Gremlin graph query language, a very different, imperative take on graphs (Datalog, by constrast, is declarative). It is always a good idea to explore different options for your problem and to decide for yourself which tool is best for you." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by defining the schema we need:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 attr_idop
010000011assert
110000012assert
210000013assert
310000014assert
410000015assert
510000016assert
610000017assert
710000018assert
810000019assert
910000020assert
1010000021assert
1110000022assert
1210000023assert
1310000024assert
1410000025assert
1510000026assert
1610000027assert
1710000028assert
1810000029assert
\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ ":schema\n", "\n", ":put country {\n", " code: string unique,\n", " desc: string\n", "}\n", "\n", ":put continent {\n", " code: string unique,\n", " desc: string\n", "}\n", "\n", ":put airport {\n", " iata: string unique,\n", " icao: string index,\n", " city: string index,\n", " desc: string,\n", " region: string index,\n", " country: ref,\n", " runways: int,\n", " longest: int,\n", " altitude: int,\n", " lat: float,\n", " lon: float\n", "}\n", "\n", ":put route {\n", " src: ref,\n", " dst: ref,\n", " distance: int\n", "}\n", "\n", ":put geo {\n", " contains: ref many,\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We intend the entities to be countries, continents, airports and routes. The attribute `geo.contains` denotes geographical inclusion. In our case, the `src` and `dst` of a `route` are always airport entities. Airports are uniquely identified by their `iata` code, and contain a slew of other attributes including latitudes and longitudes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now download the data, look over it to see what it contains, put it somewhere on your hard drive (we recommend next to the `cozoserver` executable so that the following script works verbatim) and run:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 assertsretracts
01976460
\n" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ ":db execute '../tests/air-routes-data.json'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The execution should not take to long. When it's done, we are set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Though peanut-sized by today's standard, the data still contains over 61k lines of JSON objects, some of which are quite long lines (yes, each line in the tx script is a valid JSON object), and it seems that the Python libraries we used to write the extension can't quite handle it. If you use the IPython magic `%%cozo_run_file` to run it, your python process will likely hang." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory data analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data is new to us. First we need to see what it looks like. Let's start with airports." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 iatacitydescregionrunwayslatlon
0ANCAnchorageAnchorage Ted StevensUS-AK361.174400-149.996002
1ATLAtlantaHartsfield - Jackson Atlanta International AirportUS-GA533.636700-84.428101
2AUSAustinAustin Bergstrom International AirportUS-TX230.194500-97.669899
3BNANashvilleNashville International AirportUS-TN436.124500-86.678200
4BOSBostonBoston LoganUS-MA642.364300-71.005203
\n" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?[iata, city, desc, region, runways, lat, lon] := \n", " [a airport.iata iata],\n", " [a airport.city city],\n", " [a airport.desc desc],\n", " [a airport.region region],\n", " [a airport.runways runways],\n", " [a airport.lat lat],\n", " [a airport.lon lon]\n", " \n", ":limit 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The only notable thing about this query is that we used the `:limit` option to limit the number of output rows. If we did not put it in, thousands of rows will be returned and your browser may not like it. The `:offset` option is also available:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 iatacitydescregionrunwayslatlon
0BNANashvilleNashville International AirportUS-TN436.124500-86.678200
1BOSBostonBoston LoganUS-MA642.364300-71.005203
\n" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?[iata, city, desc, region, runways, lat, lon] := \n", " [a airport.iata iata],\n", " [a airport.city city],\n", " [a airport.desc desc],\n", " [a airport.region region],\n", " [a airport.runways runways],\n", " [a airport.lat lat],\n", " [a airport.lon lon]\n", "\n", ":offset 3\n", ":limit 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a subtle point here: when you specify `:limit`, the database is constrained to return only that many rows to you. But _which_ rows it gives you is not specified (for performance reasons). In our case, even though the first returned IATA is ANC, that doesn't mean the smallest IATA is ANC (the output is sorted, yes, but only among the rows themselves). In fact, the query didn't even look at all the rows, since it can already satisfy what you ask it for by looking only at five rows!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want \"global\" sorting for your results before applying `:limit`, you have to ask for it and the database will be forced to look at all the data:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 iatacitydescregionrunwayslatlon
0AAAAnaaAnaa AirportPF-U-A1-17.352600-145.509995
1AAEAnnabahAnnaba AirportDZ-36236.8222017.809170
2AALAalborgAalborg AirportDK-81257.0927599.849243
3AANAl AinAl Ain International AirportAE-AZ124.26170055.609200
4AAQAnapaAnapa AirportRU-KDA145.00210237.347301
\n" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?[iata, city, desc, region, runways, lat, lon] := \n", " [a airport.iata iata],\n", " [a airport.city city],\n", " [a airport.desc desc],\n", " [a airport.region region],\n", " [a airport.runways runways],\n", " [a airport.lat lat],\n", " [a airport.lon lon]\n", " \n", ":limit 5\n", ":order iata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also sort in descending order (by prefixing the sorted column name by the minus sign), or sort by multiple columns:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 iatacitydescregionrunwayslatlon
0DFWDallasDallas/Fort Worth International AirportUS-TX732.896801-97.038002
1ORDChicagoChicago O'Hare International AirportUS-IL741.978600-87.904800
2DTWDetroitDetroit Metropolitan, Wayne CountyUS-MI642.212399-83.353401
3DENDenverDenver International AirportUS-CO639.861698-104.672997
4BOSBostonBoston LoganUS-MA642.364300-71.005203
5AMSAmsterdamAmsterdam Airport SchipholNL-NH652.3086014.763890
6UFAUfaUfa International AirportRU-BA554.55749955.874401
7YYZTorontoToronto Pearson International AirportCA-ON543.677200-79.630600
8TRGTaurangaTauranga AirportNZ-BOP5-37.671902176.195999
9SNNShannonShannon AirportIE-CE552.702000-8.924820
\n" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?[iata, city, desc, region, runways, lat, lon] := \n", " [a airport.iata iata],\n", " [a airport.city city],\n", " [a airport.desc desc],\n", " [a airport.region region],\n", " [a airport.runways runways],\n", " [a airport.lat lat],\n", " [a airport.lon lon]\n", " \n", ":limit 10\n", ":order -runways, -city" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above query finds the airports with the most runways, sorted by their city in reverse alphabetical order." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, the first question when we have new data is \"how many rows\". We delayed answering this question since it requires aggregation (technically you can do it with aggregation since the query language we learned in the first tutorial is already Turing complete. But you need to get back lots of irrelevant stuff together with the count if you do it that way. Turing machines are not efficient). Here it is, how to count:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 count(a)
03504
\n" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?[count(a)] := [a airport.iata iata]\n", ":order count(a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The body of the rule is simple: we asked for all triples with the unique attribute `airport.iata`. But the aggregation `count` is applied to the _head_ of the rule instead of within the rule body." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can mix aggregated head symbols with non-aggregates:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 count(initial)initial
0212A
1235B
2214C
3116D
495E
576F
6135G
7129H
8112I
980J
10197K
11184L
12228M
13111N
1489O
15203P
167Q
17121R
18245S
19205T
2077U
2186V
2259W
2328X
24211Y
2549Z
\n" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?[count(initial), initial] := [ct airport.iata iata], initial = first(chars(iata))\n", "\n", ":order initial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives you the number of airports with different initials. Any non-aggregated symbols in the head acts as grouping variables (similar to `group by` in SQL)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another caveat lies here. Usually you can break a rule body into smaller parts by introducing other rules. But if we naively try to \"refactor\" the above query, we get nonsensical results:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 count(initial)initial
01A
11B
21C
31D
41E
51F
61G
71H
81I
91J
101K
111L
121M
131N
141O
151P
161Q
171R
181S
191T
201U
211V
221W
231X
241Y
251Z
\n" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "initials[i] := [_ airport.iata iata], i = first(chars(iata))\n", "?[count(initial), initial] := initials[initial]\n", "\n", ":order initial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's happening? Remember that Cozo Datalog operates with set semantics instead of bag semantics. So in the first rule, the results are already de-duplicated. But for aggregations like `count`, counting must be done with bag semantics. In fact, if the first rule can _disambiguate_ the duplicates, you get the old results:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 count(initial)initial
0212A
1235B
2214C
3116D
495E
576F
6135G
7129H
8112I
980J
10197K
11184L
12228M
13111N
1489O
15203P
167Q
17121R
18245S
19205T
2077U
2186V
2259W
2328X
24211Y
2549Z
\n" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "initials[i, iata] := [_ airport.iata iata], i = first(chars(iata))\n", "?[count(initial), initial] := initials[initial, _]\n", "\n", ":order initial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many aggregate functions in Cozo, most of them should be quite familiar for anyone fluent in SQL. For example, the following calculates the statistics for runways:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 count(r)count_unique(r)sum(r)min(r)max(r)mean(r)std_dev(r)
0350474980.000000171.4212330.743083
\n" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?[count(r), count_unique(r), sum(r), min(r), max(r), mean(r), std_dev(r)] := \n", " [a airport.runways r]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recursive aggregations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Much of the power of Datalog comes from its recursive rules. But with aggregations, recursion can be disallowed even without negation:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[31meval::unstratifiable\u001b[0m\n", "\n", " \u001b[31m×\u001b[0m Query is unstratifiable\n", "\u001b[36m help: \u001b[0mThe rule 'what' is in the strongly connected component [\"what\"],\n", " and is involved in at least one forbidden dependency\n", " (negation, non-meet aggregation, or algorithm-application).\n" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "what[sum(r)] := [a airport.runways r]\n", "what[sum(r)] := what[r]\n", "?[r] := what[r]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The compiler is right to reject the query since there is no meaningful interpretation for it. But sometimes there is. Let's see an example.\n", "\n", "We want to find the distance of the _shortest route_ between two airports. One way to calculate is to enumerate all the routes between the two airports, and then apply `min` aggregation to the results. This cannot be implemented as stated, since the routes may contain cycles and hence there can be an infinite number of routes between two airports.\n", "\n", "Instead, let's think recursively. If we already have all the shortest routes between all nodes, can we derive an _equation_ satisfied by the shortest route? Yes, A shortest route between `a` and `b` is either the distance of a direct route, or the sum of the shortest distance from `a` to `c` and the distance of a direct route from `c` to `d`. We apply our `min` aggregation to this recursive set instead. Let's write it out and try to find the shortest route between `LHR` and `YPO`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 dist
04147
\n" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shortest[b, min(dist)] := [a airport.iata 'LHR'], # Start with the airport 'LHR'\n", " [r route.src a], [r route.dst b], [r route.distance dist] # Retrive a direct route from 'LHR' to b\n", "\n", "shortest[b, min(dist)] := shortest[c, d1], # Start with an existing shortest route from 'LHR' to c\n", " [r route.src c], [r route.dst b], [r route.distance d2], # Retrieve a direct route from c to b\n", " dist <- d1 + d2 # Add the distances\n", "\n", "?[dist] := [a airport.iata 'YPO'], shortest[a, dist] # Extract the answer for 'YPO'. \n", " # We chose it since it is the hardest airport to get to from 'LHR'." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The surprise is that the compiler actually accepts this program and gives the correct answer for it! So there must be a fundamental difference between the `min` and `count` aggregations.\n", "\n", "What is it then? We actually gave a hint above when we discussed the importance of applying set instead of bag semantics for `count`. For `min`, it doesn't matter which semantics you apply. The final result is the same either way.\n", "\n", "Mathematically, we say that `min` is a _meet operation_ satisfying commutativity, distributivity and idempotency. In Cozo, recursion through meet aggregations is allowed since the minimum fixed-point semantics can be extended to meet operations (if a rule contains several aggregations, all must be meet operations for it to be eligible for recursion).\n", "\n", "By the way, there are much better and much faster ways to look for shortest routes. We will learn these later. The point of this example is that recursive aggregation is a very general construct that is enormously powerful. Tricky problems that in other databases require pulling all the data to the client and processing them in a general programming language can usually be solved by apt applications of recursive aggregations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cozo's version of Datalog is already Turing-complete, yet we need aggregations for things like counting to be practically feasible and useful. In the same vein, any conceivable algorithm can be implemented with what we already have, but the implementation may be too complicated and inefficient to be of practical use.\n", "\n", "Cozo claimed to be a graph-focused database. There are common operations we want to do on graphs that are just awkward to do with Datalog (or any general purpose query language, such as SQL). The code for the shortest path example we gave above is actually not too bad. For algorithms like PageRank it can get much worse." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Cozo we take a pragmatic approach and introduce _algorithms_. They can be thought of as black-box rules that take in existing relations and produce a new relation according to its specification. For the shortest path, the appropriate algorithm to use is Dijkstra's algorithm:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 startinggoaldistancepath
0LHRYPO4147.000000['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
\n" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "paths[fr, to, dist] := [r route.src fr_a], [r route.dst to_a], [r route.distance dist], [fr_a airport.iata fr], [to_a airport.iata to]\n", "starting[] <- [['LHR']]\n", "goal[] <- [['YPO']]\n", "?[starting, goal, distance, path] <~ ShortestPathDijkstra(paths[], starting[], goal[])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Algorithm application is indicated by the `<~` symbol separating the rule head and rule body. As for constant rules, rule head bindings can be omitted. The algorithm is then called like a function, but taking in relations as arguments. Above we have used three relations we defined inline. For stored relations, the notation is `:stored_relation`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some algorithms take in additional arguments. The following example calculates the shortest path to the same problem, but returns the ten shortest paths instead:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 startinggoaldistancepath
0LHRYPO4147.000000['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
1LHRYPO4150.000000['LHR', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
2LHRYPO4164.000000['LHR', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
3LHRYPO4167.000000['LHR', 'DUB', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
4LHRYPO4187.000000['LHR', 'MAN', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
5LHRYPO4202.000000['LHR', 'IOM', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
6LHRYPO4204.000000['LHR', 'MAN', 'DUB', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
7LHRYPO4209.000000['LHR', 'YUL', 'YMT', 'YNS', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
8LHRYPO4211.000000['LHR', 'MAN', 'IOM', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
9LHRYPO4212.000000['LHR', 'DUB', 'YUL', 'YMT', 'YNS', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
\n" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "paths[fr, to, dist] := [r route.src fr_a], [r route.dst to_a], [r route.distance dist], [fr_a airport.iata fr], [to_a airport.iata to]\n", "starting[] <- [['LHR']]\n", "goal[] <- [['YPO']]\n", "?[starting, goal, distance, path] <~ KShortestPathYen(paths[], starting[], goal[], k: 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, in addition to relations as arguments, an algorithm can also take _parameters_, `k` in this case." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }