main
Ziyang Hu 2 years ago
parent 66ae839b0d
commit ff239dddd3

@ -52,32 +52,14 @@ So what is essential about these relational databases that has earned them such
But the NoSQL movement did occur, and with good reasons: relational databases fail in some ways. Everyone perhaps has their own list of perceived shortcomings of relational databases, such as (the old relational systems') inability to deal with the Big Data that came with the explosion of the Internet. One of these complaints is particularly unfortunate, however: the claim that relational databases are just bad with graph data. This accusation is particularly acute in the age of social networks. However, "graphs", "networks" and "relationships" are near-synonyms, and "relational" is even in the name of relational algebra! In fact, relational algebra itself is perfectly capable of dealing with graph structures, and with recursion introduced, traditional relational databases can be no less powerful than dedicated graph databases.
If relational algebra itself is not a real obstacle, why are many graph databases "going beyond" it, and in the process throwing away the closure property, which in practice makes the data stored much harder to use beyond the business logic originally envisioned? We think SQL is to blame. The syntax is kind of backward (it really logically should be "FROM-WHERE-SELECT" rather than the traditional "SELECT-FROM-WHERE", both humans and auto-completions have to mentally reorder as a consequence), inline nesting is hard to read and has corner cases (certain types of "correlated queries" which in fact cannot be expressed in relational algebra), common table expressions are clunky and escalate quickly to unreadability when recursion is thrown in. And nesting, joins, and recursion are essential for graphs. In this day, using SQL for querying graphs feels like using FORTRAN for scripting webpages.
If relational algebra itself is not a real obstacle, why are many graph databases "going beyond" it, and in the process throwing away the closure property, which in practice makes the stored data much harder to use beyond the business logic originally envisioned? We think SQL is to blame. The syntax is kind of backward: logically it should be "FROM-WHERE-SELECT" rather than the traditional "SELECT-FROM-WHERE", so both humans and auto-completion tools have to mentally reorder it. Inline nesting is hard to read and has corner cases (certain types of "correlated queries" which in fact cannot be expressed in relational algebra). Common table expressions are clunky and quickly become unreadable when recursion is thrown in. And SQL actually differs from relational algebra in a fundamental way by adopting bag instead of set semantics, which is problematic for recursion. Nesting, joins, and recursion are essential for graphs but clumsy in SQL, so these days using SQL for querying graphs feels like using FORTRAN for scripting webpages.
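To make the point about set semantics and recursion concrete, here is a minimal Python sketch (not Cozo code, and not how any particular database implements it) that computes graph reachability as a fixed point over a set of tuples. The function name `transitive_closure` and the sample `edges` relation are made up for illustration. Because the result is a set, re-deriving a pair that is already known adds nothing, so the loop stops even though the graph contains a cycle; under bag semantics the same recursion would keep producing duplicates around the cycle.

```python
def transitive_closure(edges):
    """Return every (src, dst) pair such that dst is reachable from src.

    `edges` is a set of (src, dst) tuples, i.e. a binary relation.
    """
    reachable = set(edges)  # all paths found so far
    delta = set(edges)      # paths found in the previous round only
    while delta:
        # Relational join: path(a, c) :- delta(a, b), edge(b, c)
        derived = {(a, c) for (a, b) in delta for (b2, c) in edges if b == b2}
        # Set difference keeps only pairs that are genuinely new.
        delta = derived - reachable
        reachable |= delta
    return reachable


# A small graph with a cycle a -> b -> c -> a and a tail c -> d.
edges = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")}
closure = transitive_closure(edges)
```

The loop body is nothing more than a join followed by a set difference and a union, which is the sense in which relational algebra plus recursion is enough for this kind of graph query.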
Datalog is a solution ...
In fact, another much simpler query language has existed for quite some time (since 1986): Datalog, whose non-recursive part is equivalent to relational algebra. It is usually encountered when reading papers on relational database theory, where using SQL for mathematical reasoning is just too unwieldy. Most theoretical books on relational databases even have a chapter or section on Datalog, because it is so simple and "helps one to write SQL correctly". Not many databases support it directly though, a testament to the fear of "breaking compatibility" and hence losing market share. And those databases that do support Datalog and are available to the public certainly cannot be considered general-purpose databases.
Commercial systems are averse to breaking SQL compatibility ...
This is where Cozo comes in. We want to prove, through a real database implementation, that the relational model can be made much simpler and much more pleasant to use if we are prepared to ditch the SQL syntax. Furthermore, by combining the core relational algebra with recursion and aggregation (in a somewhat different way than is usually done in SQL), we want to show that relational databases are perfectly capable of dealing with graphs efficiently, with a syntax that is easy to write, read and understand. How far we have succeeded is up to you, the user, to judge.
## Another database?!
## Non-goals
Every few days a new database comes out and is advertised to be the Next Big Thing. This presents difficulties for users who try to decide which to use for the next project ... well, we actually didn't have any such difficulty 90% of the time: just stick to sqlite or postgres[^1]! For the remaining 10%, though, we are troubled by heavy joins that are too complicated to read, recursive CTEs that are a total pain to write, or mysterious query (anti-)optimizations that require a PhD degree to debug. And these invariably happen when we try to process our data mainly as networks, not tables.
[^1]: Or cassandra if the data is really too big, but cassandra is not as nice to use.
Yeah, we know there are graph databases designed just for this use case. We've used dozens of them at various stages. Some of them use syntax that is an improvement over SQL for simple graph cases but is actually not substantially more expressive for complicated situations. Some of them are super powerful but require you to write semi-imperative code. A few of them are "multi-paradigm" and attempt to support different logical data models simultaneously, with the result that none was supported very well. So we are not very satisfied.
## Goals and non-goals
In a sense, Cozo is our ongoing experiment for building a database that is powerful, reliable and at the same time a joy to use.
* Clean syntax, easy to write and read, even when dealing with convoluted problems.
* Well-defined semantics, even when dealing with recursions.
* Efficient short-cuts for common graph queries.
* No elaborate set-up necessary for ad-hoc queries and explorations.
* Integrity and consistency of data when explicitly required.
At the moment, the following are non-goals (but this may change in the future):
* Sharding, distributed system. This would greatly increase the complexity of the system, and we don't think it is appropriate to pour energy into it at this experimentation stage.
* Additional query languages, e.g. SQL, GraphQL.
* Support for more paradigms, e.g. a document store.
* As Cozo is currently considered an experiment, it is probably not going to have distributed functions for quite some time, if ever.
* A standard feature of traditional RDBMS is the query optimizer. Cozo is not going to have one in the traditional sense for the moment, for two reasons. The first is that building a good query optimizer takes an enormous amount of time, and at the moment we do not want to spend our time implementing one. The second, more fundamental reason is that, even with good query optimizers like the one in PostgreSQL, their usefulness in actually optimizing (instead of de-optimizing) queries decreases exponentially with the number of joins present. And graph queries tend to contain many more joins than non-graph queries. For complex queries, "debugging" the query plan is actually much harder than specifying the plan explicitly (which you cannot do in RDBMS, for some reason). In Cozo the execution order can be determined explicitly from how the query is written: there is no guesswork, and you do not play hide-and-seek with the query planner. We believe that end users must understand their data sufficiently to use it efficiently, and even a superficial understanding allows one to write a reasonably efficient query. In our experience, the approach taken by traditional RDBMS is akin to a strongly typed programming language forbidding (or heavily discouraging) the programmer from writing _any_ type declarations and insisting that all types be inferred, thus giving its implementers an impossible task. When Cozo becomes more mature, we _may_ introduce query optimizers for limited situations in which they can have large benefits, but explicit specification will always remain an option.
* Cozo is not mature enough to benefit from elaborate account and security subsystems. Currently, Cozo has a required password authentication scheme with no defaults, but it is not considered sufficient for any purpose on the Internet. You should only run Cozo within your trusted network. The current security scheme is only meant as a last line of defense against the sorry situation of inadvertently exposing large swathes of data to the Internet.
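
The remark above about execution order following the written order of a query can be illustrated with a toy evaluator. This is a hypothetical Python sketch, not Cozo's engine or syntax: `join_in_order`, `follows` and `target` are names invented here. It joins the atoms of a conjunctive query strictly left to right and records how many intermediate rows each step produces, so the cost of a given ordering is visible to whoever wrote the query.

```python
def join_in_order(atoms):
    """Evaluate a conjunctive query strictly in the order it is written.

    `atoms` is a list of (variable_names, rows) pairs. Returns the final
    variable bindings and the number of intermediate bindings after each atom.
    """
    bindings = [{}]
    sizes = []
    for variables, rows in atoms:
        next_bindings = []
        for binding in bindings:
            for row in rows:
                candidate = dict(binding)
                consistent = True
                for var, value in zip(variables, row):
                    # Bind the variable, or check it against its existing value.
                    if candidate.setdefault(var, value) != value:
                        consistent = False
                        break
                if consistent:
                    next_bindings.append(candidate)
        bindings = next_bindings
        sizes.append(len(bindings))
    return bindings, sizes


# Toy data: 20 people who all follow each other, and one person of interest.
follows = [(i, j) for i in range(20) for j in range(20) if i != j]
target = [(0,)]

# The same query written in two orders: follows(x, y), follows(y, z), target(z).
wide_first, sizes_wide = join_in_order(
    [(("x", "y"), follows), (("y", "z"), follows), (("z",), target)]
)
narrow_first, sizes_narrow = join_in_order(
    [(("z",), target), (("y", "z"), follows), (("x", "y"), follows)]
)
```

Both orderings return the same answers, but the first builds thousands of intermediate rows before the selective `target` atom is applied, while the second never holds more rows than the final result. With an order-preserving evaluator like this one, that difference is something the query author controls directly.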