diff --git a/DOC/AUTO_CLUSTERING.md b/DOC/AUTO_CLUSTERING.md
new file mode 100644
index 00000000..137285e5
--- /dev/null
+++ b/DOC/AUTO_CLUSTERING.md
@@ -0,0 +1,156 @@
+# Automatic clustering
+This document describes various ways to dynamically form rqlite clusters, which is particularly useful for automating your deployment of rqlite.
+
+> :warning: **This functionality was introduced in version 7.0. It does not exist in earlier releases.**
+
+## Contents
+* [Quickstart](#quickstart)
+  * [Automatic Bootstrapping](#automatic-bootstrapping)
+  * [Using DNS for Bootstrapping](#using-dns-for-bootstrapping)
+    * [DNS SRV](#dns-srv)
+  * [Kubernetes](#kubernetes)
+  * [Consul](#consul)
+  * [etcd](#etcd)
+* [Next steps](#next-steps)
+  * [Customizing your configuration](#customizing-your-configuration)
+  * [Running multiple different clusters](#running-multiple-different-clusters)
+* [Design](#design)
+
+## Quickstart
+
+### Automatic Bootstrapping
+While [manually creating a cluster](https://github.com/rqlite/rqlite/blob/master/DOC/CLUSTER_MGMT.md) is simple, it does suffer from one drawback -- you must start one node first, and with different options, so it can become the Leader. _Automatic Bootstrapping_, in contrast, allows you to start all the nodes at once, and in a very similar manner. **You just need to know the network addresses of the nodes ahead of time.**
+
+For simplicity, let's assume you want to run a 3-node rqlite cluster. The network addresses of the nodes are `$HOST1`, `$HOST2`, and `$HOST3`. To bootstrap the cluster, use the `-bootstrap-expect` option like so:
+
+Node 1:
+```bash
+rqlited -node-id 1 -http-addr=$HOST1:4001 -raft-addr=$HOST1:4002 \
+-bootstrap-expect 3 -join http://$HOST1:4001,http://$HOST2:4001,http://$HOST3:4001 data
+```
+Node 2:
+```bash
+rqlited -node-id 2 -http-addr=$HOST2:4001 -raft-addr=$HOST2:4002 \
+-bootstrap-expect 3 -join http://$HOST1:4001,http://$HOST2:4001,http://$HOST3:4001 data
+```
+Node 3:
+```bash
+rqlited -node-id 3 -http-addr=$HOST3:4001 -raft-addr=$HOST3:4002 \
+-bootstrap-expect 3 -join http://$HOST1:4001,http://$HOST2:4001,http://$HOST3:4001 data
+```
+
+`-bootstrap-expect` should be set to the number of nodes that must be available before the bootstrapping process will commence, in this case 3. You also set `-join` to the HTTP URLs of all 3 nodes in the cluster. **It's also required that each launch command has the same values for `-bootstrap-expect` and `-join`.**
+
+After the cluster has formed, you can launch more nodes with the same options. A node will always attempt to first perform a normal cluster-join using the given join addresses, before trying the bootstrap approach.
+
+#### Docker
+With Docker you can launch every node identically:
+```bash
+docker run rqlite/rqlite -bootstrap-expect 3 -join http://$HOST1:4001,http://$HOST2:4001,http://$HOST3:4001
+```
+where `$HOST[1-3]` are the expected network addresses of the containers.
+
+__________________________
+
+### Using DNS for Bootstrapping
+You can also use the Domain Name System (DNS) to bootstrap a cluster. This is similar to automatic clustering, but doesn't require you to specify the network addresses of other nodes at the command line. Instead you create a DNS record for the host `rqlite.local`, with an [A Record](https://www.cloudflare.com/learning/dns/dns-records/dns-a-record/) for each rqlite node's IP address.
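+
+For example, for a 3-node cluster the zone might contain BIND-style records like the following (a sketch -- the IP addresses and TTL are illustrative):
+```
+rqlite.local. 300 IN A 10.0.1.1
+rqlite.local. 300 IN A 10.0.1.2
+rqlite.local. 300 IN A 10.0.1.3
+```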
+
+To launch a node with node ID `$ID` and network address `$HOST`, using DNS for cluster bootstrap, execute the following (example) command:
+```bash
+rqlited -node-id $ID -http-addr=$HOST:4001 -raft-addr=$HOST:4002 \
+-disco-mode=dns -disco-config='{"name":"rqlite.local"}' -bootstrap-expect 3 data
+```
+You would launch other nodes similarly, setting `$ID` and `$HOST` as required for each node. In the example above, resolving `rqlite.local` should result in 3 IP addresses.
+
+#### DNS SRV
+Using [DNS SRV](https://www.cloudflare.com/learning/dns/dns-records/dns-srv-record/) gives you more control over the rqlite node address details returned by DNS, including the HTTP port each node is listening on. This means that, unlike using just simple DNS records, each rqlite node can be listening on a different HTTP port. Simple DNS records are probably good enough for most situations, however.
+
+To launch a node using DNS SRV bootstrap, execute the following (example) command:
+```bash
+rqlited -node-id $ID -http-addr=$HOST:4001 -raft-addr=$HOST:4002 \
+-disco-mode=dns-srv -disco-config='{"name":"rqlite.local","service":"rqlite-svc"}' -bootstrap-expect 3 data
+```
+You would launch other nodes similarly, setting `$ID` and `$HOST` as required for each node. In the example above rqlite will look up SRV records at `_rqlite-svc._tcp.rqlite.local`.
+__________________________
+
+### Kubernetes
+DNS-based approaches can be quite useful for many deployment scenarios, in particular systems like Kubernetes. To learn how to deploy rqlite on Kubernetes, check the [Kubernetes deployment guide](https://github.com/rqlite/rqlite/blob/master/DOC/KUBERNETES.md).
+__________________________
+
+### Consul
+Another approach uses [Consul](https://www.consul.io/) to coordinate clustering. The advantage of this approach is that you do not need to know the network addresses of all nodes ahead of time.
+
+Let's assume your Consul cluster is running at `http://example.com:8500`. Let's also assume that you are going to run 3 rqlite nodes, each node on a different machine. Launch your rqlite nodes as follows:
+
+Node 1:
+```bash
+rqlited -node-id $ID1 -http-addr=$HOST1:4001 -raft-addr=$HOST1:4002 \
+-disco-mode consul-kv -disco-config '{"address":"example.com:8500"}' data
+```
+Node 2:
+```bash
+rqlited -node-id $ID2 -http-addr=$HOST2:4001 -raft-addr=$HOST2:4002 \
+-disco-mode consul-kv -disco-config '{"address":"example.com:8500"}' data
+```
+Node 3:
+```bash
+rqlited -node-id $ID3 -http-addr=$HOST3:4001 -raft-addr=$HOST3:4002 \
+-disco-mode consul-kv -disco-config '{"address":"example.com:8500"}' data
+```
+
+These three nodes will automatically find each other, and cluster. You can start the nodes in any order and at any time. Furthermore, the cluster Leader will continually update Consul with its address. This means other nodes can be launched later and automatically join the cluster, even if the Leader changes. Refer to the [_Next Steps_](#next-steps) documentation below for further details on Consul configuration.
+
+#### Docker
+It's even easier with Docker, as you can launch every node almost identically:
+```bash
+docker run rqlite/rqlite -disco-mode=consul-kv -disco-config '{"address":"example.com:8500"}'
+```
+__________________________
+
+### etcd
+A third approach uses [etcd](https://www.etcd.io/) to coordinate clustering. Autoclustering with etcd is very similar to Consul. Like when you use Consul, the advantage of this approach is that you do not need to know the network addresses of all the nodes ahead of time.
+
+Let's assume etcd is available at `example.com:2379`.
+
+Node 1:
+```bash
+rqlited -node-id $ID1 -http-addr=$HOST1:4001 -raft-addr=$HOST1:4002 \
+  -disco-mode etcd-kv -disco-config '{"endpoints":["example.com:2379"]}' data
+```
+Node 2:
+```bash
+rqlited -node-id $ID2 -http-addr=$HOST2:4001 -raft-addr=$HOST2:4002 \
+  -disco-mode etcd-kv -disco-config '{"endpoints":["example.com:2379"]}' data
+```
+Node 3:
+```bash
+rqlited -node-id $ID3 -http-addr=$HOST3:4001 -raft-addr=$HOST3:4002 \
+  -disco-mode etcd-kv -disco-config '{"endpoints":["example.com:2379"]}' data
+```
+Like with Consul autoclustering, the cluster Leader will continually report its address to etcd. Refer to the [_Next Steps_](#next-steps) documentation below for further details on etcd configuration.
+
+#### Docker
+```bash
+docker run rqlite/rqlite -disco-mode=etcd-kv -disco-config '{"endpoints":["example.com:2379"]}'
+```
+
+## Next Steps
+### Customizing your configuration
+For detailed control over Discovery configuration, `-disco-config` can either be an actual JSON string, or a path to a file containing a JSON-formatted configuration. The former option may be more convenient if the configuration you need to supply is very short, as in the examples above.
+
+The examples above demonstrate simple configurations, and most real deployments will require more detailed configuration. For example, your Consul system might be reachable over HTTPS. To more fully configure rqlite for Discovery, consult the relevant configuration specification below. You must create a JSON-formatted configuration which matches that described in the source code.
+
+- [Full Consul configuration description](https://github.com/rqlite/rqlite-disco-clients/blob/main/consul/config.go)
+- [Full etcd configuration description](https://github.com/rqlite/rqlite-disco-clients/blob/main/etcd/config.go)
+- [Full DNS configuration description](https://github.com/rqlite/rqlite-disco-clients/blob/main/dns/config.go)
+- [Full DNS SRV configuration description](https://github.com/rqlite/rqlite-disco-clients/blob/main/dnssrv/config.go)
+
+#### Running multiple different clusters
+If you wish a single Consul or etcd key-value system to support multiple rqlite clusters, then set the `-disco-key` command line argument to a different value for each cluster. To run multiple rqlite clusters with DNS, use a different domain name per cluster.
+
+## Design
+When using Automatic Bootstrapping, each node notifies all other nodes of its existence. The first node to have a record of enough nodes (set by `-bootstrap-expect`) forms the cluster. Only one node can bootstrap the cluster; any other node that attempts to do so later will fail, and instead become a Follower in the new cluster.
+
+When using either Consul or etcd for automatic clustering, rqlite uses the key-value store of each system. In each case the Leader atomically sets its HTTP URL, allowing other nodes to discover it. To prevent multiple nodes updating the Leader key at once, nodes use a check-and-set operation, only updating the Leader key if its value has not changed since it was last read by the node. See [this blog post](https://www.philipotoole.com/rqlite-7-0-designing-node-discovery-and-automatic-clustering/) for more details on the design.
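+
+As an illustration of the check-and-set idea, Consul's key-value API supports a `?cas` parameter, which applies a write only if the key's modify index is unchanged. A sketch of such an update is shown below -- the key name, index, and URL are illustrative, and rqlite manages all of this internally:
+```bash
+# Update the Leader key only if its modify index is still 42; otherwise the write is rejected.
+curl -XPUT 'http://example.com:8500/v1/kv/rqlite/leader?cas=42' -d '{"url":"http://host1:4001"}'
+```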
+
+For DNS-based discovery, the rqlite nodes simply resolve the hostname, and use the returned network addresses, once the number of returned addresses is at least as great as the `-bootstrap-expect` value. Clustering then proceeds as though the network addresses were passed at the command line via `-join`.
diff --git a/DOC/BACKUPS.md b/DOC/BACKUPS.md
new file mode 100644
index 00000000..721530c2
--- /dev/null
+++ b/DOC/BACKUPS.md
@@ -0,0 +1,34 @@
+# Backups
+
+rqlite supports hot backing up a node. You can retrieve and write a copy of the underlying SQLite database to a file via the CLI:
+```
+127.0.0.1:4001> .backup bak.sqlite3
+backup file written successfully
+```
+This command will write the SQLite database file to `bak.sqlite3`.
+
+You can also access the rqlite API directly, via an HTTP `GET` request to the endpoint `/db/backup`. For example, using `curl`, and assuming the node is listening on `localhost:4001`, you could retrieve a backup as follows:
+```bash
+curl -s -XGET localhost:4001/db/backup -o bak.sqlite3
+```
+Note that if the node is not the Leader, the node will transparently forward the request to the Leader, wait for the backup data from the Leader, and return it to the client. If, instead, you want a backup of the SQLite database of the actual node that receives the request, add `noleader` to the URL as a query parameter.
+
+If you do not wish a Follower to transparently forward a backup request to the Leader, add `redirect` to the URL as a query parameter. In that case if a Follower receives a backup request the Follower will respond with [HTTP 301 Moved Permanently](https://en.wikipedia.org/wiki/HTTP_301) and include the address of the Leader as the `Location` header in the response. It is then up to the clients to re-issue the command to the Leader.
+
+In either case the generated file can then be used to restore a node (or cluster) using the [restore API](https://github.com/rqlite/rqlite/blob/master/DOC/RESTORE_FROM_SQLITE.md).
+
+## Generating a SQL text dump
+You can dump the database in SQL text format via the CLI as follows:
+```
+127.0.0.1:4001> .dump bak.sql
+SQL text file written successfully
+```
+The API can also be accessed directly:
+```bash
+curl -s -XGET localhost:4001/db/backup?fmt=sql -o bak.sql
+```
+
+## Backup isolation level
+The isolation offered by binary backups is `READ COMMITTED`. This means that any changes made to the database by transactions that take place during the backup will be reflected in the backup once those transactions are committed, but not before.
+
+See the [SQLite documentation](https://www.sqlite.org/isolation.html) for more details.
diff --git a/DOC/BULK.md b/DOC/BULK.md
new file mode 100644
index 00000000..580e7f1e
--- /dev/null
+++ b/DOC/BULK.md
@@ -0,0 +1,58 @@
+# Bulk API
+The Bulk API allows multiple updates or queries to be executed in a single request. Both non-parameterized and parameterized requests are supported by the Bulk API. The API does not support mixing the parameterized and non-parameterized forms in a single request.
+
+A bulk update is contained within a single Raft log entry, so round-trips between nodes are kept to a minimum. This should result in much better throughput, if it is possible to use this kind of update. You can also ask rqlite to do the batching for you automatically, through the use of [_Queued Writes_](https://github.com/rqlite/rqlite/blob/master/DOC/QUEUED_WRITES.md). This relieves the client of doing any batching before transmitting a request to rqlite.
+
+## Updates
+Bulk updates are supported. To execute multiple statements in one HTTP call, simply include the statements in the JSON array:
+
+_Non-parameterized example:_
+```bash
+curl -XPOST 'localhost:4001/db/execute?pretty&timings' -H "Content-Type: application/json" -d "[
+    \"INSERT INTO foo(name) VALUES('fiona')\",
+    \"INSERT INTO foo(name) VALUES('sinead')\"
+]"
+```
+_Parameterized example:_
+```bash
+curl -XPOST 'localhost:4001/db/execute?pretty&timings' -H "Content-Type: application/json" -d '[
+    ["INSERT INTO foo(name) VALUES(?)", "fiona"],
+    ["INSERT INTO foo(name) VALUES(?)", "sinead"]
+]'
+```
+
+The response is of the form:
+
+```json
+{
+    "results": [
+        {
+            "last_insert_id": 1,
+            "rows_affected": 1,
+            "time": 0.00759015
+        },
+        {
+            "last_insert_id": 2,
+            "rows_affected": 1,
+            "time": 0.00669015
+        }
+    ],
+    "time": 0.869015
+}
+```
+### Atomicity
+Because a bulk operation is contained within a single Raft log entry, and only one Raft log entry is ever processed at one time, a bulk operation will never be interleaved with other requests.
+
+### Transaction support
+You may still wish to set the `transaction` flag when issuing a bulk update. This ensures that if any error occurs while processing the bulk update, all changes will be rolled back.
+
+## Queries
+If you want to execute more than one query per HTTP request then perform a POST, and place the queries in the body of the request as a JSON array. For example:
+
+```bash
+curl -XPOST 'localhost:4001/db/query?pretty' -H "Content-Type: application/json" -d '[
+    "SELECT * FROM foo",
+    "SELECT * FROM bar"
+]'
+```
+Parameterized statements are also supported.
diff --git a/DOC/CLI.md b/DOC/CLI.md
new file mode 100644
index 00000000..74a13f34
--- /dev/null
+++ b/DOC/CLI.md
@@ -0,0 +1,38 @@
+# Command Line Interface
+rqlite comes with a CLI, which makes it easier to interact with a rqlite system. It is installed in the same directory as the node binary `rqlited`. Since rqlite is built on SQLite, you should consult the [SQLite query language documentation](https://www.sqlite.org/lang.html) for full details on what is supported.
+
+> **⚠ WARNING: Only enter one command at a time at the CLI. Don't enter multiple commands at once, separated by `;`.**
+> While it may work, mixing reads and writes to the database in a single CLI command results in undefined behavior.
+
+An example session is shown below.
+```sh
+$ rqlite
+127.0.0.1:4001> CREATE TABLE foo (id INTEGER NOT NULL PRIMARY KEY, name TEXT)
+0 row affected (0.000362 sec)
+127.0.0.1:4001> .tables
++------+
+| name |
++------+
+| foo  |
++------+
+127.0.0.1:4001> .schema
++---------------------------------------------------------------+
+| sql                                                           |
++---------------------------------------------------------------+
+| CREATE TABLE foo (id INTEGER NOT NULL PRIMARY KEY, name TEXT) |
++---------------------------------------------------------------+
+127.0.0.1:4001> INSERT INTO foo(name) VALUES("fiona")
+1 row affected (0.000117 sec)
+127.0.0.1:4001> SELECT * FROM foo
++----+-------+
+| id | name  |
++----+-------+
+| 1  | fiona |
++----+-------+
+127.0.0.1:4001> quit
+bye~
+```
+You can connect the CLI to any node in a cluster, and it will automatically forward its requests to the Leader if needed. Pass `-h` to `rqlite` to learn more.
+
+## History
+Command history is stored and reloaded between sessions, in a hidden file in the user's home directory named `.rqlite_history`. By default 100 previous commands are stored, though the value can be explicitly set via the environment variable `RQLITE_HISTFILESIZE`.
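+
+For example, to retain more history than the default for subsequent sessions (the value shown is illustrative):
+```bash
+export RQLITE_HISTFILESIZE=250
+rqlite
+```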
diff --git a/DOC/CLUSTER_MGMT.md b/DOC/CLUSTER_MGMT.md
new file mode 100644
index 00000000..71ad0a39
--- /dev/null
+++ b/DOC/CLUSTER_MGMT.md
@@ -0,0 +1,172 @@
+# Contents
+* [General guidelines](#general-guidelines)
+* [Creating a cluster](#creating-a-cluster)
+* [Growing a cluster](#growing-a-cluster)
+* [Modifying a node's Raft network addresses](#modifying-a-nodes-raft-network-addresses)
+* [Removing or replacing a node](#removing-or-replacing-a-node)
+* [Dealing with failure](#dealing-with-failure)
+
+# General guidelines
+This document describes, in detail, how to create and manage a rqlite cluster.
+
+## Practical cluster size
+Firstly, you should understand the basic requirement for systems built on the [Raft protocol](https://raft.github.io/). For a cluster of `N` nodes in size to remain operational, at least `(N/2)+1` nodes must be up and running, and be in contact with each other. For a single-node system (N=1) then (obviously) that single node must be running. For a 3-node cluster (N=3) at least 2 nodes must be running. For N=5, at least 3 nodes should be running, and so on.
+
+Clusters of 3, 5, 7, or 9 nodes are most practical. Clusters of those sizes can tolerate failures of 1, 2, 3, and 4 nodes respectively.
+
+Clusters with a greater number of nodes start to become unwieldy, due to the number of nodes that must be contacted before a database change can take place. There is no intrinsic limit to the number of nodes comprising a cluster, but the operational overhead can increase with little benefit.
+
+### Read-only nodes
+It is possible to run larger clusters if you just need nodes [from which you only need to read](https://github.com/rqlite/rqlite/blob/master/DOC/READ_ONLY_NODES.md). When it comes to the Raft protocol, these nodes do not count towards `N`, since they do not [vote](https://raft.github.io/).
+
+## Clusters with an even-number of nodes
+There is little point running clusters with even numbers of voting (i.e. Raft) nodes. To see why this is, imagine you have one cluster of 3 nodes, and a second cluster of 4 nodes. In each case, for the cluster to reach consensus on a given change, a majority of nodes within the cluster are required to have agreed to the change.
+
+Specifically, a majority is defined as `(N/2)+1` where `N` is the number of nodes in the cluster. For a 3-node cluster a majority is 2; for a 4-node cluster a majority is 3. Therefore a 3-node cluster can tolerate the failure of a single node. However a 4-node cluster can also only tolerate a failure of a single node.
+
+A 4-node cluster is therefore no more fault-tolerant than a 3-node cluster, so running a 4-node cluster provides no advantage over a 3-node cluster. Only a 5-node cluster can tolerate the failure of 2 nodes. An analogous argument applies to 5-node vs. 6-node clusters, and so on.
+
+# Creating a cluster
+_This section describes manually creating a cluster. If you wish rqlite nodes to automatically find each other, and form a cluster, check out [auto-clustering](https://github.com/rqlite/rqlite/blob/master/DOC/AUTO_CLUSTERING.md)._
+
+Let's say you have 3 host machines, _host1_, _host2_, and _host3_, and that each of these hostnames resolves to an IP address reachable from the other hosts. Instead of hostnames you can use the IP address directly if you wish.
+
+To create a cluster you must first launch a node that can act as the initial leader. Do this as follows on _host1_:
+```bash
+host1:$ rqlited -node-id 1 -http-addr host1:4001 -raft-addr host1:4002 ~/node
+```
+With this command a single node is started, listening for API requests on port 4001 and listening on port 4002 for intra-cluster communication and cluster-join requests from other nodes. Note that the addresses passed to `-http-addr` and `-raft-addr` must be reachable from other nodes so that nodes can find each other over the network -- these addresses will be broadcast to other nodes during the _Join_ operation. If a node needs to bind to one address, but broadcast a different address, you must set `-http-adv-addr` and `-raft-adv-addr`.
+
+`-node-id` can be any string, as long as it's unique for the cluster. It also shouldn't change, once chosen for this node. The network addresses can change however. This node stores its state at `~/node`.
+
+To join a second node to this leader, execute the following command on _host2_:
+```bash
+host2:$ rqlited -node-id 2 -http-addr host2:4001 -raft-addr host2:4002 -join http://host1:4001 ~/node
+```
+_If a node receives a join request, and that node is not actually the leader of the cluster, the receiving node will automatically redirect the requesting node to the Leader node. As a result a node can actually join a cluster by contacting any node in the cluster. You can also specify multiple join addresses, and the node will try each address until joining is successful._
+
+Once executed you now have a cluster of two nodes. Of course, for fault-tolerance you need a 3-node cluster, so launch a third node like so on _host3_:
+```bash
+host3:$ rqlited -node-id 3 -http-addr host3:4001 -raft-addr host3:4002 -join http://host1:4001 ~/node
+```
+_When simply restarting a node, there is no further need to pass `-join`. However, if a node does attempt to join a cluster it is already a member of, and neither its node ID nor Raft network address has changed, then the cluster Leader will ignore the join request as there is nothing to do -- the joining node is already a fully-configured member of the cluster. However, if either the node ID or Raft network address of the joining node has changed, the cluster Leader will first automatically remove the joining node from the cluster configuration before processing the join request. For most applications this is an implementation detail which can be safely ignored, and cluster-joins are basically idempotent._
+
+You've now got a fault-tolerant, distributed, relational database. It can tolerate the failure of any node, even the leader, and remain operational.
+
+## Node IDs
+You can set the Node ID (`-node-id`) to anything you wish, as long as it's unique for each node.
+
+## Listening on all interfaces
+You can pass `0.0.0.0` to both `-http-addr` and `-raft-addr` if you wish a node to listen on all interfaces. You must still pass an explicit network address to `-join` however. In this case you'll also want to set `-http-adv-addr` and `-raft-adv-addr` to the actual interface addresses, so other nodes learn the correct network address to use to reach the node listening on `0.0.0.0`.
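+
+A sketch of such a configuration, where the node binds to all interfaces but advertises a routable address (`host1` is an illustrative hostname):
+```bash
+rqlited -node-id 1 \
+  -http-addr 0.0.0.0:4001 -http-adv-addr host1:4001 \
+  -raft-addr 0.0.0.0:4002 -raft-adv-addr host1:4002 \
+  ~/node
+```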
+
+## Through the firewall
+On some networks, like the AWS EC2 cloud, nodes may have an IP address that is not routable from outside the firewall. Instead these nodes are addressed using a different IP address. You can still form a rqlite cluster however -- check out [this tutorial](https://www.philipotoole.com/rqlite-v3-0-1-globally-replicating-sqlite/) for an example. The key thing is that you must set `-http-adv-addr` and `-raft-adv-addr` so a routable address is broadcast to other nodes.
+
+# Growing a cluster
+You can grow a cluster, at any time, simply by starting up a new node (pick a never-before-used node ID) and having it explicitly join with the leader as normal. The new node will automatically pick up all changes that have occurred on the cluster since the cluster first started. In other words, after joining successfully, the new node will have a full copy of the SQLite database, just like every other node in the cluster.
+
+# Modifying a node's Raft network addresses
+It is possible to change a node's Raft address between restarts. Simply pass the new address on the command line. **You must also, however, explicitly tell the node to join the cluster again, by passing `-join` to the node**. In this case what the leader actually does is remove the previous record of the node, before adding a new record of the node. You can also change the HTTP API address of a node between restarts, but an explicit re-join is not required if just the HTTP API address changes.
+
+**Note that this process only works if your cluster has, in addition to the node with the changing address, a quorum (at least) of nodes up and running**. If your cluster does not meet this requirement, see the section titled _Dealing with failure_.
+
+# Removing or replacing a node
+If a node fails completely and is not coming back, or if you shut down a node because you wish to deprovision it, its record should also be removed from the cluster. To remove the record of a node from a cluster, execute the following command at the rqlite CLI:
+
+```
+127.0.0.1:4001> .remove <node ID>
+```
+
+You can also make a direct call to the HTTP API to remove a node:
+
+```
+curl -XDELETE http://host:4001/remove -d '{"id": "<node ID>"}'
+```
+where `host` is any node in the cluster. If you do not remove a failed node the Leader will continually attempt to communicate with that node. **Note that the cluster must be functional -- there must still be an operational Leader -- for this removal to be successful**. If, after a node failure, a given cluster does not have a quorum of nodes still running, you must bring back the failed node. Any attempt to remove it will fail as there will be no Leader to respond to the node-removal request.
+
+If you cannot bring sufficient nodes back online such that the cluster can elect a leader, follow the instructions in the section titled _Dealing with failure_.
+
+## Automatically removing failed nodes
+> :warning: **This functionality was introduced in version 7.11.0. It does not exist in earlier releases.**
+
+rqlite supports automatically removing both voting (the default type) and non-voting (read-only) nodes that have been non-reachable for a configurable period of time. A non-reachable node is defined as a node that the Leader cannot heartbeat with. To enable reaping of voting nodes set `-raft-reap-node-timeout` to a non-zero time interval. Likewise, to enable reaping of non-voting (read-only) nodes set `-raft-reap-read-only-node-timeout`.
+
+It is recommended that these values be set conservatively, especially for voting nodes. Setting them too low may mean you don't account for the normal kinds of network outages and temporary failures that can affect distributed systems such as rqlite. Note that the timeout clock is reset if a cluster elects a new Leader.
+
+### Example configuration
+Instruct rqlite to reap non-reachable voting nodes after 2 days, and non-reachable read-only nodes after 30 minutes:
+```bash
+rqlited -node-id 1 -raft-reap-node-timeout=48h -raft-reap-read-only-node-timeout=30m data
+```
+For reaping to work consistently you **must** set these flags on **every** voting node in the cluster -- in other words, every node that could potentially become the Leader. You can also set the flags on read-only nodes, but they will simply be silently ignored.
+
+# Dealing with failure
+It is the nature of clustered systems that nodes can fail at any time. Depending on the size of your cluster, it will tolerate various amounts of failure. With a 3-node cluster, it can tolerate the failure of a single node, including the leader.
+
+If a rqlite process crashes, it is safe to simply restart it. The node will pick up any changes that happened on the cluster while it was down.
+
+## Recovering a cluster that has permanently lost quorum
+_This section borrows heavily from the Consul documentation._
+
+In the event that multiple rqlite nodes are lost, causing a loss of quorum and a complete outage, partial recovery is possible using data on the remaining nodes in the cluster. There may be data loss in this situation because multiple servers were lost, so information about what's committed could be incomplete. The recovery process implicitly commits all outstanding Raft log entries, so it's also possible to commit data -- and therefore change the SQLite database -- that was uncommitted before the failure.
+
+**You may also need to follow the recovery process if a cluster simply restarts, but all nodes (or a quorum of nodes) come up with different network identifiers. This can happen in certain deployment configurations.**
+
+To begin, stop all remaining nodes. You can attempt a graceful node-removal, but it will not work in most cases. Do not worry if the remove operation results in an error. The cluster is in an unhealthy state, so this is expected.
+
+The next step is to go to the _data_ directory of each rqlite node you wish to bring back up. Inside that directory, there will be a `raft/` sub-directory. You need to create a `peers.json` file within that directory, which will contain the desired configuration of your recovered rqlite cluster (which may be smaller than the original cluster, perhaps even just a single recovered node). This file should be formatted as a JSON array containing the node ID, `address:port`, and suffrage information of each rqlite node in the cluster.
+
+Below is an example of bringing a 3-node cluster back online.
+
+```json
+[
+  {
+    "id": "1",
+    "address": "10.1.0.1:8300",
+    "non_voter": false
+  },
+  {
+    "id": "2",
+    "address": "10.1.0.2:8300",
+    "non_voter": false
+  },
+  {
+    "id": "3",
+    "address": "10.1.0.3:8300",
+    "non_voter": false
+  }
+]
+```
+
+`id` specifies the node ID of the server, which must not be changed from its previous value. The ID for a given node can be found in the logs when the node starts up if it was auto-generated. `address` specifies the desired Raft IP and port for the node, which does not need to be the same as previously. You can use hostnames instead of IP addresses if you prefer. `non_voter` controls whether the server is a read-only node. If omitted, it will default to false, which is typical for most rqlite nodes.
+
+Next simply create entries for all the nodes you plan to bring up (in the example above that's 3 nodes). You must confirm that nodes you don't include here have indeed failed and will not later rejoin the cluster. Ensure that this file is the same across all remaining rqlite nodes. At this point, you can restart your rqlite cluster. In the example above, this means you'd start 3 nodes.
+
+Once recovery is completed, the `peers.json` file is renamed to `peers.info`. `peers.info` will not trigger further recoveries, and simply acts as a record for future reference. It may be deleted at any time.
+
+# Example Cluster Sizes
+_Quorum is defined as `(N/2)+1` where `N` is the size of the cluster._
+
+## 2-node cluster
+Quorum of a 2-node cluster is 2.
+
+If 1 node fails, quorum can no longer be reached. The failing node must be recovered, as the failed node cannot be removed, and a new node cannot be added to the cluster to take its place. This is why you shouldn't run 2-node clusters, except for testing purposes. In general it doesn't make much sense to run clusters with an even number of nodes at all.
+
+If you remove a single node from a fully-functional 2-node cluster, quorum will be reduced to 1 since you will be left with a 1-node cluster.
+
+## 3-node cluster
+Quorum of a 3-node cluster is 2.
+
+If 1 node fails, the cluster can still reach quorum. Remove the failing node, or restart it. If you remove the node, quorum remains at 2. You should add a new node to get the cluster back to 3 nodes in size. If 2 nodes fail, the cluster will not be able to reach quorum. You must instead restart at least one of the nodes.
+
+If you remove a single node from a fully-functional 3-node cluster, quorum will be unchanged since you now have a 2-node cluster.
+
+## 4-node cluster
+Quorum of a 4-node cluster is 3.
+
+The situation is similar to a 3-node cluster, in the sense that it can only tolerate the failure of a single node. If you remove a single node from a fully-functional 4-node cluster, quorum will decrease to 2, since you now have a 3-node cluster.
+
+## 5-node cluster
+Quorum of a 5-node cluster is 3.
+
+With a 5-node cluster, the cluster can tolerate the failure of 2 nodes. However if 3 nodes fail, at least one of those nodes must be restarted before you can make any change. If you remove a single node from a fully-functional 5-node cluster, quorum will be unchanged since you now have a 4-node cluster.
diff --git a/DOC/CONSISTENCY.md b/DOC/CONSISTENCY.md
new file mode 100644
index 00000000..ff405c9c
--- /dev/null
+++ b/DOC/CONSISTENCY.md
@@ -0,0 +1,57 @@
+# Read Consistency
+_rqlite has been run through Jepsen-style testing. You can read about it [here](https://github.com/wildarch/jepsen.rqlite/blob/main/doc/blog.md)._
+
+Even though serving queries does not require Raft consensus (because the database is not changed), [queries should generally be served by the Leader](https://github.com/rqlite/rqlite/issues/5). Why is this? Because, without this check, queries on a node could return results that are out-of-date, i.e. _stale_. This could happen for one, or both, of the following two reasons:
+
+ * The node, while still part of the cluster, has fallen behind the Leader in terms of updates to its underlying database.
+ * The node is no longer part of the cluster, and has stopped receiving Raft log updates.
+
+This is why rqlite offers selectable read consistency levels of _none_, _weak_, and _strong_. Each is explained below, and examples of each are shown at the end of this document.
+
+## None
+With _none_, the node simply queries its local SQLite database, and does not care if it's a Leader or Follower. This offers the fastest query response, but suffers from the potential issues listed above.
+
+### Limiting read staleness
+You can tell the node not to return results (effectively) older than a certain time, however. If a read request sets the query parameter `freshness` to a [Go duration string](https://golang.org/pkg/time/#Duration), the node serving the read will check that less time has passed since it was last in contact with the Leader than that specified via `freshness`. If more time has passed the node will return an error. `freshness` is ignored for all consistency levels except `none`, and is also ignored if set to zero.
+
+> :warning: **The `freshness` parameter is always ignored if the node serving the query is the Leader**. Any read, when served by the leader, is always going to be within any possible freshness bound.
+
+If you decide to deploy [read-only nodes](https://github.com/rqlite/rqlite/blob/master/DOC/READ_ONLY_NODES.md) however, _none_ combined with `freshness` can be particularly effective at adding read scalability to your system. You can use lots of read-only nodes, yet be sure that a given node serving a request has not fallen too far behind the Leader (or even become disconnected from the cluster).
+
+## Weak
+If a query request is sent to a Follower, and _weak_ consistency is specified, the Follower will transparently forward the request to the Leader. The Follower waits for the response from the Leader, and then returns that response to the client.
+
+_Weak_ instructs the Leader to check that it is the Leader, before querying the local SQLite file. Checking Leader state only involves checking state local to the Leader, so is still very fast. There is, however, a very small window of time (milliseconds by default) during which the node may return stale data. This is because after the local Leader check, but before the local SQLite database is read, another node could be elected Leader and make changes to the cluster. As a result the node may not be quite up-to-date with the rest of the cluster.
+
+## Strong
+If a query request is sent to a Follower, and _strong_ consistency is specified, the Follower will transparently forward the request to the Leader. The Follower waits for the response from the Leader, and then returns that response to the client.
+
+To avoid even the issues associated with _weak_ consistency, rqlite also offers _strong_. In this mode, the Leader sends the query through the Raft consensus system, ensuring that the Leader **remains** the Leader at all times during query processing. When using _strong_ you can be sure that the database reflects every change sent to it prior to the query. However, this will involve the Leader contacting at least a quorum of nodes, and will therefore increase query response times.
+
+# Which should I use?
+_Weak_ is probably sufficient for most applications, and is the default read consistency level. Unless the Leader of your cluster is continually changing there will be no difference between _weak_ and _strong_ -- but using _strong_ will result in more Raft traffic, which is not what most people want.
+
+To explicitly select consistency, set the query parameter `level` to the desired level. However, you should use _none_ with read-only nodes, unless you want those nodes to actually forward the query to the Leader.
+
+## Example queries
+Examples of enabling each read consistency level for a simple query are shown below.
+
+```bash
+# Query the node, telling it simply to read the SQLite database directly. No guarantees on how old the data is.
+# In fact the node may not even be connected to the cluster.
+curl -G 'localhost:4001/db/query?level=none' --data-urlencode 'q=SELECT * FROM foo'
+
+# Query the node, telling it simply to read the SQLite database directly. The read request will be successful
+# only if the node last heard from the leader no more than 1 second ago.
+curl -G 'localhost:4001/db/query?level=none&freshness=1s' --data-urlencode 'q=SELECT * FROM foo'
+
+# Weak consistency, which is the default level. The read request will be successful only if the node believes
+# it is the leader.
+curl -G 'localhost:4001/db/query?level=weak' --data-urlencode 'q=SELECT * FROM foo'
+
+# Default query options -- same as weak. The read request will be successful only if the node believes it is
+# the leader.
+curl -G 'localhost:4001/db/query' --data-urlencode 'q=SELECT * FROM foo'
+
+# The read request will be successful only if the node maintained cluster leadership during
+# the entirety of query processing.
+curl -G 'localhost:4001/db/query?level=strong' --data-urlencode 'q=SELECT * FROM foo'
+```
diff --git a/DOC/DATA_API.md b/DOC/DATA_API.md
new file mode 100644
index 00000000..5076a90a
--- /dev/null
+++ b/DOC/DATA_API.md
@@ -0,0 +1,230 @@
+# Data API
+
+Each rqlite node exposes an HTTP API allowing data to be inserted into, and read back from, the database. Any changes to the database (`INSERT`, `UPDATE`, `DELETE`) **must** be sent to the `/db/execute` endpoint, and reads (`SELECT`) should be sent to the `/db/query` endpoint. _It is important to use the correct endpoint for the operation you wish to perform._
+
+The best way to understand the API is to work through the simple examples below. There are also [client libraries available](https://github.com/rqlite).
+
+## Writing Data
+To write data successfully to the database, you must create at least 1 table. To do this perform an HTTP POST on the `/db/execute` endpoint on any rqlite node. Encapsulate the `CREATE TABLE` SQL command in a JSON array, and put it in the body of the request. An example via [curl](https://curl.haxx.se/):
+
+```bash
+curl -XPOST 'localhost:4001/db/execute?pretty&timings' -H "Content-Type: application/json" -d '[
+    "CREATE TABLE foo (id INTEGER NOT NULL PRIMARY KEY, name TEXT, age INTEGER)"
+]'
+```
+
+To insert an entry into the database, execute a second SQL command:
+
+```bash
+curl -XPOST 'localhost:4001/db/execute?pretty&timings' -H "Content-Type: application/json" -d '[
+    "INSERT INTO foo(name, age) VALUES(\"fiona\", 20)"
+]'
+```
+
+The response is of the form:
+
+```json
+{
+    "results": [
+        {
+            "last_insert_id": 1,
+            "rows_affected": 1,
+            "time": 0.00886
+        }
+    ],
+    "time": 0.0152
+}
+```
+
+The use of the URL param `pretty` is optional, and results in pretty-printed JSON responses. Time is measured in seconds. If you do not want timings, do not pass `timings` as a URL parameter.
+
+## Querying Data
+Querying data is easy. For a single query simply perform an HTTP GET on the `/db/query` endpoint, setting the query statement as the query parameter `q`:
+
+```bash
+curl -G 'localhost:4001/db/query?pretty&timings' --data-urlencode 'q=SELECT * FROM foo'
+```
+
+The **default** response is of the form:
+
+```json
+{
+    "results": [
+        {
+            "columns": [
+                "id",
+                "name",
+                "age"
+            ],
+            "types": [
+                "integer",
+                "text",
+                "integer"
+            ],
+            "values": [
+                [
+                    1,
+                    "fiona",
+                    20
+                ]
+            ],
+            "time": 0.0150043
+        }
+    ],
+    "time": 0.0220043
+}
+```
+
+You can also query via an HTTP POST request:
+```bash
+curl -XPOST 'localhost:4001/db/query?pretty&timings' -H "Content-Type: application/json" -d '[
+    "SELECT * FROM foo"
+]'
+```
+The response will be in the same form as when the query is made via HTTP GET.
+
+### Associative response form
+An alternative form of response can be requested, by adding `associative` as a query parameter:
+```bash
+curl -G 'localhost:4001/db/query?pretty&timings&associative' --data-urlencode 'q=SELECT * FROM foo'
+```
+Response:
+```json
+{
+    "results": [
+        {
+            "types": {"id": "integer", "age": "integer", "name": "text"},
+            "rows": [
+                { "id": 1, "age": 20, "name": "fiona"},
+                { "id": 2, "age": 25, "name": "declan"}
+            ],
+            "time": 0.000173061
+        }
+    ],
+    "time": 0.000185964
+}
+```
+This form will have a map per row returned, with each column name as a key. This form can be more convenient for clients, depending on the application.
+
+## Parameterized Statements
+While the "raw" API described above can be convenient and simple to use, it is vulnerable to [SQL Injection attacks](https://owasp.org/www-community/attacks/SQL_Injection). To protect against this issue, rqlite also supports [SQLite parameterized statements](https://www.sqlite.org/lang_expr.html#varparam), for both reads and writes. To use this feature, send the SQL statement and values as distinct elements within a new JSON array, as follows:
+
+_Writing data_
+```bash
+curl -XPOST 'localhost:4001/db/execute?pretty&timings' -H "Content-Type: application/json" -d '[
+    ["INSERT INTO foo(name, age) VALUES(?, ?)", "fiona", 20]
+]'
+```
+_Reading data_
+```bash
+curl -XPOST 'localhost:4001/db/query?pretty&timings' -H "Content-Type: application/json" -d '[
+    ["SELECT * FROM foo WHERE name=?", "fiona"]
+]'
+```
+
+### Named Parameters
+Named parameters are also supported. To use this feature set the values using a dictionary like so:
+
+_Writing data_
+```bash
+curl -XPOST 'localhost:4001/db/execute?pretty&timings' -H "Content-Type: application/json" -d '[
+    ["INSERT INTO foo(name, age) VALUES(:name, :age)", {"name": "fiona", "age": 20}]
+]'
+```
+_Reading data_
+```bash
+curl -XPOST 'localhost:4001/db/query?pretty&timings' -H "Content-Type: application/json" -d '[
+    ["SELECT * FROM foo WHERE name=:name", {"name": "fiona"}]
+]'
+```
+
+## Transactions
+A **form** of transactions is supported. To execute statements within a transaction, add `transaction` to the URL. An example of the above operation executed within a transaction is shown below.
+
+```bash
+curl -XPOST 'localhost:4001/db/execute?pretty&transaction' -H "Content-Type: application/json" -d "[
+    \"INSERT INTO foo(name) VALUES('fiona')\",
+    \"INSERT INTO foo(name) VALUES('sinead')\"
+]"
+```
+
+When a transaction takes place either both statements will succeed, or neither. Performance is *much, much* better if multiple SQL INSERTs or UPDATEs are executed via a transaction. Note that processing of the request ceases the moment any single query results in an error.
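+
+For example, the request below reuses the deliberately invalid statement from the _Handling Errors_ section below; because `transaction` is set, the change made by the first `INSERT` will be rolled back when the second statement fails:
+```bash
+curl -XPOST 'localhost:4001/db/execute?pretty&transaction' -H "Content-Type: application/json" -d "[
+    \"INSERT INTO foo(name) VALUES('fiona')\",
+    \"INSERT INTO nonsense\"
+]"
+```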
+
+The behavior of rqlite if you explicitly issue `BEGIN`, `COMMIT`, `ROLLBACK`, `SAVEPOINT`, and `RELEASE` to control your own transactions is **not defined**. This is because the behavior of a cluster that fails while such a manually-controlled transaction is in progress is not defined. It is important to control transactions only through the query parameters shown above.
+
+## Handling Errors
+If an error occurs while processing a request, it will be indicated via the presence of an `error` key in the JSON response. For example:
+
+```bash
+curl -XPOST 'localhost:4001/db/execute?pretty&timings' -H "Content-Type: application/json" -d "[
+    \"INSERT INTO nonsense\"
+]"
+```
+```json
+{
+    "results": [
+        {
+            "error": "near \"nonsense\": syntax error"
+        }
+    ],
+    "time": 2.478862
+}
+```
+
+## Queued Writes API
+Queued Writes can provide an order-of-magnitude speed-up in write-performance. You can learn about the Queued Writes API [here](https://github.com/rqlite/rqlite/blob/master/DOC/QUEUED_WRITES.md).
+
+## Bulk API
+You can learn about the Bulk write API [here](https://github.com/rqlite/rqlite/blob/master/DOC/BULK.md).
+
+## How rqlite Handles Requests
+_This section assumes a basic familiarity with the Raft protocol. A simple introduction to Raft can be found [here](http://thesecretlivesofdata.com/raft/)._
+
+To make the very best use of the rqlite API, there are some important details to know. But understanding the following details is **not required** to make use of rqlite.
+
+With any rqlite cluster, all write-requests must be serviced by the cluster Leader -- this is due to the way the Raft consensus protocol works. If a client sends a write request to a Follower (or read-only, non-voting, node), the Follower transparently forwards the request to the Leader. The Follower waits for the response from the Leader, and returns it to the client. Any credential information included in the original HTTP request to the Follower is included with the forwarded request (assuming that permission checking also passes first on the Follower), and permission checking is performed on the Leader.
+
+Queries, by default, are also serviced by the cluster Leader. Like write-requests, Followers will, by default, transparently forward queries to the Leader, and respond to the client after receiving the response from the Leader. However, depending on the [read-consistency](https://github.com/rqlite/rqlite/blob/master/DOC/CONSISTENCY.md) specified with the request, if a Follower received the query request it may serve that request directly and not contact the Leader. Which read-consistency level makes sense depends on your application.
+
+### Data and the Raft log
+Any writes to the SQLite database go through the Raft log, ensuring only changes committed by a quorum of rqlite nodes are actually applied to the SQLite database. Queries do not __necessarily__ go through the Raft log, however, since they do not change the state of the database, and therefore do not need to be captured in the log. Only if _Strong_ read consistency is requested does a query go through the Raft log.
+
+### Request Forwarding Timeouts
+If a Follower forwards a request to a Leader, by default the Leader must respond within 30 seconds. You can control this timeout by setting the `timeout` parameter. For example, to set a 2-minute timeout, you would issue the following request:
+```bash
+curl -XPOST 'localhost:4001/db/execute?timeout=2m' -H "Content-Type: application/json" -d '[
+    ["INSERT INTO foo(name, age) VALUES(?, ?)", "fiona", 20]
+]'
+```
+
+### Disabling Request Forwarding
+If you do not wish a Follower to transparently forward a request to a Leader, add `redirect` to the URL as a query parameter. In that case if a Follower receives a request that can only be serviced by the Leader, the Follower will respond with [HTTP 301 Moved Permanently](https://en.wikipedia.org/wiki/HTTP_301) and include the address of the Leader as the `Location` header in the response. It is then up to the clients to re-issue the command to the Leader.
+
+This option was made available as it provides maximum visibility to the clients, should they prefer it. For example, if a Follower transparently forwarded a request to the Leader, and one of the nodes then crashed during processing, it may be difficult for the client to determine where in the chain of nodes the processing failed.
+
+### Example of redirect on query
+```
+$ curl -v -G 'localhost:4003/db/query?pretty&timings&redirect' --data-urlencode 'q=SELECT * FROM foo'
+*   Trying ::1...
+* connect to ::1 port 4003 failed: Connection refused
+*   Trying 127.0.0.1...
+* Connected to localhost (127.0.0.1) port 4003 (#0)
+> GET /db/query?pretty&timings&q=SELECT%20%2A%20FROM%20foo HTTP/1.1
+> Host: localhost:4003
+> User-Agent: curl/7.43.0
+> Accept: */*
+>
+< HTTP/1.1 301 Moved Permanently
+< Content-Type: application/json; charset=utf-8
+< Location: http://localhost:4001/db/query?pretty&timings&q=SELECT%20%2A%20FROM%20foo
+< X-Rqlite-Version: 4
+< Date: Mon, 07 Aug 2017 21:10:59 GMT
+< Content-Length: 116
+<
+Moved Permanently.
+
+* Connection #0 to host localhost left intact
+```
diff --git a/DOC/DESIGN.md b/DOC/DESIGN.md
new file mode 100644
index 00000000..03ee6a6a
--- /dev/null
+++ b/DOC/DESIGN.md
@@ -0,0 +1,22 @@
+# rqlite
+You can find details on the design and implementation of rqlite in [these blog posts](https://www.philipotoole.com/tag/rqlite/) (in particular [this post](https://www.philipotoole.com/replicating-sqlite-using-raft-consensus/) and [this post](https://www.philipotoole.com/rqlite-replicated-sqlite-with-new-raft-consensus-and-api/)).
+
+## Design Presentations
+- [Presentation](https://docs.google.com/presentation/d/1E0MpQbUA6JOP2GjA60CNN0ER8fia0TP6kdJ41U9Jdy4/edit#slide=id.p) given to Hacker Nights NYC, March 2022.
+- [Presentation](https://www.philipotoole.com/2021-rqlite-cmu-tech-talk) given to the [Carnegie Mellon Database Group](https://db.cs.cmu.edu/), [September 2021](https://db.cs.cmu.edu/events/vaccination-2021-rqlite-the-distributed-database-built-on-raft-and-sqlite-philip-otoole/). There is also a [video recording](https://www.youtube.com/watch?v=JLlIAWjvHxM) of the talk.
+- [Presentation](https://docs.google.com/presentation/d/1lSNrZJUbAGD-ZsfD8B6_VPLVjq5zb7SlJMzDblq2yzU/edit?usp=sharing) given to the University of Pittsburgh, April 2018.
+- [Presentation](https://www.slideshare.net/PhilipOToole/rqlite-replicating-sqlite-via-raft-consensu) given at the [GoSF](https://www.meetup.com/golangsf/) [April 2016](https://www.meetup.com/golangsf/events/230127735/) Meetup.
+
+## Node design
+The diagram below shows a high-level view of a rqlite node.
+
+![node-design](https://user-images.githubusercontent.com/536312/133258366-1f2fbc50-8493-4ba6-8d62-04c57e39eb6f.png)
+
+## File system
+### Raft
+The Raft layer always creates a file -- it creates the _Raft log_. This log stores the set of committed SQLite commands, in the order in which they were executed. This log is the authoritative record of every change that has happened to the system. It may also contain some read-only queries as entries, depending on read-consistency choices. Since every node in a rqlite cluster applies the entries in the log in exactly the same way, this guarantees that the SQLite database is the same on every node.
+
+### SQLite
+By default, the SQLite layer doesn't create a file. Instead, it creates the database in memory. rqlite can create the SQLite database on disk, if so configured, by passing `-on-disk` to `rqlited` at startup. Regardless of whether rqlite creates a database entirely in memory, or on disk, the SQLite database is completely recreated every time `rqlited` starts, using the information stored in the Raft log.
+
+## Log Compaction and Truncation
+rqlite automatically performs log compaction, so that disk usage due to the log remains bounded. After a configurable number of changes rqlite snapshots the SQLite database, and truncates the Raft log. This is a technical feature of the Raft consensus system, and most users of rqlite need not be concerned with this.
diff --git a/DOC/DIAGNOSTICS.md b/DOC/DIAGNOSTICS.md
new file mode 100644
index 00000000..b38a3f54
--- /dev/null
+++ b/DOC/DIAGNOSTICS.md
@@ -0,0 +1,2 @@
+# Monitoring rqlite
+Check out the [monitoring guide](https://rqlite.io/docs/guides/monitoring-rqlite/).
diff --git a/DOC/FAQ.md b/DOC/FAQ.md
new file mode 100644
index 00000000..1c0a7730
--- /dev/null
+++ b/DOC/FAQ.md
@@ -0,0 +1,130 @@
+# FAQ
+
+* [What exactly does rqlite do?](#what-exactly-does-rqlite-do)
+* [Why would I use this, versus some other distributed database?](#why-would-i-use-this-versus-some-other-distributed-database)
+* [How do I access the database?](#how-do-i-access-the-database)
+* [How do I monitor rqlite?](#how-do-i-monitor-rqlite)
+* [Is it a drop-in replacement for SQLite?](#is-it-a-drop-in-replacement-for-sqlite)
+* [How do I deploy rqlite on Kubernetes?](#how-do-i-deploy-rqlite-on-kubernetes)
+* [Can any node execute a write request, and have the system "synchronize it all"?](#can-any-node-execute-a-write-request-and-have-the-system-synchronize-it-all)
+* [Can I send a read request to any node in the cluster?](#can-i-send-a-read-request-to-any-node-in-the-cluster)
+* [rqlite is distributed. Does that mean it can increase SQLite performance?](#rqlite-is-distributed-does-that-mean-it-can-increase-sqlite-performance)
+* [What is the best way to increase rqlite performance?](#what-is-the-best-way-to-increase-rqlite-performance)
+* [Where does rqlite fit into the CAP theorem?](#where-does-rqlite-fit-into-the-cap-theorem)
+* [Does rqlite require consensus be reached before a write is accepted?](#does-rqlite-require-consensus-be-reached-before-a-write-is-accepted)
+* [How does a client detect a cluster partition?](#how-does-a-client-detect-a-cluster-partition)
+* [Can I run a single node?](#can-i-run-a-single-node)
+* [What is the maximum size of a cluster?](#what-is-the-maximum-size-of-a-cluster)
+* [Is rqlite a good match for a network of nodes that come and go -- perhaps thousands of them?](#is-rqlite-a-good-match-for-a-network-of-nodes-that-come-and-go----perhaps-thousands-of-them)
+* [Can I use rqlite to broadcast changes to lots of other nodes -- perhaps hundreds -- as long as those nodes don't write data?](#can-i-use-rqlite-to-broadcast-changes-to-lots-of-other-nodes----perhaps-hundreds----as-long-as-those-nodes-dont-write-data)
+* [What if read-only nodes -- or clients accessing read-only nodes -- want to write data?](#what-if-read-only-nodes----or-clients-accessing-read-only-nodes----want-to-write-data)
+* [Does rqlite support transactions?](#does-rqlite-support-transactions)
+* [Can I modify the SQLite file directly?](#can-i-modify-the-sqlite-file-directly)
+* [Can I read the SQLite file directly?](#can-i-read-the-sqlite-file-directly)
+* [Can I use rqlite to replicate my SQLite database to a second node?](#can-i-use-rqlite-to-replicate-my-sqlite-database-to-a-second-node)
+* [Is the underlying serializable isolation level of SQLite maintained?](#is-the-underlying-serializable-isolation-level-of-sqlite-maintained)
+* [Do concurrent writes block each other?](#do-concurrent-writes-block-each-other)
+* [Do concurrent reads block each other?](#do-concurrent-reads-block-each-other)
+* [How is it different than dqlite?](#how-is-it-different-than-dqlite)
+* [How is it different than Litestream?](#how-is-it-different-than-litestream)
+
+## What exactly does rqlite do?
+rqlite is about replicating a set of data, which has been written to it using SQL. The data is replicated for fault tolerance because your data is so important that you want multiple copies distributed in different places, you want to be able to query your data even if some machines fail, or both. These different places could be different machines on a rack, or different machines, each in different buildings, or even different machines, [each on different continents](https://www.philipotoole.com/rqlite-v3-0-1-globally-replicating-sqlite/).
+
+On top of that, rqlite provides strong guarantees about what state any copy of that data is in, with respect to a special node called the _leader_. That is where Raft comes in. It prevents divergent copies of the data, and ensures there is an "authoritative" copy of that data at all times.
+
+## Why would I use this, versus some other distributed database?
+
+## How do I monitor rqlite?
+Check out the [monitoring documentation](https://github.com/rqlite/rqlite/blob/master/DOC/DIAGNOSTICS.md).
+
+## Is it a drop-in replacement for SQLite?
+No. While it does use SQLite as its storage engine, you must only write to the database via the [HTTP API](https://github.com/rqlite/rqlite/blob/master/DOC/DATA_API.md). That said, since it basically exposes SQLite, all the power of that database is available. Any system built on top of SQLite may need only small changes to work with rqlite.
+
+## How do I deploy rqlite on Kubernetes?
+Check out the [Kubernetes deployment guide](https://github.com/rqlite/rqlite/blob/master/DOC/KUBERNETES.md).
+
+## Can any node execute a write request, and have the system "synchronize it all"?
+The first thing to understand is that you can send your write-request to any node in the cluster, and rqlite will do the right thing automatically. You do not need to direct your write requests specifically to the Leader node.
+
+Under the covers, however, only the Leader can make changes to the database. If a client sends a write-request to a node and that node is not the Leader, the node will transparently forward the request to the Leader, wait for a response, and then return the response to the client. If the node receiving the write cannot contact the Leader, the write will fail and an error will be returned to the client.
+
+## Can I send a read request to any node in the cluster?
+Yes. If a read request must be serviced by the Leader, however, rqlite will transparently forward the request to the Leader, wait for the Leader to handle it, and return the results to the client. If the node receiving the read cannot contact the Leader, the read will fail and an error will be returned to the client.
+
+Some reads, depending on the requested [_read consistency_](https://github.com/rqlite/rqlite/blob/master/DOC/CONSISTENCY.md), do not need to be serviced by the Leader, and in that case the node can service the read regardless of whether it can contact the Leader or not.
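+
+For example, a read at consistency level `none` is served directly by whichever node receives it, without involving the Leader -- a sketch, assuming a table named `foo`:
+```bash
+# Served locally by the receiving node, trading freshness for speed.
+curl -G 'localhost:4001/db/query?level=none' --data-urlencode 'q=SELECT * FROM foo'
+```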
+
+## rqlite is distributed. Does that mean it can increase SQLite performance?
+Yes, but only for reads. It does not provide any scaling for writes, since all writes must go through the Leader. **rqlite is distributed primarily for replication and fault tolerance, not for performance**. In fact write performance is reduced relative to a standalone SQLite database, because of the round-trips between nodes and the need to write to the Raft log.
+
+## What is the best way to increase rqlite performance?
+The simplest way to increase performance is to use higher-performance disks and a lower-latency network. This is known as _scaling vertically_. You could also consider using [Queued Writes](https://github.com/rqlite/rqlite/blob/master/DOC/QUEUED_WRITES.md), or [Bulk Updates](https://github.com/rqlite/rqlite/blob/master/DOC/BULK.md), if you wish to improve write performance specifically.
+
+## Where does rqlite fit into the CAP theorem?
+The [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem) states that it is impossible for a distributed database to provide consistency, availability, and partition tolerance simultaneously -- that, in the face of a network partition, the database can be available or consistent, but not both.
+
+Raft is a CP protocol, meaning it provides Consistency and Partition tolerance. This means that if an rqlite cluster is partitioned, only the side of the cluster that contains a majority of the nodes will be available. The other side of the cluster will not respond to writes. However the side that remains available will return consistent results, and when the partition is healed, consistent results will continue to be returned.
+
+## Does rqlite require consensus be reached before a write is accepted?
+Yes, this is an intrinsic part of the Raft protocol. How long it takes to reach [consensus](https://computersciencewiki.org/index.php/Distributed_consensus) depends primarily on your network. It takes two round trips from the Leader to a quorum of nodes, though each of those nodes is contacted in parallel.
+
+There is one exception, however, when rqlite does not wait for consensus. You can use [Queued Writes](https://github.com/rqlite/rqlite/blob/master/DOC/QUEUED_WRITES.md), which trades off durability for performance.
+
+## How does a client detect a cluster partition?
+If the client is on the same side of the partition as a quorum of nodes, there will be no real problem, and any writes should succeed. However if the client is on the other side of the partition, one of two things will happen. The client may be redirected to the Leader, but will then (presumably) fail to contact the Leader due to the partition, and experience a timeout. Alternatively the client may receive a `no leader` error.
+
+It may be possible to make partitions clearer to clients in a future release.
+
+## Can I run a single node?
+Sure. Many people do so, as they like accessing a SQLite database over HTTP. Of course, you won't have any redundancy or fault tolerance if you only run a single node.
+
+## What is the maximum size of a cluster?
+There is no explicit maximum cluster size. However the [practical cluster size limit is about 9 _voting nodes_](https://github.com/rqlite/rqlite/blob/master/DOC/CLUSTER_MGMT.md). You can go bigger by adding [read-only nodes](https://github.com/rqlite/rqlite/blob/master/DOC/READ_ONLY_NODES.md).
+
+## Is rqlite a good match for a network of nodes that come and go -- perhaps thousands of them?
+Unlikely. While rqlite does support read-only nodes, allowing it to scale to many nodes, the consensus protocol at the core of rqlite works best when the **voting** nodes in the cluster don't continually come and go. While it won't break, it probably won't be practical.
+
+However if the nodes that come and go only need to stay up-to-date with changes, and serve read requests, it might work. Learn about [read-only nodes](https://github.com/rqlite/rqlite/blob/master/DOC/READ_ONLY_NODES.md).
+
+## Can I use rqlite to broadcast changes to lots of other nodes -- perhaps hundreds -- as long as those nodes don't write data?
+Yes, try out [read-only nodes](https://github.com/rqlite/rqlite/blob/master/DOC/READ_ONLY_NODES.md) -- a launch sketch is shown below.
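+
+A minimal sketch, assuming an existing cluster node at `$HOST1` (see the read-only nodes documentation for the details):
+```bash
+# Join the cluster as a non-voting, read-only node.
+rqlited -node-id 4 -raft-non-voter=true -join http://$HOST1:4001 data
+```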
+
+## What if read-only nodes -- or clients accessing read-only nodes -- want to write data?
+Then they must do it by sending write requests to the Leader node. But if they can reach the Leader node, it is an effective way for one node at the edge to send a message to all other nodes (well, at least other nodes that are connected to the cluster at that time).
+
+## Does rqlite support transactions?
+It supports [a form of transactions](https://github.com/rqlite/rqlite/blob/master/DOC/DATA_API.md#transactions). You can wrap a bulk update in a transaction such that all the statements in the bulk request will succeed, or none of them will. However the behaviour of rqlite is undefined if you send explicit `BEGIN`, `COMMIT`, or `ROLLBACK` statements. This is not because they won't work -- they will -- but if your node (or cluster) fails while a transaction is in progress, the system may be left in a hard-to-use state. So until rqlite can offer strict guarantees about its behaviour if it fails during a transaction, using `BEGIN`, `COMMIT`, and `ROLLBACK` is officially unsupported. Unfortunately this does mean that rqlite may not be suitable for some applications.
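+
+For example, a bulk update can be wrapped in a transaction via the `transaction` query parameter -- a sketch, assuming a table named `foo`:
+```bash
+# Either both INSERTs are applied, or neither is.
+curl -XPOST 'localhost:4001/db/execute?transaction' -H "Content-Type: application/json" -d '[
+    "INSERT INTO foo(name) VALUES(\"fiona\")",
+    "INSERT INTO foo(name) VALUES(\"sinead\")"
+]'
+```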
+
+## Can I modify the SQLite file directly?
+No, you must only change the database using the HTTP API. The moment you directly modify the SQLite file under any node (if running in _on-disk_ mode) the behavior of rqlite is undefined. In other words, you run the risk of breaking your cluster.
+
+## Can I read the SQLite file directly?
+If you run a node in _on-disk_ mode, you can read the SQLite file directly. However this mode of operation has not been deeply tested.
+
+## Can I use rqlite to replicate my SQLite database to a second node?
+Not in a simple sense, no. rqlite is not a SQLite database replication tool. While each node does have a full copy of the SQLite database, rqlite is not simply about replicating that database.
+
+## Is the underlying serializable isolation level of SQLite maintained?
+Yes, it is.
+
+## Do concurrent writes block each other?
+In this regard rqlite currently offers exactly the same semantics as SQLite. Each HTTP write request uses the same SQLite connection on the Leader, so one write-over-HTTP may block another, due to the nature of SQLite.
+
+## Do concurrent reads block each other?
+No, a read does not block other reads, nor does a read block a write.
+
+## How is it different than dqlite?
+dqlite is a library, written in C, that you need to integrate with your own software. That requires programming. rqlite is a standalone application -- it's a full [RDBMS](https://techterms.com/definition/rdbms) (albeit a relatively simple one). rqlite has everything you need to read and write data, and backup, maintain, and monitor the database itself.
+
+rqlite and dqlite are completely separate projects, and rqlite does not use dqlite. In fact, rqlite was created before dqlite.
+
+## How is it different than Litestream?
+[Litestream](https://github.com/benbjohnson/litestream) adds reliability to a system using SQLite by periodically backing-up the SQLite database to something like AWS S3. If you lose the node running your SQLite database, you must restore it from your backup. Litestream does this in a very elegant way, and doesn't change how applications interact with SQLite. There is also a very small chance of data loss in the event of node failure, but Litestream allows you to make trade-offs that suit your application.
+
+rqlite, in contrast, adds reliability **and** high-availability via clustering. This means that any application talking to rqlite shouldn't notice if a node fails, because other nodes in the cluster automatically take over. This offers significantly more protection against any data loss. rqlite also offers very strict guarantees about the state of the SQLite database, relative to the Leader. rqlite is not a drop-in replacement for SQLite, however.
diff --git a/DOC/FOREIGN_KEY_CONSTRAINTS.md b/DOC/FOREIGN_KEY_CONSTRAINTS.md
new file mode 100644
index 00000000..d32aee2e
--- /dev/null
+++ b/DOC/FOREIGN_KEY_CONSTRAINTS.md
@@ -0,0 +1,5 @@
+# Foreign Key Constraints
+
+Since SQLite does not enforce foreign key constraints by default, neither does rqlite. However you can enable foreign key constraints in rqlite via the command-line option `-fk=true`, as sketched below. Setting this option will enable foreign key constraints on all connections that rqlite makes to the underlying SQLite database.
+
+Issuing the `PRAGMA foreign_keys = boolean` command usually results in unpredictable behaviour, since rqlite doesn't offer connection-level control of the underlying SQLite database. It is not recommended.
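+
+For example -- a sketch, assuming the default options otherwise:
+```bash
+# Enforce foreign key constraints on every connection rqlite makes to SQLite.
+rqlited -fk=true data
+```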
diff --git a/DOC/KUBERNETES.md b/DOC/KUBERNETES.md
new file mode 100644
index 00000000..8536d76c
--- /dev/null
+++ b/DOC/KUBERNETES.md
@@ -0,0 +1,2 @@
+# Running rqlite on Kubernetes
+See the [official Kubernetes guide for rqlite](https://rqlite.io/docs/guides/kubernetes/).
diff --git a/DOC/NON_DETERMINISTIC_FUNCTIONS.md b/DOC/NON_DETERMINISTIC_FUNCTIONS.md
new file mode 100644
index 00000000..ca507792
--- /dev/null
+++ b/DOC/NON_DETERMINISTIC_FUNCTIONS.md
@@ -0,0 +1,66 @@
+# rqlite and Non-deterministic Functions
+
+## Contents
+* [Understanding the problem](#understanding-the-problem)
+* [How rqlite solves this problem](#how-rqlite-solves-this-problem)
+* [What does rqlite rewrite?](#what-does-rqlite-rewrite)
+  * [RANDOM()](#random)
+  * [Date and time functions](#date-and-time-functions)
+* [Credits](#credits)
+
+## Understanding the problem
+rqlite performs _statement-based replication_. This means that every SQL statement is usually stored in the Raft log exactly in the form it was received. Each rqlite node then reads the Raft log and applies the SQL statements it finds there to its own local copy of SQLite.
+
+But if a SQL statement contains a [non-deterministic function](https://www.sqlite.org/deterministic.html), this type of replication can result in different SQLite data under each node -- which is not meant to happen. For example, the following statement could result in a different SQLite database under each node:
+```
+INSERT INTO foo (n) VALUES(random());
+```
+This is because `RANDOM()` is evaluated by each node independently, and `RANDOM()` will almost certainly return a different value on each node.
+
+## How rqlite solves this problem
+An rqlite node addresses this issue by _rewriting_ received SQL statements that contain certain non-deterministic functions, evaluating the non-deterministic part, before writing the statement to the Raft log. The rewritten statement is then applied to the SQLite database as usual.
+
+## What does rqlite rewrite?
+
+### `RANDOM()`
+> :warning: **This functionality was introduced in version 7.7.0. It does not exist in earlier releases.**
+
+A SQL statement containing `RANDOM()` is rewritten if all of the following are true:
+- the statement is part of a write-request (i.e. the request is sent to the `/db/execute` HTTP API), or the statement is part of a read-request (i.e. the request is sent to the `/db/query` HTTP API) **and** the read-request is made with _strong_ read consistency.
+- `RANDOM()` is not used as an `ORDER BY` qualifier.
+- the HTTP request containing the SQL statement does not have the query parameter `norwrandom` present.
+
+`RANDOM()` is replaced with a random integer between -9223372036854775808 and +9223372036854775807 by the rqlite node that first receives the SQL statement.
+
+#### Examples
+```bash
+# Will be rewritten
+curl -XPOST 'localhost:4001/db/execute' -H "Content-Type: application/json" -d '[
+    "INSERT INTO foo(id, age) VALUES(1234, RANDOM())"
+]'
+
+# RANDOM() rewriting explicitly disabled at request-time
+curl -XPOST 'localhost:4001/db/execute?norwrandom' -H "Content-Type: application/json" -d '[
+    "INSERT INTO foo(id, age) VALUES(1234, RANDOM())"
+]'
+
+# Not rewritten
+curl -G 'localhost:4001/db/query' --data-urlencode 'q=SELECT * FROM foo WHERE id = RANDOM()'
+
+# Rewritten
+curl -G 'localhost:4001/db/query?level=strong' --data-urlencode 'q=SELECT * FROM foo WHERE id = RANDOM()'
+```
+
+### Date and time functions
+rqlite does not yet rewrite [SQLite date and time functions](https://www.sqlite.org/lang_datefunc.html) that are non-deterministic in nature, but will in an upcoming release. An example of a non-deterministic time function is
+
+`INSERT INTO datetime_text (d1, d2) VALUES(datetime('now'),datetime('now', 'localtime'))`
+
+Using such functions will result in undefined behavior. Date and time functions that use absolute values will work without issue.
+
+#### CURRENT_TIME*
+Using `CURRENT_TIMESTAMP`, `CURRENT_TIME`, and `CURRENT_DATE` can also be problematic, depending on your use case.
+
+## Credits
+Many thanks to [Ben Johnson](https://github.com/benbjohnson) who wrote the SQLite parser used by rqlite.
diff --git a/DOC/PERFORMANCE.md b/DOC/PERFORMANCE.md
new file mode 100644
index 00000000..d04fc8b8
--- /dev/null
+++ b/DOC/PERFORMANCE.md
@@ -0,0 +1,66 @@
+# Understanding Performance
+
+rqlite replicates SQLite for fault-tolerance. It does not replicate it for performance. In fact performance is reduced relative to a standalone SQLite database, due to the nature of distributed systems. _There is no such thing as a free lunch_.
+
+rqlite performance -- defined as the number of database updates performed in a given period of time -- is primarily determined by two factors:
+- Disk performance
+- Network latency
+
+Depending on your machine (particularly its IO performance) and network, individual INSERT performance could be anything from 10 operations per second to more than 200 operations per second.
+
+## Disk
+Disk performance is the single biggest determinant of rqlite performance. This is because every change to the system must go through the Raft subsystem, and the Raft subsystem calls `fsync()` after every write to its log. Raft does this to ensure that the change is safely persisted in permanent storage before applying it to the SQLite database. This is why rqlite runs with an in-memory database by default, as using an on-disk SQLite database would put even more load on the disk, reducing the disk throughput available to Raft.
+
+## Network
+When running an rqlite cluster, network latency is also a factor. This is because Raft must communicate with a quorum of nodes **twice** before a change is committed to the Raft log. Obviously the faster your network, the shorter the time to contact each node.
+
+# Improving Performance
+
+There are a few ways to improve performance, but not all will be suitable for a given application.
+
+## Batching
+The more SQLite statements you can include in a single request to an rqlite node, the better the system will perform.
+
+By using the [bulk API](https://github.com/rqlite/rqlite/blob/master/DOC/BULK.md), transactions, or both, throughput will increase significantly, often by 2 orders of magnitude. This speed-up is due to the way Raft and SQLite work. So for high throughput, execute as many operations as possible within a single transaction.
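+
+For example, the following sketch sends three statements in a single bulk request (assuming a table named `foo`), so the statements travel through Raft together rather than as three separate consensus rounds:
+```bash
+# One HTTP request, one pass through Raft -- much faster than three separate requests.
+curl -XPOST 'localhost:4001/db/execute' -H "Content-Type: application/json" -d '[
+    "INSERT INTO foo(name) VALUES(\"fiona\")",
+    "INSERT INTO foo(name) VALUES(\"sinead\")",
+    "INSERT INTO foo(name) VALUES(\"declan\")"
+]'
+```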
+
+## Queued Writes
+If you can tolerate a small risk of some data loss in the event that a node crashes, you could consider using the [Queued Writes API](https://github.com/rqlite/rqlite/blob/master/DOC/QUEUED_WRITES.md). Using Queued Writes can easily give you orders of magnitude improvement in performance.
+
+## Use more powerful hardware
+Obviously running rqlite on better disks, better networks, or both, will improve performance.
+
+## Use a memory-backed filesystem
+It is possible to run rqlite entirely on top of a memory-backed file system. This means that **both** the Raft log and SQLite database would be stored in memory only. For example, on Linux you can create a memory-based filesystem like so:
+```bash
+mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
+```
+**This comes with risks, however**. The behavior of rqlite when a node fails, but committed entries in the Raft log have not actually been permanently persisted, **is not defined**. But if your policy is to completely deprovision your rqlite node, or rqlite cluster, in the event of any node failure, this option may be of interest to you. Perhaps you always rebuild your rqlite cluster from a different source of data, so can recover an rqlite cluster regardless of its state. Testing shows that using rqlite with a memory-only file system can result in a 100x improvement in performance.
+
+## Improving read-write concurrency
+SQLite can offer better concurrent read and write support when using an on-disk database, compared to in-memory databases. But as explained above, using an on-disk SQLite database can significantly impact performance. And since database-update performance will be so much better with an in-memory database, improving read-write concurrency may not be needed in practice.
+
+However, if you enable an on-disk SQLite database, but then place the SQLite database on a memory-backed file system, you can have the best of both worlds. You can dedicate your disk to the Raft log, but still get better read-write concurrency with SQLite. You can specify the SQLite database file path via the `-on-disk-path` flag.
+
+An alternative approach would be to place the SQLite on-disk database on a different disk than the one storing the Raft log, but this is unlikely to be as performant as an in-memory file system for the SQLite database.
+
+# In-memory Database Limits
+
+> :warning: **rqlite was not designed for very large datasets**: While there are no hardcoded limits in the rqlite software, the nature of Raft means that the entire SQLite database is periodically copied to disk, and occasionally copied, in full, between nodes. Your hardware may not be able to process those large data operations successfully. You should test your system carefully when working with multi-GB databases.
+
+In-memory SQLite databases (the default configuration) are currently limited to 2GiB in size. One way to get around this limit is to use an on-disk database, by passing `-on-disk` to `rqlited`. But this could impact write-performance significantly, since disk is slower than memory. If you switch to on-disk SQLite, and find write-performance suffers, there are a couple of ways to address this. One option is to place the Raft log on one disk, and the SQLite database on a different disk.
+
+Another option is to run rqlite in on-disk mode but place the SQLite database file on a memory-backed filesystem. That way you can use larger databases, and still have performance similar to running with an in-memory SQLite database.
+
+In either case, to control where rqlite places the SQLite database file, set `-on-disk-startup` and `-on-disk-path` when launching `rqlited`. **Note that you should still place the `data` directory on an actual disk, so that the Raft log is always on a physical disk, to ensure your data is not lost if a node restarts.**
+
+## Linux example
+An example of running rqlite with a SQLite file on a memory-backed file system, and keeping the data directory on persistent disk, is shown below. The data directory is where the Raft log is stored. The example below would allow up to a 4GB SQLite database.
+```bash
+# Create a RAM disk, and then launch rqlite, telling it to put the SQLite database on the RAM disk.
+mount -t tmpfs -o size=4096m tmpfs /mnt/ramdisk
+rqlited -on-disk -on-disk-startup -on-disk-path /mnt/ramdisk/node1/db.sqlite /path_to_persistent_disk/data
+```
+where `/path_to_persistent_disk/data` is a directory path on your persistent disk.
+
+Setting `-on-disk-startup` is also important because it disables an optimization rqlite performs at startup when using an on-disk SQLite database. By default, rqlite initially builds any on-disk database in memory first, before moving it to disk. It does this to reduce startup times. But with databases larger than 2GiB, this optimization can cause rqlite to fail to start. Setting the flag avoids this issue, though your startup times may be noticeably longer.
+
diff --git a/DOC/PRAGMA.md b/DOC/PRAGMA.md
new file mode 100644
index 00000000..0aeb684c
--- /dev/null
+++ b/DOC/PRAGMA.md
@@ -0,0 +1,62 @@
+# PRAGMA Directives
+
+You can issue [`PRAGMA`](https://www.sqlite.org/pragma.html) directives to rqlite, and they will be passed to the underlying SQLite database. Certain `PRAGMA` directives, which alter the operation of the SQLite database, may not make sense in the context of rqlite (since rqlite does not give you direct control over its connections to the SQLite database). Furthermore some `PRAGMA` directives may even break rqlite. If you have questions about a specific `PRAGMA` directive, the [rqlite Google Group](https://groups.google.com/group/rqlite) is a good place to start the discussion.
+
+`PRAGMA` directives which just return information about the SQLite database, without changing its operation, are always safe.
+
+## Issuing a `PRAGMA` directive
+The rqlite CLI supports issuing `PRAGMA` directives. For example:
+```
+127.0.0.1:4001> pragma compile_options
++----------------------------+
+| compile_options            |
++----------------------------+
+| COMPILER=gcc-7.5.0         |
++----------------------------+
+| DEFAULT_WAL_SYNCHRONOUS=1  |
++----------------------------+
+| ENABLE_DBSTAT_VTAB         |
++----------------------------+
+| ENABLE_FTS3                |
++----------------------------+
+| ENABLE_FTS3_PARENTHESIS    |
++----------------------------+
+| ENABLE_JSON1               |
++----------------------------+
+| ENABLE_RTREE               |
++----------------------------+
+| ENABLE_UPDATE_DELETE_LIMIT |
++----------------------------+
+| OMIT_DEPRECATED            |
++----------------------------+
+| OMIT_SHARED_CACHE          |
++----------------------------+
+| SYSTEM_MALLOC              |
++----------------------------+
+| THREADSAFE=1               |
++----------------------------+
+```
+
+`PRAGMA` directives may also be issued via the `/db/execute` or `/db/query` endpoints. For example:
+```
+$ curl -G 'localhost:4001/db/query?pretty&timings' --data-urlencode 'q=PRAGMA foreign_keys'
+{
+    "results": [
+        {
+            "columns": [
+                "foreign_keys"
+            ],
+            "types": [
+                ""
+            ],
+            "values": [
+                [
+                    0
+                ]
+            ],
+            "time": 0.000070499
+        }
+    ],
+    "time": 0.000540857
+}
+$
+```
diff --git a/DOC/QUEUED_WRITES.md b/DOC/QUEUED_WRITES.md
new file mode 100644
index 00000000..e6cac601
--- /dev/null
+++ b/DOC/QUEUED_WRITES.md
@@ -0,0 +1,52 @@
+# Queued Writes
+> :warning: **This functionality was introduced in release 7.5.0. It does not exist in earlier releases.**
+
+## Usage
+
+rqlite exposes a special API flag, which will instruct rqlite to queue up write-requests and execute them asynchronously. This allows clients to send multiple distinct requests to an rqlite node, and have rqlite automatically do the batching and bulk insert for the client, without the client doing any extra work. The net result is as if the client had written a single bulk request containing all the queued statements. For the same reason that using the [Bulk API](https://github.com/rqlite/rqlite/blob/master/DOC/BULK.md) results in much higher write performance, **using the _Queued Writes_ API will also result in much higher write performance**.
+
+This functionality is best illustrated by an example, showing two requests being queued.
+```bash
+$ curl -XPOST 'localhost:4001/db/execute?queue' -H "Content-Type: application/json" -d '[
+    ["INSERT INTO foo(name) VALUES(?)", "fiona"],
+    ["INSERT INTO foo(name) VALUES(?)", "sinead"]
+]'
+{
+    "results": [],
+    "sequence_number": 1653314298877648934
+}
+$ curl -XPOST 'localhost:4001/db/execute?queue' -H "Content-Type: application/json" -d '[
+    ["INSERT INTO foo(name) VALUES(?)", "declan"]
+]'
+{
+    "results": [],
+    "sequence_number": 1653314298877649973
+}
+$
+```
+Setting the URL query parameter `queue` enables queuing mode, adding the request data to an internal queue which rqlite manages for you.
+
+rqlite will merge the requests, once a batch-size of them has been queued on the node or a configurable timeout expires, and execute them as though they had all been contained in a single request.
+
+Each response includes a monotonically-increasing `sequence_number`, which allows you to track when this request is actually persisted to the Raft log. The `/status` [diagnostics](https://github.com/rqlite/rqlite/blob/master/DOC/DIAGNOSTICS.md) endpoint includes the sequence number of the latest request successfully written to Raft.
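+
+For example -- a sketch -- you can retrieve the node's diagnostics and compare the sequence number reported there with the `sequence_number` returned by your request:
+```bash
+# Node diagnostics include the sequence number of the latest queued request persisted to Raft.
+curl -G 'localhost:4001/status?pretty'
+```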
+
+### Waiting for a queue to flush
+You can explicitly tell the request to wait until the queue has persisted all pending requests. To do this, add the parameter `wait` to the request like so:
+```bash
+$ curl -XPOST 'localhost:4001/db/execute?queue&wait&timeout=10s' -H "Content-Type: application/json" -d '[
+    ["INSERT INTO foo(name) VALUES(?)", "bob"]
+]'
+```
+This example also shows setting a timeout. If the queue has not emptied after this time, the request will return with an error. If not set, the timeout defaults to 30 seconds.
+
+### Configuring queue behaviour
+The behaviour of the queue rqlite uses to batch the requests is configurable at rqlite launch time. You can change the minimum number of requests that must be present in the queue before they are written, as well as the timeout after which whatever is in the queue will be written regardless of queue size. Pass `-h` to `rqlited` to see the queue defaults, and to list all command-line options.
+
+## Caveats
+As with most databases, there is a trade-off to be made between write-performance and durability, but for some applications these trade-offs are worth it.
+
+Because the API returns immediately after queuing the requests **but before the data is committed to the Raft log**, there is a small risk of data loss in the event the node crashes before queued data is persisted. You can make this window arbitrarily small by adjusting the queuing parameters, at the cost of write performance.
+
+In addition, when the API returns `HTTP 200 OK`, that simply acknowledges that the data has been queued correctly. It does not indicate that the SQL statements will actually be applied successfully to the database. Be sure to check the node's logs and diagnostics if you have any concerns about failed queued writes.
+
+By default, writes from the queue are not performed within a transaction. If you wish to change this, it must be set at launch time via the command-line option `-write-queue-tx`.
diff --git a/DOC/README.md b/DOC/README.md
index 67eb7cb3..b76631a6 100644
--- a/DOC/README.md
+++ b/DOC/README.md
@@ -1,2 +1,4 @@
 # rqlite documentation
-Visit [rqlite.io](https://rqlite.io) for the latest API documentation, Developer guides, and specialized instructions for cluster management, Kubernetes deployment, and design.
+These docs are no longer updated, and remain here for legacy reference.
+
+Visit [rqlite.io](https://rqlite.io) for the latest documentation.
diff --git a/DOC/READ_ONLY_NODES.md b/DOC/READ_ONLY_NODES.md
new file mode 100644
index 00000000..b27866fe
--- /dev/null
+++ b/DOC/READ_ONLY_NODES.md
@@ -0,0 +1,18 @@
+# Read-only nodes
+rqlite supports adding _read-only_ nodes. You can use this feature to add read scalability to the cluster if you need a high volume of reads, or want to distribute copies of the data nearer to clients -- but don't want those nodes counted towards the quorum. These types of nodes are also known as _non-voting_ nodes.
+
+What this means is that a read-only node doesn't participate in the Raft consensus system i.e. it doesn't contribute towards quorum, nor does it cast a vote during the leader election process. Just like voting nodes, however, read-only nodes still subscribe to the stream of committed log entries broadcast by the Leader, and update the SQLite database using the log entries they receive from the Leader.
+
+## Querying a read-only node
+Any read request to a read-only node must specify [read-consistency](https://github.com/rqlite/rqlite/blob/master/DOC/CONSISTENCY.md) level `none`. If any other consistency level is specified (or no level is explicitly specified) the read-only node will redirect the request to the Leader.
+
+To ensure a read-only node hasn't become completely disconnected from the cluster, you will probably want to set the [`freshness` query parameter](https://github.com/rqlite/rqlite/blob/master/DOC/CONSISTENCY.md#limiting-read-staleness), as sketched below.
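+
+A sketch, assuming a table named `foo` -- the read below is served locally by the read-only node, but fails if the node hasn't heard from the Leader within the last second:
+```bash
+# `none` consistency with bounded staleness.
+curl -G 'localhost:4001/db/query?level=none&freshness=1s' --data-urlencode 'q=SELECT * FROM foo'
+```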
+
+## Enabling read-only mode
+Pass `-raft-non-voter=true` to `rqlited` to enable read-only mode.
+
+## Read-only node management
+Read-only nodes join a cluster in the [same manner as a voting node. They can also be removed using the same operations](https://github.com/rqlite/rqlite/blob/master/DOC/CLUSTER_MGMT.md).
+
+### Handling failure
+If a read-only node becomes unreachable, the Leader will continually attempt to reconnect until the node becomes reachable again, or the node is removed from the cluster. This is exactly the same behaviour as when a voting node fails. However, since read-only nodes do not vote, a failed read-only node will not prevent the cluster committing changes via the Raft consensus mechanism.
diff --git a/DOC/RESTORE_FROM_SQLITE.md b/DOC/RESTORE_FROM_SQLITE.md
new file mode 100644
index 00000000..e593da38
--- /dev/null
+++ b/DOC/RESTORE_FROM_SQLITE.md
@@ -0,0 +1,70 @@
+# Restoring directly from SQLite
+
+rqlite supports loading a node directly from two sources, either of which can be used to initialize your system from preexisting SQLite data, or to restore from an existing [node backup](https://github.com/rqlite/rqlite/blob/master/DOC/BACKUPS.md):
+- An actual SQLite database file. This is the fastest way to initialize an rqlite node from an existing SQLite database, and is the recommended approach. Even large SQLite databases can be loaded into rqlite in a matter of seconds. In addition, any preexisting SQLite database is completely overwritten by this type of load operation, so it's safe to perform regardless of any data already loaded into your rqlite cluster. Finally, this type of load request can be sent to any node. The receiving node will transparently forward the request to the Leader as needed, and return the response of the Leader to the client. If you would prefer to be explicitly redirected to the Leader, add `redirect` as a URL query parameter.
+- A SQLite dump in text format. This is another convenient way to initialize a system from an existing SQLite database (or another database). In contrast to loading an actual SQLite file, the behavior of this type of load operation is **undefined** if there is already data loaded into your rqlite cluster. **Note that if your source database is large, the operation can be quite slow.** If you find the restore times to be too long, you should first load the SQL statements directly into a SQLite database, and then restore rqlite using the resulting SQLite database file.
+
+## Examples
+The following examples show a trivial database being generated by `sqlite3`, the SQLite file being backed up, converted to the corresponding list of SQL commands, and then loaded into an rqlite node listening on localhost, using each form.
+
+### HTTP
+_Be sure to set the Content-type header as shown, depending on the format of the upload._
+
+```bash
+~ $ sqlite3 restore.sqlite
+SQLite version 3.14.1 2016-08-11 18:53:32
+Enter ".help" for usage hints.
+sqlite> CREATE TABLE foo (id integer not null primary key, name text);
+sqlite> INSERT INTO "foo" VALUES(1,'fiona');
+sqlite> .exit
+
+# Load directly from the SQLite file, which is the recommended process.
+~ $ curl -v -XPOST localhost:4001/db/load -H "Content-type: application/octet-stream" --data-binary @restore.sqlite
+
+# Convert the SQLite database file to a set of SQL statements, and then load
+~ $ echo '.dump' | sqlite3 restore.sqlite > restore.dump
+~ $ curl -XPOST localhost:4001/db/load -H "Content-type: text/plain" --data-binary @restore.dump
+```
+
+After either command, we can connect to the node, and check that the data has been loaded correctly.
+```bash
+$ rqlite
+127.0.0.1:4001> SELECT * FROM foo
++----+-------+
+| id | name  |
++----+-------+
+| 1  | fiona |
++----+-------+
+```
+
+### rqlite CLI
+The CLI supports loading from a SQLite database file or a SQL text file, and will automatically detect the type of data being used for the restore operation. Below is an example of loading from the former.
+```
+~ $ sqlite3 mydb.sqlite
+SQLite version 3.22.0 2018-01-22 18:45:57
+Enter ".help" for usage hints.
+sqlite> CREATE TABLE foo (id integer not null primary key, name text);
+sqlite> INSERT INTO "foo" VALUES(1,'fiona');
+sqlite> .exit
+~ $ ./rqlite
+Welcome to the rqlite CLI. Enter ".help" for usage hints.
+127.0.0.1:4001> .schema
++-----+
+| sql |
++-----+
+127.0.0.1:4001> .restore mydb.sqlite
+database restored successfully
+127.0.0.1:4001> select * from foo
++----+-------+
+| id | name  |
++----+-------+
+| 1  | fiona |
++----+-------+
+```
+
+## Caveats
+Note that SQLite dump files normally contain a command to disable foreign key constraints. If you are running with foreign key constraints enabled, and wish them to remain enabled after the load, this is the one time you should explicitly re-enable those constraints, via the following `curl` command:
+```bash
+curl -XPOST 'localhost:4001/db/execute?pretty' -H "Content-Type: application/json" -d '[
+    "PRAGMA foreign_keys = 1"
+]'
+```
diff --git a/DOC/SECURITY.md b/DOC/SECURITY.md
new file mode 100644
index 00000000..03d3d670
--- /dev/null
+++ b/DOC/SECURITY.md
@@ -0,0 +1,102 @@
+# Securing rqlite
+rqlite can be secured in various ways, and with different levels of control.
+
+## File system security
+You should control access to the data directory that each rqlite node uses. There is no reason for any user to directly access this directory.
+
+You are also responsible for securing access to the SQLite database files if you enable "on disk" mode (which is not the default mode). There is no reason for any user to directly access any SQLite file, and doing so may cause rqlite to work incorrectly. If you don't need to access a SQLite database file, then don't enable "on disk" mode. This will maximize file-level security.
+
+## Network security
+Each rqlite node listens on 2 TCP ports -- one for the HTTP API, and the other for intra-cluster communications. Only the API port need be reachable from outside the cluster.
+
+So, if possible, configure the network such that the Raft port on each node is only accessible from other nodes in the cluster. There is no need for the Raft port to be accessible by rqlite clients.
+
+If the IP addresses (or subnets) of rqlite clients are also known, it may also be possible to limit access to the HTTP API to those addresses only.
+
+AWS EC2 [Security Groups](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html), for example, support all this functionality. So if running rqlite in the AWS EC2 cloud you can implement this level of security at the network level.
+
+## HTTPS API
+rqlite supports HTTPS access, ensuring that all communication between clients and a cluster is encrypted.
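+
+A sketch of launching a single node with HTTPS enabled, assuming a certificate and key at `cert.pem` and `key.pem` (generation is covered next):
+```bash
+# Serve the HTTP API over HTTPS.
+rqlited -http-cert cert.pem -http-key key.pem data
+```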
+
+### Generating a certificate and private key
+One way to generate the necessary resources is via [openssl](https://www.openssl.org/):
+```
+openssl req -x509 -nodes -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365
+```
+Note that unless you sign the certificate using a trusted authority, you will need to pass `-http-no-verify` to `rqlited`.
+
+## Node-to-node encryption
+rqlite supports encryption of all inter-node traffic. To enable this, pass `-node-encrypt` to `rqlited`. Each node must also be supplied with the relevant SSL certificate and corresponding private key, in X.509 format. Note that every node in a cluster must operate with encryption enabled, or none at all.
+
+You can generate private keys and associated certificates in a similar manner as described in the _HTTPS API_ section.
+
+## Basic Auth
+The HTTP API supports [Basic Auth](https://tools.ietf.org/html/rfc2617). Each rqlite node can be passed a JSON-formatted configuration file, which configures valid usernames and associated passwords for that node. The password string can be in cleartext or [bcrypt hashed](https://en.wikipedia.org/wiki/Bcrypt).
+
+Since the configuration file only controls the node local to it, it's important to ensure the configuration is correct on each node.
+
+### User-level permissions
+rqlite, via the configuration file, also supports user-level permissions. Each user can be granted one or more of the following permissions:
+- _all_: user can perform all operations on a node.
+- _execute_: user may access the execute endpoint.
+- _query_: user may access the query endpoint.
+- _load_: user may load a SQLite dump file into a node.
+- _backup_: user may perform backups.
+- _status_: user can retrieve node status and Go runtime information.
+- _ready_: user can retrieve node readiness.
+- _join_: user can join a cluster. In practice only a node joins a cluster, so it's the joining node that must supply the credentials.
+- _join-read-only_: user can join a cluster, but only as a read-only node.
+- _remove_: user can remove a node from a cluster.
+
+### Example configuration file
+An example configuration file is shown below.
+```json
+[
+  {
+    "username": "bob",
+    "password": "secret1",
+    "perms": ["all"]
+  },
+  {
+    "username": "mary",
+    "password": "$2a$10$fKRHxrEuyDTP6tXIiDycr.nyC8Q7UMIfc31YMyXHDLgRDyhLK3VFS",
+    "perms": ["query", "backup", "join"]
+  },
+  {
+    "username": "*",
+    "perms": ["status", "ready", "join-read-only"]
+  }
+]
+```
+This configuration file sets authentication for three usernames, _bob_, _mary_, and `*`. It sets a password for the first two.
+
+This configuration also sets permissions for all usernames. _bob_ has permission to perform all operations, but _mary_ can only query the cluster, as well as back up and join the cluster. `*` is a special username, which indicates that all users -- even anonymous users (requests without any BasicAuth information) -- have permission to check the cluster status and readiness. All users can also join as a read-only node. This can be useful if you wish to leave certain operations open to all users.
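+
+With this configuration in place, clients must supply credentials with each request. For example -- a sketch, assuming the node is up and serving HTTPS -- user _bob_ could create a table like so:
+```bash
+# BasicAuth credentials can be supplied in the URL.
+curl -XPOST 'https://bob:secret1@localhost:4001/db/execute' -H "Content-Type: application/json" -d '[
+    "CREATE TABLE foo (id INTEGER NOT NULL PRIMARY KEY, name TEXT)"
+]'
+```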
+
+## Secure cluster example
+Starting a node with HTTPS enabled, node-to-node encryption, and the above configuration file. It is assumed the HTTPS X.509 certificate and key are at the paths `server.crt` and `key.pem` respectively, and the node-to-node certificate and key are at `node.crt` and `node-key.pem`.
+```bash
+rqlited -auth config.json -http-cert server.crt -http-key key.pem \
+-node-encrypt -node-cert node.crt -node-key node-key.pem -node-no-verify \
+~/node.1
+```
+Bringing up a second node on the same host, and joining it to the first node. Requiring credentials on join allows you to block nodes from joining a cluster unless those nodes supply a password.
+```bash
+rqlited -auth config.json -http-addr localhost:4003 -http-cert server.crt \
+-http-key key.pem -raft-addr :4004 -join https://bob:secret1@localhost:4001 \
+-node-encrypt -node-cert node.crt -node-key node-key.pem -node-no-verify \
+~/node.2
+```
+Querying the node, as user _mary_.
+```bash
+curl -G 'https://mary:secret2@localhost:4001/db/query?pretty&timings' \
+--data-urlencode 'q=SELECT * FROM foo'
+```
+
+### Avoiding passwords at the command line
+The above example suffers from one shortcoming -- the password for user _bob_ is entered at the command line. This is not ideal, as someone with access to the process table could learn the password. You can avoid this via the `-join-as` option, which will tell rqlite to retrieve the password from the configuration file.
+```bash
+rqlited -auth config.json -http-addr localhost:4003 -http-cert server.crt \
+-http-key key.pem -raft-addr :4004 -join https://localhost:4001 -join-as bob \
+-node-encrypt -node-cert node.crt -node-key node-key.pem -node-no-verify \
+~/node.2
+```