These iterators (if you'd call them) provide safe abstractions over raw
pointers, enabling us to easily use them in the storage engine, reducing
verbosity and redundancy.
This has the ability to introduce UB because advancing the cursor might
invalidate the source pointers obtained by `protocol::Parser::parse()`
That's why we only forward the cursor after the query has been
executed. This is absolutely fine because even if another query
packet has been buffered, we will only forward by the amount we read
and not anything more.
This commit does an extreme amount of `unsafe` work to reduce the amount
of data that is copied around. Previously, we:
1. Copied in from the buffer
2. Copied again when we did a ptr::read
This changes that to:
1. Zero-copies for operations that can operate on borrowed values
(i.e values by ref like GET, KEYLEN, ..)
2. One copy for operations that need owned values
Overall, this may not immediately seem beneficial for small queries
(afterall, computer memory is very fast), but for large keys -- this can
make a very significant difference. However, a big difference can be noticed
in memory usage under heavy load (with as much as a tenfold drop in some
cases).
But to make this possible, we have to break free from Rust's borrowing
rules as we can't normally pass around refs easily. So, we just turn
everything into raw pointers and operate on them directly. This is why
we introduce a significant amount of unsafe code.
Where actions were supposed to report an action error, they silently ran
their significant part, ignoring the rest. This was fixed in:
- `DROP TABLE`
- `INSPECT KEYSPACES`
- `INSPECT KEYSPACE <ksid>`
- `INSPECT TABLE <entity>`
- `USE <entity>`
Tests for the same were added
When a new instance is created, we need to:
1. Create the tree
This ONLY creates the directories
2. Create the PRELOAD
This is critical because this is our check for a new instance
3. Flush the tables
This is important because we have never flushed the tables/ks before.
If we don't do this -- the server would fail to start with a
`directory not found` error.
Checking all encoding errors beforehand simplifies a lot of things for
us. At the same time, this saves a lot of bandwidth as we don't have to
write encoding errors for each element -- instead we just write one.
* Add forceful dropping of keyspaces
This commit also improves the reliability of `drop keyspace` in general
* Add changelog
* Add tests for `force_drop_keyspace`
* Upgrade deps
This commit adds a storage_type segment to the PARTMAP disk file. This
contains information about the storage type of the table.
Is it volatile? Is it persistent? 8-bits were added for future
improvements.
* Use our own lock instead of parking_lot::Mutex
* Account for spurious failures in cmpxchg weak
* Ignore send error because parent may have panicked
The parent thread may have already panicked, dropping the rx.
* Enable maximum connections to be configured
* Add arbiter for handling server startup
* Add handling of maxcon for command-line args
* Add changelog entry
* Remove the need for TableLockStateGuard
The htable impl uses locks under the hood making external locks
redundant.
* Use atomics instead of rwlock for poisoned state
* Simplify snapshot locking
* Remove the pid file if runtime errors occur
* Clean up error handling and fix pid file creation
The pid file was being created before evaluating the args, now it may
happen that incorrect args or --help was passed: in that event, the pid
file remains created. This was also fixed, besides some refactoring.
* Deter other processes from using the same data dir
For more information, see #167
* Don't lock `pid_file`
Windows has mandatory locking so second instance won't be able to read
the PID of the other process. We'll just keep the file descriptor/handle
open
This is very useful because it removes the need for user intervention in
the event save on termination fails. Say the save operation fails due to
'some bad daemon' changing the directory's perms. Now skyd reports this
error while trying to save upon termination. Our sysadmin now fixes the
perms issue. The previous design would force the sysadmin to _somehow_
foreground skyd and hit enter. That is silly. The new design just
attempts to do a save operation every 10 seconds. So in case the issue
is fixed, the save operation will recover on its own.
Why not exponential backoff?
That's because the issue can be fixed some long time later and we may
have reached a large backoff value so the save that could have succeeded
would have to wait for a long duration before it can do anything
meaningful.
This also fixes a bug that caused BGSAVE errors to be reported as info
class log entries.
* Explicitly fsync and relax CPU on snap busy-loop
This commit also switches to using global `VERSION` and `URL` statics
than defining it per-crate.
* Add changelog entry and bump up version
* Optimize `dbtest` macro and rm redundant allocs
* Upgrade deps
This is a silly optimization but can be significantly faster for larger
action names. Also, since there are (should be) no UTF-8 characters in
the first argument, this is absolutely fine and sensible.
* Stop accepting writes if snapshotting fails
This is an important consideration: if BGSAVE fails and poisons the
database, snapshotting can and should too. But this is debatable in some
parts. For example, users may configure snapshots to be on a network
file system (symlinked maybe) and this can fail.
Now in some cases, this failure 'may be acceptable'. This commit adds a
way to customize this behavior through the `failsafe` key in the
snapshots section of the cfg file and through the --stop-write-on-fail
option passed to `skyd` on startup. However, BGSAVE remains unchanged:
it will always poison the database if it fails. If the user doesn't want
this, they can simply disable BGSAVE.
* Add changelog
* Create a new file on writing to flock-ed file
This fix is a very important one in two ways. Say we have an user A.
They go ahead and launch skyd. skyd creates a data.bin file. Now A just
deletes the data.bin file for fun. Funny enough, this never causes flock
to error!
Why? Well because the descriptor/handle is still valid and was just
unlinked from the current directory. But this might seem silly since
the user exits with a 'successfully saved notice' only to find that the
file never existed and all of their data was lost. That's bad.
There's a hidden problem in our current approach too, apart from this.
Our writing process begins by truncating the old file and then writing
to it by placing the cursor at 0. Nice, but what if this operation just
crashes. So we lost the current data AND the old data. Not good.
This commit does a better thing: it creates a new temporary file, locks
it before writing and then flushes the current data to the temporary
file. Once that succeeds, it replaces the old data.bin file with the
newly created file.
This solves both the problems mentioned here for us:
1. No more of the silly error
2. If BGSAVE crashes in between, we can be sure that at least the last
data.bin file is in proper shape and not half truncated or so.
This commit further moves the background services into their
own module(s) for easy management.
* Fix CI scripts
Fixes:
1. Our custom runner (drone/.ci.yml) was modified to kill the skyd
process once done since this pipeline is not ephemeral.
2. GHA for some reason ignores any error in the test step and proceeds
to kill the skyd process without erroring. Since GHA runners are
ephemeral, we don't need to do this manually.
What we did in the old implementation was pure over-engineering.
We relied on CoreDB's `Drop` impl to terminate the background services.
Now this is absolutely unreliable due to the nature of async functions.
We also relied on the bgsave scheduler to release the lock upon exit
which is also unreliable because we left the service to the mercy of the
runtime. We spawned the task and didn't hold as much as a `JoinHandle`
to it. That's bad because the runtime can just abort these tasks which
may result in the lock never being released. Even though it is designed
to release the lock on Drop, the destructor may however not be called at
all.
This commit fixes all those issues by simplifying the entire impl to
use Terminator. Now the background save and snapshot services run
independently, in their own tasks. Whenever the user passes a SIGINT,
we tell everyone to quit. The listeners understand that this is the
last query they'll process and the background save tasks exit almost
immediately. But what if some data was modified by this last query...?
No worries, that is completely handled by main(). The lock that BGSAVE
leaves is immediately (almost) returned to main and main will attempt
to flush the data almost immediately. That's how we maintain reliability
This commit ensures that BGSAVE is optimistic in doing what it is doing:
If BGSAVE fails once, it will immediately poison the table. Now let's
say that some amazing sysadmin managed to SSH into the server and was
able to fix the storage issue; BGSAVE would be able to succeed.
The current implementation was flawed: firstly it prevented that and
secondly even if it succeeded in running BGSAVE, the server would refuse
to accept writes. This commit fixes this behavior.
The size part of the metaline is absolutely redundant as we're doing
double the work while reading the size and then the real thing.
Since sizes won't have escape codes, we can freely read upto the LF
If Parser::will_cursor_give_char is set to not error if a char matches
or the next line is empty, return Ok(bool). If this_if_nothing_ahead is
set to false, then return a NotEnough error if no more chars are
available.
The newly added test explains why
It is likely that we'll change the HashMap implementation in the future,
hence its best to hide away the HashMap to make sure we can easily
replace it.
The previous logic was heavily flawed; it only had to check if the path
was a dir and isn't the remote snapshot directory.
Similarly, the file name parsing should only kick in if the item is a
file
This commit adds changes so that the main process almost immediately
acquires a lock on the data file when runtime is dropped. This is just
an added precaution to try and ensure that no other process does
something silly with the data file.
The descriptor is cloned for this using `FileLock::try_clone`
8e46e62 added a block_on_process_exit function that kept on sending
`notify_one()`s in a loop until the services terminated. This was
pointless as the `Drop` impl would do it for us anyways.
(What was I thinking?)
So, in main(), we're spawning an async task that lets the DB run as long
as we don't pass a ctrl_c (or some bad panic occurs). Once the ctrl_c
is received, we start terminating all workers. `block_on` returns DB
which should be the only one holding an atomic reference to the shared
field. We assert this right after dropping `runtime`.
Finally, the ECONNRESET suppression match was fixed to remove an
unreachable branch by adding conditional compilation
This commit ensures that the workers exit before attempting a flush_db
operation. Only after block_on_process_exit finishes we return `db`.
Now we run a simple flush_db operation knowing that the lock has been
released.
To block on process termination, we introduce a new function
block_on_process_exit that does the same thing as CoreDB's Drop
implementation.
Windows is the most ingenious OS in the world where filenames can
conflict with shell commands. That's right, con is an I/O device on
Windows and cannot be used for a filename! This is why we were having
checkout errors on Windows!
Vive la POSIX!
We have introduced a trait `BufferedSocketStream` that is a 'dummy'
trait and is implemented for both `SslStream<TcpStream>` and
`TcpStream`. So, the generic `Connection` object accepts any type that
implements the `BufferedSocketStream` trait (and hence should also
implement `AsyncWrite`)
This commit does a LOT! It migrates the `queryengine::execute_simple`,
`CoreDB::execute_query` and the kvengine functions to use generic
connections.
The object dbnet::Con was removed because it isn't needed anymore.
The listeners were also upgraded to use the generic connection handler
The trait `Con` and `ConOps` were renamed to `ProtocolConnectionExt`
and `ProtocolConnection`.
This naming scheme clearly explains that the Ext version 'augments' the
non-Ext impl. This is the very case here: ProtocolConnection provides
the basic funtions needed for interfacing with net I/O while the Ext
trait enables high-level interaction with the protocol and ultimately
queries.
A generic `ConnectionHandler` object was added that will replace the
SSL and non-SSL handler objects, again reducing redundancy.
Dummy execute functions were added to CoreDB and queryengine.
This commit defines two traits: `Con` and `ConOps`. Implementors of
`ConOps` get a free implementation for `Con`. `Con` is the ultimate
object that can be used in place of the current SSL/non-SSL connection
objects. If you look at the implementations of the current connection
objects, they have a lot of repetition as they do almost the same thing
except for the fact that they have a different underlying stream.
This is exactly what we're trying to eliminate. We will also define a
generic connection handler object to reduce redundancy.
Several changes were made to accomodate for this, including the addition
of the write_to_disk function that should be used by fns which don't
have a FileLock to pass for flushing data to the disk.
BGSAVE now takes ownership of a FileLock object which it uses for
running BGSAVE.
It seems that on Windows unlocking errors if the file has already been
unlocked. To fix this, we've added a platform-specific field to see if
the FileLock object has already been used to unlock the file.
This is the unlock field. In the Drop impl for Windows, we check the
unlocked flag to determine if we need to unlock the file.
This commit implements file locks for unix-based systems and windows
systems. This is done by using platform-specific `__sys` modules for
locking, trying to lock and unlocking files.
A build script was added for unix-systems that make use of the
flock-posix.c file
This commit removes the fscposix.c file and begins implementing native
file locking mechanisms for each platform (supported platforms)
BSD-style `flock`s were added
This commit adds a basic implementation of POSIX advisory record locking
which sets a lock on the `data.bin` file when the database server starts
and releases the lock when it terminates. This is just done for
compliance to let other processes know that we don't want them to use
the file.
However, the result depends entirely on the process that wants to do
'something' with the file. It is the responsibility of the process to
ensure that it respects the file lock.
Also, exclusive locks aren't perfect on Linux, so we can't rely on them.
See discussion #123 for more information.
This variant is absolutely redundant as we're just ignoring the entire
buffer and not just a part of it
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
This closes#107, closes#108 and closes#109.
The configuration template was updated to include TLS/SSL and the
corresponding tests were also updated.
It also renames `sdb` to `skyd` for streamlining binary names.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
This commit uses a Regex match iterator along with a few replace
operations to enable the parsing of quoted strings from arguments.
Previously, we simply ran a `split_whitespace()` to get the parts of the
ActionGroup, but now we're using this new Regex which enables arguments
like: 'SET me "sayan spaced"' to be passed and validated.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
This commit now checks if the second value passed to MKSNAP points
to any parent/root directory before performing any action.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
This call is necessary as SUPDATE returns Nil even if one of the keys
don't exist.
Also, this was note added to the actiondoc.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
Just like fd139a9dda, this skip is not
needed as we're already breaking from the loop.
Also, all remaining unsafe blocks that were left in
2e85fbb831 have been explained.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
The skipping of the next value is absolutely unneeded as we're already
exiting the loop when the hash table contains the key.
Thus, this op was removed.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
Dependencies were upgraded
A mirror sync badge was added. Also README was revised and workflow badge was fixed.
Signed-off-by: Sayan Nandan nandansayan@outlook.com
`openssl-sys` isn't required anymore; it was added in the `ssl` branch
during the development phase.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
Also, dependencies were upgraded across all crates and the version for
`tdb-macros` was streamlined to 0.5.0 like the other crates.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
The data was being delievered in different batches, which
caused problems. This commit replaces the current stream
writer with a buffered writer ensuring good delivery.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
We were doing an extremely erroneous thing: writing to the TCP
socket instead of the SSL socket. This caused OpenSSL to report
problems on the client and server sides, telling us that there
was a problem with the SSL connection.
This commit revises the `write_lowlevel` trait impl for `SslStream` to
write to the SSL socket.
Also, the 'wrong' flushing of data for similar reasons has been fixed
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
This commit enables queries to be executed on secure connections.
At the same time, the `execute_query_ssl` function was removed as
`execute_query` has been modified so that it can be used by both secure
and insecure connections
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
This commit enables SSL settings to be read from command-line arguments
and from the configuration file.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
This object either holds a mutable reference to an unencrypted TCP
stream or an encrypted TLS stream (using OpenSSL). The action handlers
were modified as a consequence.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
This commit adds a basic SSL/TLS listener using `openssl`.
The `SslListener` object can accept a connection and get a decrypted
stream.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
This makes sure that we can implement serialization for multiple types.
At the same time, we DON'T put in blanket implementations for this trait
as we want to make sure that unintended things don't get flushed.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
We've made the error handling slightly 'less aggressive' than the
previous version which `unwrap`ped here and there.
Also, the documentation for the entire snapstore module was greatly
improved.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
This commit provides an implementation which allows multiple 'named'
namespaces to be serialized and deserialized.
No information about the data needs to be known for deserialization;
To facilitate this, a partition map is implemented which is stored as a
separate file. For now, the data file is called `snapstore.bin` and the
partition metadata file is called partmap.
The partition map (`PartMap`) contains a vector of `Partition` objects.
This object stores 'markers' (`len`) which are ideally byte positions
or offsets that demarcate the locations of the individual namespaces.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
Previously we were cloning data before writing to disk which
caused slowdowns and lead to higher memory usage. This is an
attempt to fix this behavior.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
Also the deps were upgraded (there's no point of dependabot creating
multiple commits for upgrading deps)
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
We previously made a big mistake: we tried to set poisoned to true, an
operation that requires a write lock, without dropping the read lock.
This commit ensures that we first drop `rlock` before doing anything
else. At the same time, if BGSAVE has failed, we'll just shut down the
service.
The error descriptions were improved for a failed `get_saved`
operation. And finally, a couple of variables in `MKSNAP` were renamed
to make them sound more sensible.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
For this, `Writable` was implemented for `&'static [u8]`.
Although this won't have a very noticeable impact on performance, we
will stick to using references instead of cloning the data and then
referencing it again.
Along with this, the docs for MKSNAP were updated.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
Let's say that BGSAVE fails for some reason or the other. As it is
our responsibility to ensure the integrity of the user's data and
to enforce the reliability of the overall database, we should
immediately stop accepting writes, so that new changes aren't made
since the last time flushing data to the disk succeeded.
The functions under kvengine were modified as a consequence, and they
will return an "Internal Server Error" which is response code 5 if
the database is poisoned.
To put it in one line: If running BGSAVE fails, we'll `poison` the
in-memory table and stop accepting write operations, which ideally is
any operation that attempts to obtain a write lock.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
There is no merit in having a redundant `enabled` key under
`snapshot` in the configuration file; if a `snapshot` key is present,
it is inherently obvious that snapshotting is enabled
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
Instead of checking to see if the remote snapshots directory exists
whenever MKSNAP is called, we'll create the directory when the server
starts up
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
Since creating snapshots is quite an important utility,
there may be scenarios where creating one may be needed,
even if it is disabled on the server side. This commit
enables such snapshots to be created. This is achieved by
enabling MKSNAP to accept two arguments, where a 'named'
snapshot can be created, which is our "special" snapshot.
All these "special" snapshots are stored in a separate
"snapshots/remote" dir that is ignored by the
`SnapshotEngine`.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
Following on from ab9561258e, we'll allow
the user to configure snapshots and BGSAVE via command-line arguments
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
Until now, the database server could only be configured via the
configuration file. This commit enables the host, port and noart
options to be configured via command-line arguments.
This is important as there may be scenarios where creating a file
presents a challenge to the user.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
The user can now run `tdb -r <snapshotname>` to restore data from the
snapshot. Also, we'll show a note in the logs when trying to restore from
a snapshot
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
This module isn't doing anything good staying here, as it does nothing!
More work needs to be done on the file storage format internally before
a public rollout.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
Upstream changes in tokio have required several changes.
For example, `delay_until` was renamed to `sleep_until`. Similarly,
`notify` was renamed to `notify_one`.
Also, in tdb-macros, the `runtime::Builder::new()` line was changed into
`runtime::Builder::new_multi_thread()` due to changes upstream
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
If the duration for a periodic operation is set to zero in the
config file then it is likely that the thread would crash.
We want to prevent this at all costs, which is why
we're adding this check.
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
We don't need tests for MKSNAP when it is enabled as we already have
tests for snapshotting in `diskstore::snapshot`
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
To streamline the actions we'll make sure that mksnap returns code 3
if an incorrect number of arguments are provided
The actiondoc for the `mksnap` action was added and the contributing
guide was also updated
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
In `cli` other errors are now formatted in a `[ERR]` format
Also the documentation across the project was updated
Signed-off-by: Sayan Nandan <nandansayan@outlook.com>
`CoreDB` now has a `SnapshotStatus` object for this purpose
This object holds the current state of the service in an `RWLock`
When `mksnap` is called, `SnapEngine` sets `in_progress` to true
When the snapshot is created, `SnapEngine` sets `in_progress` to false
This prevents multiple entities from creating snapshots at the same time
The proc_macro can now be applied on modules only
All functions within the module will be considered to be a test
This has the advantage of not having to flag every test function
There can be use cases where the user wants to keep all the snapshots
This commit adds a way to keep all the snapshots, by setting maxtop to 0
When `maxtop` is set to 0, all the snapshots will be kept