-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: minimalistic storage layer setup #11
Changes from 9 commits
e1aa1d8
a26c52e
4400421
0e06331
0373b4b
d13c731
e073616
6337fa5
595bbe8
037ded3
32857c2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,17 @@ | ||
[workspace] | ||
members = ["optd-tmp"] | ||
members = ["optd"] | ||
resolver = "2" | ||
|
||
[workspace.dependencies] | ||
anyhow = "1" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We probably want to use |
||
chrono = "0.4.39" | ||
diesel = { version = "2.2", features = [ | ||
"sqlite", | ||
"returning_clauses_for_sqlite_3_35", | ||
"chrono", | ||
] } | ||
enum_dispatch = "0.3" | ||
# Using a bundled version of sqlite3-sys to avoid build issues. | ||
libsqlite3-sys = { version = "0.30", features = ["bundled"] } | ||
dotenvy = "0.15" | ||
diesel_migrations = "2.2" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
# For documentation on how to configure this file, | ||
# see https://diesel.rs/guides/configuring-diesel-cli | ||
|
||
[print_schema] | ||
# The file diesel will write the generated schema to. | ||
file = "optd/src/storage/schema.rs" | ||
|
||
|
||
# A column of type `INTEGER PRIMARY KEY` becomes an alias for the 64-bit signed integer `ROWID`. | ||
# See https://sqlite.org/autoinc.html for more details. | ||
sqlite_integer_primary_key_is_bigint = true | ||
|
||
[migrations_directory] | ||
# The directory where the migration files are located. | ||
dir = "optd/migrations" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
# Working with diesel-rs | ||
|
||
[Diesel](https://diesel.rs/) is an ORM framework we use to persist the core objects in the optd query optimizer. We chose to work with Diesel instead of other alternatives mainly for its compile-time safety guarantees, which is a good companion for our table-per-operator-kind model. | ||
|
||
This guide assumes that you already have the `sqlite3` binary installed. | ||
|
||
## Setup | ||
|
||
When working with Diesel for the first time, you could use the convenient setup scripts located at `scripts/setup.sh`. The script will install the Diesel CLI tool, generate a testing memo table database at project root, and run the Diesel setup script. | ||
|
||
For more details, follow the [Getting Started with Diesel](https://diesel.rs/guides/getting-started.html) guide. | ||
|
||
## Making changes | ||
|
||
To generate a new migration, use the following command: | ||
|
||
```shell | ||
diesel migration generate <migration_name> | ||
``` | ||
|
||
Diesel CLI will create two empty files in the `optd-storgage/migrations` folder. You will see output that looks something like this: | ||
|
||
```shell | ||
Creating optd-storage/migrations/2025-01-20-153830_<migration_name>/up.sql | ||
Creating optd-storage/migrations/2025-01-20-153830_<migration_name>/down.sql | ||
``` | ||
|
||
The `up.sql` file should contain the changes you want to apply and `down.sql` should contain the command to revert the changes. | ||
|
||
Before optd becomes stable, it is ok to directly modify the migrations themselves. | ||
|
||
To apply the new migration, run: | ||
|
||
```shell | ||
diesel migration run | ||
``` | ||
|
||
You can also check that if `down.sql` properly revert the change: | ||
|
||
```shell | ||
diesel migration redo [-n <REDO_NUMBER>] | ||
``` | ||
|
||
You can also use the following command to revert changes: | ||
|
||
```shell | ||
diesel migration revert [-n <REVERT_NUMBER>] | ||
|
||
## Adding a new operator | ||
|
||
(TODO) | ||
|
||
## Adding a new property | ||
|
||
(TODO) |
This file was deleted.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
[package] | ||
name = "optd" | ||
version = "0.1.0" | ||
edition = "2021" | ||
|
||
[dependencies] | ||
anyhow.workspace = true | ||
chrono.workspace = true | ||
diesel.workspace = true | ||
diesel_migrations.workspace = true | ||
dotenvy.workspace = true | ||
enum_dispatch.workspace = true | ||
libsqlite3-sys.workspace = true |
connortsui20 marked this conversation as resolved.
Show resolved
Hide resolved
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
DROP TABLE rel_groups; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
-- A relational group contains a set of relational expressions | ||
-- that are logically equivalent. | ||
CREATE TABLE rel_groups ( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah I think expand out the "relational" here, and also be consistent with what the migration file directory is called too |
||
-- The group identifier. | ||
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, | ||
-- Time at which the group is created. | ||
created_at TIMESTAMP DEFAULT (CURRENT_TIMESTAMP) NOT NULL | ||
); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add TODOs here saying that we will add other columns like an optional winner, potentially cost, and also other group metadata related to our union-find parent pointer idea |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
DROP TABLE logical_op_kinds; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
-- The logical operator descriptors table specifies all the | ||
-- logical operators that can be used in optimizer. | ||
CREATE TABLE logical_op_kinds ( | ||
-- The identifier of the logical operator. | ||
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, | ||
-- The name of the logical operator. | ||
name TEXT NOT NULL | ||
); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
DROP TABLE physical_op_kinds; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
-- The physical operator descriptor table stores all the | ||
-- physical operators that can be used in the optimizer. | ||
CREATE TABLE physical_op_kinds ( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TBH, I think having an enum here might be more practical. It will also make all the queries simpler and safe us a join. Might as well codegen everywhere, right? This table will never change at run time. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll experiment with that during the WE. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree, even though I haven't actually given the logic of this PR a review, we should be pushing as much work as we can to compile-time / codegen / preprocessing, especially since we want the foundation to be minimal at runtime so that we don't run into a wave of runtime bugs in the future |
||
-- The identifier of the physical operator. | ||
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, | ||
-- The name of the physical operator. | ||
name TEXT NOT NULL | ||
); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
DROP TABLE logical_exprs; |
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems like there might be an unnecessary level of indirection here? For the sake of this example, suppose we only have logical expression types. My understanding of this is that there is a What if instead of having a full table that makes this mapping, we instead encode the operator kind inside of the expression ID? For example, we could have it such that the upper 16 bits of a 64-bit ID encode the operator kind, and the lower 48 bits encode the unique expression given the operator kind (we would likely not even need all 16 bits, we could probably get away with the top 8 bits even). You could probably argue that this introduces complexity, but in my mind I feel that having extra tables that are not strictly necessary introduces even more complexity. When I was trying to read this PR I had to spend a non-trivial amount of time trying to understand the architecture, since I did not expect that specific table when we initially talked about the architecture in previous meetings. In terms of implementation of the above proposal, yes it would be a bit ugly, but if abstracted correctly at the storage layer level, there should only really be 1 function that handles the conversion into some sort of strongly-typed struct like this: struct ExprId {
kind: u16, // could also encode logical/physical difference before converting into stronger types
id: u64, // truncated at 48 bits
} This is opposed to having a dedicated table that makes this mapping, where every single expression lookup now has to look up and extra record on disk (and even in memory, this roundtrip is going to need to happen for every single lookup during optimization). As for the group ID mapping, I think that needs to be in a separate junction table regardless for group operations. Unless this is actually supposed to be the junction table? If that's the case, then all 3 things in this table would have needed to be a composite primary key tuple. I think there are tradeoffs to both models, but I am leaning towards removing the layer of indirection. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
-- The relational logical expressions table specifies | ||
-- which group a logical expression belongs to. | ||
CREATE TABLE logical_exprs ( | ||
-- The logical expression id. | ||
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we go with Chi's suggestion of having a unique ID for each object we shouldn't use autoincrement here |
||
-- The logical operator descriptor id. | ||
logical_op_kind_id BIGINT NOT NULL, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Probably doesn't need to be |
||
-- The group this logical expression belongs to. | ||
group_id BIGINT NOT NULL, -- groups.id | ||
-- Time at which the logical expression is created. | ||
created_at TIMESTAMP DEFAULT (CURRENT_TIMESTAMP) NOT NULL, | ||
FOREIGN KEY (logical_op_kind_id) REFERENCES logical_op_kinds(id) ON DELETE CASCADE ON UPDATE CASCADE, | ||
FOREIGN KEY (group_id) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE | ||
); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
DROP TABLE physical_exprs; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
-- The relational physical expressions table specifies which group | ||
-- a physical expression belongs to and the total cost for executing | ||
-- a physical plan rooted at this expression. | ||
CREATE TABLE physical_exprs ( | ||
-- The physical expression id. | ||
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, | ||
-- The physical operator descriptor id. | ||
physical_op_kind_id BIGINT NOT NULL, | ||
-- The group this physical expression belongs to. | ||
group_id BIGINT NOT NULL, | ||
-- The total cost for executing a physical plan rooted at this expression (FAKE). | ||
total_cost DOUBLE NOT NULL, | ||
-- Time at which the physical expression is created. | ||
created_at TIMESTAMP DEFAULT (CURRENT_TIMESTAMP) NOT NULL, | ||
FOREIGN KEY (physical_op_kind_id) REFERENCES physical_op_kinds(id) ON DELETE CASCADE ON UPDATE CASCADE, | ||
FOREIGN KEY (group_id) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE | ||
); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
-- Deregisters the logical join operator. | ||
DELETE FROM logical_op_kinds where name = 'LogicalJoin'; | ||
|
||
DROP TABLE logical_joins; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
-- Registers the logical join operator. | ||
CREATE TABLE logical_joins ( | ||
logical_expr_id INTEGER NOT NULL PRIMARY KEY, | ||
-- The type of join (inner, left, right, etc.). | ||
join_type INTEGER NOT NULL, | ||
-- The group id of the left child. | ||
left BIGINT NOT NULL, | ||
-- The group id of the right child. | ||
right BIGINT NOT NULL, | ||
-- The join condition (mocked). | ||
join_cond TEXT NOT NULL, | ||
FOREIGN KEY (logical_expr_id) REFERENCES logical_exprs(id) ON DELETE CASCADE ON UPDATE CASCADE, | ||
FOREIGN KEY (left) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE, | ||
FOREIGN KEY (right) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE | ||
); | ||
|
||
INSERT INTO logical_op_kinds (name) VALUES ('LogicalJoin'); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
-- Deregister the physical nested loop join operator. | ||
DELETE FROM physical_op_kinds where name = 'PhysicalNLJoin'; | ||
|
||
DROP TABLE physical_nljoins; | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
-- Registers the physical nested loop join operator. | ||
CREATE TABLE physical_nljoins ( | ||
physical_expr_id INTEGER NOT NULL PRIMARY KEY, | ||
-- The type of join (inner, left, right, etc.). | ||
join_type INTEGER NOT NULL, | ||
-- The group id of the left child. | ||
left BIGINT NOT NULL, | ||
-- The group id of the right child. | ||
right BIGINT NOT NULL, | ||
-- The join condition (mocked). | ||
join_cond TEXT NOT NULL, | ||
FOREIGN KEY (physical_expr_id) REFERENCES physical_expr_id(id) ON DELETE CASCADE ON UPDATE CASCADE, | ||
FOREIGN KEY (left) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE, | ||
FOREIGN KEY (right) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE | ||
); | ||
|
||
INSERT INTO physical_op_kinds (name) VALUES ('PhysicalNLJoin'); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
-- Deregisters the logical scan operator. | ||
DELETE FROM logical_op_kinds where name = 'LogicalScan'; | ||
|
||
DROP TABLE logical_scans; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
-- Registers the logical scan operator. | ||
CREATE TABLE logical_scans ( | ||
logical_expr_id INTEGER NOT NULL PRIMARY KEY, | ||
-- Ideally this will be an unique id of the table in the catalog, | ||
-- For now using table name to fake it. | ||
table_name TEXT NOT NULL, | ||
FOREIGN KEY (logical_expr_id) REFERENCES logical_exprs(id) ON DELETE CASCADE ON UPDATE CASCADE | ||
); | ||
|
||
INSERT INTO logical_op_kinds (name) VALUES ('LogicalScan'); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
-- Deregister the physical nested loop join operator. | ||
DELETE FROM physical_op_kinds where name = 'PhysicalTableScan'; | ||
|
||
DROP TABLE physical_table_scans; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
-- Registers the physical table scan operator. | ||
CREATE TABLE physical_table_scans ( | ||
physical_expr_id INTEGER NOT NULL PRIMARY KEY, | ||
-- Ideally this will be an unique id of the table in the catalog, | ||
-- For now using table name to fake it. | ||
table_name TEXT NOT NULL, | ||
FOREIGN KEY (physical_expr_id) REFERENCES physical_expr_id(id) ON DELETE CASCADE ON UPDATE CASCADE | ||
); | ||
|
||
INSERT INTO physical_op_kinds (name) VALUES ('PhysicalTableScan'); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
-- Deregisters the logical filter operator. | ||
DELETE FROM logical_op_kinds where name = 'LogicalFilter'; | ||
|
||
DROP TABLE logical_filters; | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
-- Registers the logical filter operator. | ||
CREATE TABLE logical_filters ( | ||
logical_expr_id INTEGER NOT NULL PRIMARY KEY, | ||
-- The group id of the child. | ||
child BIGINT NOT NULL, | ||
-- The filter predicate (e.g. <colA> > 3) (mocked). | ||
predicate TEXT NOT NULL, | ||
FOREIGN KEY (logical_expr_id) REFERENCES logical_exprs(id) ON DELETE CASCADE ON UPDATE CASCADE, | ||
FOREIGN KEY (child) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE | ||
); | ||
|
||
INSERT INTO logical_op_kinds (name) VALUES ('LogicalFilter'); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
-- Deregisters the physical filter operator. | ||
DELETE FROM physical_op_kinds where name = 'PhysicalFilter'; | ||
|
||
DROP TABLE physical_filters; | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
-- Registers the physical filter operator. | ||
CREATE TABLE physical_filters ( | ||
physical_expr_id INTEGER NOT NULL PRIMARY KEY, | ||
-- The group id of the child. | ||
child BIGINT NOT NULL, | ||
-- The predicate to filter on (e.g. <colA> > 3) (mocked). | ||
predicate TEXT NOT NULL, | ||
FOREIGN KEY (physical_expr_id) REFERENCES physical_exprs(id) ON DELETE CASCADE ON UPDATE CASCADE, | ||
FOREIGN KEY (child) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE | ||
); | ||
|
||
INSERT INTO physical_op_kinds (name) VALUES ('PhysicalFilter'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still personally think that this should be called
optd-storage
instead of justoptd
because it will become confusingThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think last time we decided initially we are going to have a single crate called
optd
.storage
will be a module inside that crate.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case, then we arguably shouldn't have a workspace at all.
With what @SarveshOO7 was working on, I think it makes sense to have a separate crate for everything around the pipeline we talked about because we do not want to have to recompile datafusion every single time we make a change.
I think either we completely remove the workspace and call it
optd
, or we start with 2 crates (something likeoptd-core
andoptd-datafusion
) and make sure we don't expand beyond 2 without good reason.