Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: minimalistic storage layer setup #11

Closed
wants to merge 11 commits into from
Closed
Show file tree
Hide file tree
Changes from 9 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/workflows/check.yml
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,7 @@ jobs:
uses: dtolnay/install@cargo-docs-rs
- name: cargo docs-rs
# TODO: Once we figure out the crates, rename this.
run: cargo docs-rs -p optd-tmp
run: cargo docs-rs -p optd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still personally think that this should be called optd-storage instead of just optd because it will become confusing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think last time we decided initially we are going to have a single crate called optd. storage will be a module inside that crate.

Copy link
Member

@connortsui20 connortsui20 Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case, then we arguably shouldn't have a workspace at all.

With what @SarveshOO7 was working on, I think it makes sense to have a separate crate for everything around the pipeline we talked about because we do not want to have to recompile datafusion every single time we make a change.

I think either we completely remove the workspace and call it optd, or we start with 2 crates (something like optd-core and optd-datafusion) and make sure we don't expand beyond 2 without good reason.

hack:
# cargo-hack checks combinations of feature flags to ensure that features are all additive
# which is required for feature unification
Expand Down
9 changes: 8 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -18,4 +18,11 @@ Cargo.lock
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
#.idea/

### Project Specific ###

# The memo table database for testing purposes.
test_memo.db
# Storing environment variables.
.env
16 changes: 15 additions & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,3 +1,17 @@
[workspace]
members = ["optd-tmp"]
members = ["optd"]
resolver = "2"

[workspace.dependencies]
anyhow = "1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to use snafu given the service-like-library nature of this project

chrono = "0.4.39"
diesel = { version = "2.2", features = [
"sqlite",
"returning_clauses_for_sqlite_3_35",
"chrono",
] }
enum_dispatch = "0.3"
# Using a bundled version of sqlite3-sys to avoid build issues.
libsqlite3-sys = { version = "0.30", features = ["bundled"] }
dotenvy = "0.15"
diesel_migrations = "2.2"
15 changes: 15 additions & 0 deletions diesel.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
# For documentation on how to configure this file,
# see https://diesel.rs/guides/configuring-diesel-cli

[print_schema]
# The file diesel will write the generated schema to.
file = "optd/src/storage/schema.rs"


# A column of type `INTEGER PRIMARY KEY` becomes an alias for the 64-bit signed integer `ROWID`.
# See https://sqlite.org/autoinc.html for more details.
sqlite_integer_primary_key_is_bigint = true

[migrations_directory]
# The directory where the migration files are located.
dir = "optd/migrations"
1 change: 1 addition & 0 deletions docs/src/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
# Contributor Guide

- [Installaton]()
- [Working with diesel-rs](./contributor_guide/diesel.md)

# RFCs

Expand Down
55 changes: 55 additions & 0 deletions docs/src/contributor_guide/diesel.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
# Working with diesel-rs

[Diesel](https://diesel.rs/) is an ORM framework we use to persist the core objects in the optd query optimizer. We chose to work with Diesel instead of other alternatives mainly for its compile-time safety guarantees, which is a good companion for our table-per-operator-kind model.

This guide assumes that you already have the `sqlite3` binary installed.

## Setup

When working with Diesel for the first time, you could use the convenient setup scripts located at `scripts/setup.sh`. The script will install the Diesel CLI tool, generate a testing memo table database at project root, and run the Diesel setup script.

For more details, follow the [Getting Started with Diesel](https://diesel.rs/guides/getting-started.html) guide.

## Making changes

To generate a new migration, use the following command:

```shell
diesel migration generate <migration_name>
```

Diesel CLI will create two empty files in the `optd-storgage/migrations` folder. You will see output that looks something like this:

```shell
Creating optd-storage/migrations/2025-01-20-153830_<migration_name>/up.sql
Creating optd-storage/migrations/2025-01-20-153830_<migration_name>/down.sql
```

The `up.sql` file should contain the changes you want to apply and `down.sql` should contain the command to revert the changes.

Before optd becomes stable, it is ok to directly modify the migrations themselves.

To apply the new migration, run:

```shell
diesel migration run
```

You can also check that if `down.sql` properly revert the change:

```shell
diesel migration redo [-n <REDO_NUMBER>]
```

You can also use the following command to revert changes:

```shell
diesel migration revert [-n <REVERT_NUMBER>]

## Adding a new operator

(TODO)

## Adding a new property

(TODO)
6 changes: 0 additions & 6 deletions optd-tmp/Cargo.toml

This file was deleted.

14 changes: 0 additions & 14 deletions optd-tmp/src/lib.rs

This file was deleted.

13 changes: 13 additions & 0 deletions optd/Cargo.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
[package]
name = "optd"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow.workspace = true
chrono.workspace = true
diesel.workspace = true
diesel_migrations.workspace = true
dotenvy.workspace = true
enum_dispatch.workspace = true
libsqlite3-sys.workspace = true
Empty file added optd/migrations/.keep
Empty file.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
DROP TABLE rel_groups;
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
-- A relational group contains a set of relational expressions
-- that are logically equivalent.
CREATE TABLE rel_groups (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think expand out the "relational" here, and also be consistent with what the migration file directory is called too

-- The group identifier.
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
-- Time at which the group is created.
created_at TIMESTAMP DEFAULT (CURRENT_TIMESTAMP) NOT NULL
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add TODOs here saying that we will add other columns like an optional winner, potentially cost, and also other group metadata related to our union-find parent pointer idea

Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
DROP TABLE logical_op_kinds;
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
-- The logical operator descriptors table specifies all the
-- logical operators that can be used in optimizer.
CREATE TABLE logical_op_kinds (
-- The identifier of the logical operator.
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
-- The name of the logical operator.
name TEXT NOT NULL
);
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
DROP TABLE physical_op_kinds;
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
-- The physical operator descriptor table stores all the
-- physical operators that can be used in the optimizer.
CREATE TABLE physical_op_kinds (
Copy link
Collaborator

@AlSchlo AlSchlo Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I think having an enum here might be more practical. It will also make all the queries simpler and safe us a join. Might as well codegen everywhere, right?

This table will never change at run time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll experiment with that during the WE.

Copy link
Member

@connortsui20 connortsui20 Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, even though I haven't actually given the logic of this PR a review, we should be pushing as much work as we can to compile-time / codegen / preprocessing, especially since we want the foundation to be minimal at runtime so that we don't run into a wave of runtime bugs in the future

-- The identifier of the physical operator.
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
-- The name of the physical operator.
name TEXT NOT NULL
);
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
DROP TABLE logical_exprs;
14 changes: 14 additions & 0 deletions optd/migrations/2025-01-22-223940_create_logical_exprs/up.sql
Copy link
Member

@connortsui20 connortsui20 Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there might be an unnecessary level of indirection here?

For the sake of this example, suppose we only have logical expression types. My understanding of this is that there is a logical_exprs table that tracks expression IDs and maps them to their logical operator kind (Scan, Filter, Join). The purpose of this is so that every logical expression ID lookup can be paired with a logical operator kind so that we know what specific table to go look at.

What if instead of having a full table that makes this mapping, we instead encode the operator kind inside of the expression ID? For example, we could have it such that the upper 16 bits of a 64-bit ID encode the operator kind, and the lower 48 bits encode the unique expression given the operator kind (we would likely not even need all 16 bits, we could probably get away with the top 8 bits even).

You could probably argue that this introduces complexity, but in my mind I feel that having extra tables that are not strictly necessary introduces even more complexity. When I was trying to read this PR I had to spend a non-trivial amount of time trying to understand the architecture, since I did not expect that specific table when we initially talked about the architecture in previous meetings.

In terms of implementation of the above proposal, yes it would be a bit ugly, but if abstracted correctly at the storage layer level, there should only really be 1 function that handles the conversion into some sort of strongly-typed struct like this:

struct ExprId {
    kind: u16, // could also encode logical/physical difference before converting into stronger types
    id: u64,   // truncated at 48 bits
}

This is opposed to having a dedicated table that makes this mapping, where every single expression lookup now has to look up and extra record on disk (and even in memory, this roundtrip is going to need to happen for every single lookup during optimization).

As for the group ID mapping, I think that needs to be in a separate junction table regardless for group operations. Unless this is actually supposed to be the junction table? If that's the case, then all 3 things in this table would have needed to be a composite primary key tuple.

I think there are tradeoffs to both models, but I am leaning towards removing the layer of indirection.

Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
-- The relational logical expressions table specifies
-- which group a logical expression belongs to.
CREATE TABLE logical_exprs (
-- The logical expression id.
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go with Chi's suggestion of having a unique ID for each object we shouldn't use autoincrement here

-- The logical operator descriptor id.
logical_op_kind_id BIGINT NOT NULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably doesn't need to be BIGINT here? Or do foreign key references HAVE to be BIGINT?

-- The group this logical expression belongs to.
group_id BIGINT NOT NULL, -- groups.id
-- Time at which the logical expression is created.
created_at TIMESTAMP DEFAULT (CURRENT_TIMESTAMP) NOT NULL,
FOREIGN KEY (logical_op_kind_id) REFERENCES logical_op_kinds(id) ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (group_id) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE
);
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
DROP TABLE physical_exprs;
17 changes: 17 additions & 0 deletions optd/migrations/2025-01-22-224147_create_physical_exprs/up.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
-- The relational physical expressions table specifies which group
-- a physical expression belongs to and the total cost for executing
-- a physical plan rooted at this expression.
CREATE TABLE physical_exprs (
-- The physical expression id.
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
-- The physical operator descriptor id.
physical_op_kind_id BIGINT NOT NULL,
-- The group this physical expression belongs to.
group_id BIGINT NOT NULL,
-- The total cost for executing a physical plan rooted at this expression (FAKE).
total_cost DOUBLE NOT NULL,
-- Time at which the physical expression is created.
created_at TIMESTAMP DEFAULT (CURRENT_TIMESTAMP) NOT NULL,
FOREIGN KEY (physical_op_kind_id) REFERENCES physical_op_kinds(id) ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (group_id) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE
);
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
-- Deregisters the logical join operator.
DELETE FROM logical_op_kinds where name = 'LogicalJoin';

DROP TABLE logical_joins;
17 changes: 17 additions & 0 deletions optd/migrations/2025-01-22-231932_create_logical_joins/up.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
-- Registers the logical join operator.
CREATE TABLE logical_joins (
logical_expr_id INTEGER NOT NULL PRIMARY KEY,
-- The type of join (inner, left, right, etc.).
join_type INTEGER NOT NULL,
-- The group id of the left child.
left BIGINT NOT NULL,
-- The group id of the right child.
right BIGINT NOT NULL,
-- The join condition (mocked).
join_cond TEXT NOT NULL,
FOREIGN KEY (logical_expr_id) REFERENCES logical_exprs(id) ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (left) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (right) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE
);

INSERT INTO logical_op_kinds (name) VALUES ('LogicalJoin');
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
-- Deregister the physical nested loop join operator.
DELETE FROM physical_op_kinds where name = 'PhysicalNLJoin';

DROP TABLE physical_nljoins;

Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
-- Registers the physical nested loop join operator.
CREATE TABLE physical_nljoins (
physical_expr_id INTEGER NOT NULL PRIMARY KEY,
-- The type of join (inner, left, right, etc.).
join_type INTEGER NOT NULL,
-- The group id of the left child.
left BIGINT NOT NULL,
-- The group id of the right child.
right BIGINT NOT NULL,
-- The join condition (mocked).
join_cond TEXT NOT NULL,
FOREIGN KEY (physical_expr_id) REFERENCES physical_expr_id(id) ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (left) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (right) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE
);

INSERT INTO physical_op_kinds (name) VALUES ('PhysicalNLJoin');
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
-- Deregisters the logical scan operator.
DELETE FROM logical_op_kinds where name = 'LogicalScan';

DROP TABLE logical_scans;
10 changes: 10 additions & 0 deletions optd/migrations/2025-01-23-044524_create_logical_scans/up.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
-- Registers the logical scan operator.
CREATE TABLE logical_scans (
logical_expr_id INTEGER NOT NULL PRIMARY KEY,
-- Ideally this will be an unique id of the table in the catalog,
-- For now using table name to fake it.
table_name TEXT NOT NULL,
FOREIGN KEY (logical_expr_id) REFERENCES logical_exprs(id) ON DELETE CASCADE ON UPDATE CASCADE
);

INSERT INTO logical_op_kinds (name) VALUES ('LogicalScan');
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
-- Deregister the physical nested loop join operator.
DELETE FROM physical_op_kinds where name = 'PhysicalTableScan';

DROP TABLE physical_table_scans;
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
-- Registers the physical table scan operator.
CREATE TABLE physical_table_scans (
physical_expr_id INTEGER NOT NULL PRIMARY KEY,
-- Ideally this will be an unique id of the table in the catalog,
-- For now using table name to fake it.
table_name TEXT NOT NULL,
FOREIGN KEY (physical_expr_id) REFERENCES physical_expr_id(id) ON DELETE CASCADE ON UPDATE CASCADE
);

INSERT INTO physical_op_kinds (name) VALUES ('PhysicalTableScan');
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
-- Deregisters the logical filter operator.
DELETE FROM logical_op_kinds where name = 'LogicalFilter';

DROP TABLE logical_filters;

12 changes: 12 additions & 0 deletions optd/migrations/2025-01-23-061054_create_logical_filters/up.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
-- Registers the logical filter operator.
CREATE TABLE logical_filters (
logical_expr_id INTEGER NOT NULL PRIMARY KEY,
-- The group id of the child.
child BIGINT NOT NULL,
-- The filter predicate (e.g. <colA> > 3) (mocked).
predicate TEXT NOT NULL,
FOREIGN KEY (logical_expr_id) REFERENCES logical_exprs(id) ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (child) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE
);

INSERT INTO logical_op_kinds (name) VALUES ('LogicalFilter');
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
-- Deregisters the physical filter operator.
DELETE FROM physical_op_kinds where name = 'PhysicalFilter';

DROP TABLE physical_filters;

12 changes: 12 additions & 0 deletions optd/migrations/2025-01-23-061101_create_physical_filters/up.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
-- Registers the physical filter operator.
CREATE TABLE physical_filters (
physical_expr_id INTEGER NOT NULL PRIMARY KEY,
-- The group id of the child.
child BIGINT NOT NULL,
-- The predicate to filter on (e.g. <colA> > 3) (mocked).
predicate TEXT NOT NULL,
FOREIGN KEY (physical_expr_id) REFERENCES physical_exprs(id) ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (child) REFERENCES rel_groups(id) ON DELETE CASCADE ON UPDATE CASCADE
);

INSERT INTO physical_op_kinds (name) VALUES ('PhysicalFilter');
Loading
Loading