Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider not using on_disk_hash #1161

Open
mkeeter opened this issue Feb 14, 2024 · 0 comments
Open

Consider not using on_disk_hash #1161

mkeeter opened this issue Feb 14, 2024 · 0 comments

Comments

@mkeeter
Copy link
Contributor

mkeeter commented Feb 14, 2024

Right now, the Downstairs hashes twice:

  • To verify the incoming BlockContext::hash, it hashes a tuple of [nonce, tag, data] (or just data for unencrypted blocks)
  • Then, to generate DownstairsBlockContext::on_disk_hash, it hashes only data. This is also written to disk, alongside the BlockContext (with its own hash, discussed above)

I think the on_disk_hash is unnecessary. The purpose of on_disk_hash is to tell us which context data is valid at startup, but BlockContext::hash can serve exactly the same purpose: whichever context has a matching hash is correct.

In fact, most of the time we don't actually need a full rehash: we persist the active slots when clearing the dirty flag, which happens on every flush.

Skipping the computation of on_disk_hash would hopefully speed up writes: in a flamegraph of 1M random writes post-#1148 (which removes the dominance of __fdsync), I see the two calls to common_integrity_hash taking 10% and 6% of runtime respectively:

Screenshot 2024-02-14 at 5 20 33 PM

The downside is that recovering from a dirty shutdown would take twice as long, because we have to hash both [nonce1, tag1, data] and [nonce2, tag2, data]. (It's too bad we didn't choose the order [data, nonce, tag], which would have let us reuse hasher state and made this extra cost negligible)

I think it's still probably worthwhile – dirty shutdowns should be rare, and only the specific dirty extents would pay the extra price.

We could start doing this Right Now, without any on-disk format changes:

  • Use BlockContext::hash instead of on_disk_hash when checking for slot validity
  • Write on_disk_hash to all zeros

Since these changes are confined to the crucible-downstairs binary, we can make these two behavior changes atomically in a single release. It is also not a compatibility break with the upstairs, which only cares about BlockContext::hash (the on_disk_hash is never sent over the network). However, we wouldn't be able to roll back to previous crucible-downstairs releases!

We could also theoretically remove the on_disk_hash, saving 8 bytes from the OnDiskDownstairsBlockContext (so 16 bytes per block, since we have two slots per block). This is harder to do in place; if we wanted to make that change, we'd probably implement a RawInnerV2. I'd vote to not do that out of the gate.

mkeeter added a commit that referenced this issue Jul 31, 2024

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Right now, the `ReadResponse` sends back a `Vec<Option<BlockContext>>`,
i.e. an `Option<BlockContext>` for each block. The `BlockContext` in
turn contains

```rust
pub struct BlockContext {
    pub hash: u64,
    pub encryption_context: Option<EncryptionContext>,
}
```

If `encryption_context` is populated, then sending `hash` to the
Upstairs is unnecessary: we are using authenticated encryption, so we
know whether the block data + context is valid based on whether it
decrypted successfully.

This PR removes `hash` from the `ReadResponse` message in favor of a new
`enum`:
```rust
#[derive(Debug, PartialEq, Copy, Clone, Serialize, Deserialize)]
pub enum ReadBlockContext {
    Empty,
    Encrypted { ctx: EncryptionContext },
    Unencrypted { hash: u64 },
}
```

This does not change the on-disk format or write message format, which
both continue to use hashes:

- When **sending** data to the Downstairs, the hash (in `BlockContext`)
lets us detect corruption in transit. We can't use the encryption
context here, because the Downstairs isn't allowed to have encryption
keys
- When recovering from a bad write, `on_disk_hash` is used to figure out
which context slot is valid. `on_disk_hash` is never sent over the
network¹

This PR is a step towards [RFD 490 § Metadata
reduction](https://rfd.shared.oxide.computer/rfd/490#_metadata_reduction),
which proposes to **not** store block hashes for encrypted data on disk.
If in the future we don't store block hashes for encrypted data, we
would not be able to send them over the network; this PR removes that
future hurdle.

However, the PR stands alone as a small optimization (39 → 32 bytes per
block) that simplifies program behavior (no need to think about what
happens if encryption fails but the hash matches, or vice versa).

¹ `on_disk_hash` is also _technically_ superfluous if we already have
`hash` (see #1161), but
this PR doesn't change it
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant