
Simplify block context for reads #1391

Merged: 3 commits, Jul 31, 2024

Conversation

mkeeter (Contributor) commented Jul 23, 2024

Right now, the ReadResponse sends back a Vec<Option<BlockContext>>, i.e. an Option<BlockContext> for each block. The BlockContext in turn contains

pub struct BlockContext {
    pub hash: u64,
    pub encryption_context: Option<EncryptionContext>,
}

If encryption_context is populated, then sending hash to the Upstairs is unnecessary: we are using authenticated encryption, so we know whether the block data + context is valid based on whether it decrypted successfully.

This PR removes hash from the ReadResponse message in favor of a new enum:

#[derive(Debug, PartialEq, Copy, Clone, Serialize, Deserialize)]
pub enum ReadBlockContext {
    Empty,
    Encrypted { ctx: EncryptionContext },
    Unencrypted { hash: u64 },
}

This does not change the on-disk format or write message format, which both continue to use hashes:

  • When sending data to the Downstairs, the hash (in BlockContext) lets us detect corruption in transit. We can't use the encryption context here, because the Downstairs isn't allowed to have encryption keys
  • When recovering from a bad write, on_disk_hash is used to figure out which context slot is valid. on_disk_hash is never sent over the network¹

This PR is a step towards RFD 490 § Metadata reduction, which proposes to not store block hashes for encrypted data on disk. If in the future we don't store block hashes for encrypted data, we would not be able to send them over the network; this PR removes that future hurdle.

However, the PR stands alone as a small optimization (39 → 32 bytes per block) that simplifies program behavior (no need to think about what happens if encryption fails but the hash matches, or vice versa).

¹ on_disk_hash is also technically superfluous if we already have hash (see #1161), but this PR doesn't change it
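
To make the "simplifies program behavior" point concrete, here is a minimal, self-contained sketch of the read-validation dispatch the new enum allows. The types are simplified stand-ins, and validate_block, decrypt, hash_fn, and dummy_hash are hypothetical names rather than the Upstairs' real API; the point is that each variant implies exactly one check, so mixed states like "hash matches but decryption fails" can no longer arise.

#[derive(Debug, Clone, Copy)]
struct EncryptionContext {
    nonce: [u8; 12],
    tag: [u8; 16],
}

#[derive(Debug, Clone, Copy)]
enum ReadBlockContext {
    Empty,
    Encrypted { ctx: EncryptionContext },
    Unencrypted { hash: u64 },
}

#[derive(Debug)]
enum ValidationError {
    Decryption,
    HashMismatch,
}

/// Hypothetical validation dispatch: one check per variant.
fn validate_block(
    block: &ReadBlockContext,
    data: &mut [u8],
    decrypt: impl Fn(&EncryptionContext, &mut [u8]) -> bool,
    hash_fn: impl Fn(&[u8]) -> u64,
) -> Result<(), ValidationError> {
    match block {
        // A never-written block carries no context; the real code may also
        // check that the returned data is all zeroes.
        ReadBlockContext::Empty => Ok(()),
        // Authenticated decryption is the only check for encrypted blocks.
        ReadBlockContext::Encrypted { ctx } => {
            if decrypt(ctx, data) {
                Ok(())
            } else {
                Err(ValidationError::Decryption)
            }
        }
        // Unencrypted blocks are still validated against their hash.
        ReadBlockContext::Unencrypted { hash } => {
            if hash_fn(data) == *hash {
                Ok(())
            } else {
                Err(ValidationError::HashMismatch)
            }
        }
    }
}

fn dummy_hash(d: &[u8]) -> u64 {
    // Toy hash for the demo below; the real integrity hash is much stronger.
    d.iter().map(|b| *b as u64).sum()
}

fn main() {
    // An unencrypted block round-trips when its hash matches.
    let mut data = vec![1u8, 2, 3];
    let ctx = ReadBlockContext::Unencrypted { hash: dummy_hash(&data) };
    assert!(validate_block(&ctx, &mut data, |_, _| false, dummy_hash).is_ok());
}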

mkeeter requested review from faithanalog, jmpesp and leftwo, July 23, 2024 18:12
leftwo (Contributor) left a comment

A question inline about a match that returns None when maybe it should be an error.

"expected encrypted but got unencrypted \
block context"
);
None
Contributor:

If we return None here, and the data is all blank, will this result in returning Ok(Validation::Empty) from validate_encrypted_read_response()?

mkeeter (Contributor, Author) commented Jul 24, 2024

Good catch, fixed in 410a874 (returning an error)
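
For readers skimming the thread, a hypothetical sketch of the shape of such a fix; the actual change is in 410a874, and the function name and the choice of returning CrucibleError::DecryptionError here are assumptions rather than the real diff. The point is that an unexpected context variant is surfaced as an error instead of being treated like a never-written block.

#[derive(Debug, Clone, Copy)]
struct EncryptionContext {
    nonce: [u8; 12],
    tag: [u8; 16],
}

#[derive(Debug, Clone, Copy)]
enum ReadBlockContext {
    Empty,
    Encrypted { ctx: EncryptionContext },
    Unencrypted { hash: u64 },
}

#[derive(Debug)]
enum CrucibleError {
    DecryptionError,
}

/// Returns the encryption context for an encrypted read, `None` for a
/// genuinely empty (never-written) block, and an error for a context that
/// should never appear on an encrypted volume.
fn encrypted_read_context(
    block: &ReadBlockContext,
) -> Result<Option<EncryptionContext>, CrucibleError> {
    match block {
        ReadBlockContext::Encrypted { ctx } => Ok(Some(*ctx)),
        ReadBlockContext::Empty => Ok(None),
        // Previously this arm logged "expected encrypted but got unencrypted
        // block context" and returned None, which could let an all-blank block
        // look like Validation::Empty; treating it as an error avoids that.
        ReadBlockContext::Unencrypted { .. } => Err(CrucibleError::DecryptionError),
    }
}

fn main() {
    let bad = ReadBlockContext::Unencrypted { hash: 0 };
    assert!(encrypted_read_context(&bad).is_err());
}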

"expected unencrypted but got encrypted \
block context"
);
None
Contributor:

Same comment here as above about returning None when we found encrypted context.

mkeeter (Contributor, Author):

Same fix as above!

leftwo self-requested a review, July 24, 2024 22:06
jmpesp (Contributor) commented Jul 26, 2024

> If encryption_context is populated, then sending hash to the Upstairs is unnecessary: we are using authenticated encryption, so we know whether the block data + context is valid based on whether it decrypted successfully.

Isn't it necessary? Without it, the Upstairs can't distinguish between corruption in transit and a block where either the data or encryption context is corrupted. Those are two different scenarios where the Upstairs should take different actions (retry or bail).

mkeeter (Contributor, Author) commented Jul 26, 2024

> Isn't it necessary? Without it, the Upstairs can't distinguish between corruption in transit and a block where either the data or encryption context is corrupted.

I don't think adding the hash will help us to distinguish between these two cases.

Right now, the Downstairs doesn't verify the hash before sending it over the wire, so corruption in transit versus on disk both look the same: the hash or encryption context doesn't match at the Upstairs. Sending the hash doesn't give us any extra information about where corruption happens, because we don't guarantee that the hash is valid before sending it.

(I think this is intentional: having the Downstairs check the hash would add a bunch of overhead, and we've already got ZFS doing checksums under the hood for us)

jmpesp (Contributor) commented Jul 26, 2024

I think it depends on whether or not we will ever receive corrupted data back from a ZFS* read instead of an error.

If it is possible for the Downstairs to receive corrupted data back from a read, then there isn't a meaningful difference between what the integrity hash allows the Upstairs to distinguish over having the encryption context: a corrupted read would mean that both the integrity hash doesn't match and that the encryption context won't decrypt the block, and the Upstairs cannot distinguish between these. Asking for a retry because the integrity hash is wrong (thinking that it was corrupted on the wire) won't fix it.

If it is not possible to receive corrupted data back from a read, then there is a difference: the Downstairs does validate the integrity hash before accepting writes, which (if validation succeeds) means that the data was not corrupted in transit, and if the data comes back at all from a read then it wasn't corrupted at rest either. Then Downstairs doesn't have to validate the hash before sending back to the Upstairs because the only other possible corruption of the data is in transit back to the Upstairs.

*: this doesn't apply to non-checksumming filesystems, but our production deployments always use ZFS :)

mkeeter (Contributor, Author) commented Jul 26, 2024

> If it is possible for the Downstairs to receive corrupted data back from a read, then there isn't a meaningful difference between what the integrity hash allows the Upstairs to distinguish over having the encryption context: a corrupted read would mean that both the integrity hash doesn't match and that the encryption context won't decrypt the block, and the Upstairs cannot distinguish between these. Asking for a retry because the integrity hash is wrong (thinking that it was corrupted on the wire) won't fix it.

Yes, agreed – in a world where ZFS can give us corrupted data, we can't tell the difference between in-transit and on-disk corruption, so sending the hash doesn't give us extra information.

> If it is not possible to receive corrupted data back from a read, then there is a difference: the Downstairs does validate the integrity hash before accepting writes, which (if validation succeeds) means that the data was not corrupted in transit, and if the data comes back at all from a read then it wasn't corrupted at rest either. Then Downstairs doesn't have to validate the hash before sending back to the Upstairs because the only other possible corruption of the data is in transit back to the Upstairs.

I also agree with this, but think it proves more than what you're saying – in fact, it proves that sending the hash is superfluous 😄

In a world where ZFS cannot give us corrupted data:

  • The Downstairs validates the combination of write data and encryption context against the received hash (here), before writing to disk, so we know that both data and encryption context were not corrupted in transit from Upstairs → Downstairs
  • If the data and encryption context come back from a read, then they weren't corrupted on disk (baseline assumption for this section)
  • Therefore, the only possibility for corruption of data or encryption context is in transit from Downstairs → Upstairs
  • Therefore, sending the hash doesn't give us any extra information

This analysis ignores cases where someone deliberately corrupts files on disk; in such a situation, the attacker has full control over whether the hash matches or not, so it still doesn't give us any extra information.
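
To make the first bullet above concrete, here is a minimal self-contained sketch of that write-path check. WriteBlockContext and validate_write are hypothetical names, and std's DefaultHasher stands in for the project's real integrity_hash; any strong hash over nonce + tag + data illustrates the idea.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct WriteBlockContext {
    hash: u64,
    nonce: [u8; 12],
    tag: [u8; 16],
}

// Stand-in for the project's integrity_hash(&[...]) helper (not the real
// algorithm).
fn placeholder_integrity_hash(parts: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    for p in parts {
        p.hash(&mut h);
    }
    h.finish()
}

/// Hypothetical pre-write validation: catches Upstairs -> Downstairs
/// corruption without the Downstairs ever holding an encryption key.
fn validate_write(ctx: &WriteBlockContext, data: &[u8]) -> Result<(), String> {
    let computed =
        placeholder_integrity_hash(&[&ctx.nonce[..], &ctx.tag[..], data]);
    if computed == ctx.hash {
        Ok(())
    } else {
        Err(format!(
            "write hash mismatch: expected 0x{:x}, computed 0x{:x}",
            ctx.hash, computed
        ))
    }
}

fn main() {
    let data = [7u8; 8];
    let nonce = [0u8; 12];
    let tag = [0u8; 16];
    let ctx = WriteBlockContext {
        hash: placeholder_integrity_hash(&[&nonce[..], &tag[..], &data[..]]),
        nonce,
        tag,
    };
    assert!(validate_write(&ctx, &data).is_ok());
}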

jmpesp (Contributor) commented Jul 30, 2024

> Yes, agreed – in a world where ZFS can give us corrupted data, we can't tell the difference between in-transit and on-disk corruption, so sending the hash doesn't give us extra information.

Ignoring in-transit corruption for a sec, if the Downstairs can receive corrupted data back from ZFS, then it can tell if the integrity hash is wrong and send an appropriate error to the Upstairs. We don't currently check the hash because we believe that ZFS can't return us corrupted data: I think this is a safe assumption to keep making, but if it isn't then we can invent a new error type as appropriate.

> Therefore, sending the hash doesn't give us any extra information

I still don't follow. Right now the behaviour of validate_encrypted_read_response is:

    // We'll start with decryption, ignoring the hash; we're using authenticated
    // encryption, so the check is just as strong as the hash check.
    //
    // Note: decrypt_in_place does not overwrite the buffer if
    // it fails, otherwise we would need to copy here. There's a
    // unit test to validate this behaviour.
    use aes_gcm_siv::{Nonce, Tag};
    let decryption_result = encryption_context.decrypt_in_place(
        data,
        Nonce::from_slice(&block_encryption_ctx.nonce[..]),
        Tag::from_slice(&block_encryption_ctx.tag[..]),
    );
    if decryption_result.is_ok() {
        Ok(Validation::Encrypted(*block_encryption_ctx))
    } else {
        // Validate integrity hash before decryption
        let computed_hash = integrity_hash(&[
            &block_encryption_ctx.nonce[..],
            &block_encryption_ctx.tag[..],
            data,
        ]);

        if computed_hash == context.hash {
            // No encryption context combination decrypted this block, but
            // one valid hash was found. This can occur if the decryption
            // key doesn't match the key that the data was encrypted with.
            error!(log, "Decryption failed with correct hash");
            Err(CrucibleError::DecryptionError)
        } else {
            error!(
                log,
                "No match for integrity hash\n\
             Expected: 0x{:x} != Computed: 0x{:x}",
                context.hash,
                computed_hash
            );
            Err(CrucibleError::HashMismatch)
        }
    }

I agree with the ordering here: attempt authenticated decryption first, then only if it fails check the integrity hash. The authenticated decryption is a "stronger" check anyway. If the block was corrupted in transit, as it's written today this function will likely return HashMismatch instead of DecryptionError.

Following where that error is checked though, both process_io_completion_inner and process_io_completion right now panic for both error types:

process_io_completion_inner does:

            Some(CrucibleError::DecryptionError) => {
                // We should always be able to decrypt the data.  If we
                // can't, then we have the wrong key, or the data (or key)
                // is corrupted.
                error!(
                    self.clients[client_id].log,
                    "Authenticated decryption failed on job: {:?}", ds_id
                );
                panic!(
                    "[{}] Authenticated decryption failed on job: {:?}",
                    client_id, ds_id
                );
            }

and process_io_completion does:

                        // If a read job fails, we sometimes need to panic.
                        IOop::Read { .. } => {
                            // It's possible we get a read error if the
                            // downstairs disconnects. However XXX, someone
                            // should be told about this error.
                            //
                            // Some errors, we need to panic on.
                            match e {
                                CrucibleError::HashMismatch => {
                                    panic!(
                                        "[{}] {} read hash mismatch {:?} {:?}",
                                        self.client_id, ds_id, e, job
                                    );
                                }
                                CrucibleError::DecryptionError => {
                                    panic!(
                                        "[{}] {} read decrypt error {:?} {:?}",
                                        self.client_id, ds_id, e, job
                                    );
                                }

We don't have anything where the Upstairs would re-request the read result or rerun the read job if it detects a hash mismatch like this. Given a network problem that corrupts packets, a whole bunch of Upstairs could panic, then on some sort of reboot (of the instance or whatnot) all go into reconciliation due to the panics leaving them in a state that requires it. The Upstairs could be smarter about not panicking on a hash mismatch, and somehow degrade gracefully instead.

Writing all this out, I think there's a broader problem to address: such a network problem could potentially corrupt any frame sent between the Upstairs and Downstairs, probably leading to a whole bunch of "error reading from FramedReader" errors on both sides. We should probably be embedding a checksum into the frame itself to detect this? This may also help reject invalid frames (like when I was mistakenly issuing GET requests against what turned out to be a Downstairs!). I'm OK not recording the integrity hash in an effort to reduce stored metadata if we instead perform that checksum check at a "lower" level than the Message, so to speak. What do you think?

jmpesp (Contributor) commented Jul 30, 2024

Yeah, I think I get it now. I had to draw this out, assuming that network corruption would manifest as bit flips:

|                               | No bit flip in hash            | Bit flip in hash                  |
|-------------------------------|--------------------------------|-----------------------------------|
| No bit flip in data + context | Decryption success, hash match | Decryption success, hash mismatch |
| Bit flip in data + context    | Decryption fail, hash mismatch | Decryption fail, hash mismatch    |

Having the hash doesn't help us, because when decryption fails there's a hash mismatch, and when decryption succeeds the hash doesn't matter. Sorry I was so confused about this. In fact I think the only part where we can reliably detect packet corruption is the case where the decryption succeeds and there's a hash mismatch, meaning we can say there was a bit flip in the integrity hash part of the frame.

I was hoping to come out of this with some ability of the Upstairs to tell the difference between a decryption failure it can somehow retry (by re-requesting the block or something), and a decryption failure it knows not to retry (due to on-disk corruption). If the Downstairs only validates the hash that is sent by the Upstairs before a write, and the write is successful, then corruption can occur via the disk or network and the Upstairs can't tell the difference. This makes me uncomfortable w.r.t the potential for cascading reconciliations. I was (again) hoping that we could detect network corruption and not attempt reconciliation over what we can determine to be an unreliable network.

I'll copy what you said in a DM here for posterity: we either have to

  • Check the integrity hash downstairs before sending it down the wire (detects on-disk corruption, requires storing a hash on disk, in-transit corruption is detected when stuff fails to decrypt)
  • Add a message hash to the on-the-wire message (detects in-transit corruption, ZFS corruption is detected when stuff fails to decrypt)

where unfortunately both of those options require computing a full hash of the block data in the Downstairs.
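
Purely to make the second option concrete (this is not something the PR or the current protocol does): a rough sketch of a frame-level checksum, assuming the crc32fast crate is available as a dependency and using hypothetical frame_with_checksum / check_frame helpers. Either side could then drop or re-request a frame that fails the check instead of surfacing a per-block hash mismatch.

fn frame_with_checksum(payload: &[u8]) -> Vec<u8> {
    // Append a CRC32 of the serialized message so either side can reject a
    // frame corrupted in transit before trying to decode it.
    let mut hasher = crc32fast::Hasher::new();
    hasher.update(payload);
    let checksum = hasher.finalize();

    let mut frame = Vec::with_capacity(payload.len() + 4);
    frame.extend_from_slice(payload);
    frame.extend_from_slice(&checksum.to_le_bytes());
    frame
}

fn check_frame(frame: &[u8]) -> Option<&[u8]> {
    // Split off the trailing checksum and verify it; None means "drop (or
    // re-request) this frame" instead of feeding garbage to the decoder.
    let (payload, trailer) = frame.split_at(frame.len().checked_sub(4)?);
    let expected =
        u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    let mut hasher = crc32fast::Hasher::new();
    hasher.update(payload);
    (hasher.finalize() == expected).then_some(payload)
}

fn main() {
    let framed = frame_with_checksum(b"hello downstairs");
    assert_eq!(check_frame(&framed), Some(&b"hello downstairs"[..]));

    let mut corrupted = framed.clone();
    corrupted[0] ^= 0x01; // simulate a bit flip in transit
    assert_eq!(check_frame(&corrupted), None);
}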

jmpesp (Contributor) left a comment

This looks good; we can follow up with "the broader issue" separately. Some comments and questions:

// this block context is missing an encryption context!
return Err(CrucibleError::DecryptionError);
};

// We'll start with decryption, ignoring the hash; we're using authenticated
Contributor:

This comment needs updating, and perhaps an additional line or two about how packet corruption currently causes decryption failure.

mkeeter (Contributor, Author):

Yup, done in 5ae6bd0

@@ -570,12 +573,19 @@ pub struct WriteHeader {
pub contexts: Vec<BlockContext>,
}

#[derive(Debug, PartialEq, Copy, Clone, Serialize, Deserialize)]
pub enum ReadBlockContext {
Contributor:

I really like this stronger type over the Option

mkeeter (Contributor, Author):

Me too 😀

// We should get back our old data, because block 0 was written.
assert_eq!(
resp.blocks,
vec![ReadBlockContext::Unencrypted { hash: prev_hash }]
Contributor:

It looks like the value being tested changed here - was this test wrong before? How did this previously pass?

mkeeter (Contributor, Author):

Previously, it was

assert_ne!(resp.blocks, vec![Some(block_context)]);

i.e. checking that we don't read back the new value

Now, it's assert_eq! and checks that we do read back the old value.

Contributor:

Cool cool

mkeeter merged commit 07cb9ed into oxidecomputer:main on Jul 31, 2024
18 checks passed
mkeeter deleted the block-read-context branch on July 31, 2024 19:46