
nvme_driver: test: hardening the save_restore testing for the nvme driver to verify keepalive functionality #815

Open · wants to merge 22 commits into base: main

Conversation

@gurasinghMS (Contributor) commented Feb 7, 2025

For now, testing of the nvme_keepalive functionality in the nvme driver is limited to running save() followed by restore() and verifying that restore() completes without panicking. This does not check that values are reassigned correctly after restore(), nor that the underlying memory is unchanged; the underlying buffers and their contents must be identical after restore.
This PR adds a verify_restore function to each component touched by nvme_keepalive and calls verify_restore() after restore(). This hardens the testing of the nvme_driver keepalive path to verify current functionality and prevent future regressions.

Fixes #817
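The hardened flow described above can be sketched with a toy driver; all names here (Driver, DriverState, verify_restore) are illustrative stand-ins, not the real nvme_driver API:

```rust
#[derive(Clone, PartialEq, Debug)]
struct DriverState {
    qid: u16,
    buffer: Vec<u8>,
}

struct Driver {
    state: DriverState,
}

impl Driver {
    fn save(&self) -> DriverState {
        self.state.clone()
    }

    fn restore(saved: &DriverState) -> Driver {
        Driver { state: saved.clone() }
    }

    // The hardening step: after restore(), check the rebuilt state against
    // the saved state instead of only checking that restore() did not panic.
    fn verify_restore(&self, saved: &DriverState) {
        assert_eq!(self.state, *saved);
    }
}

fn main() {
    let driver = Driver {
        state: DriverState { qid: 3, buffer: vec![1, 2, 3] },
    };
    let saved = driver.save();
    let restored = Driver::restore(&saved);
    restored.verify_restore(&saved);
    println!("restore verified");
}
```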

let worker = task.task();

// Verify Admin Queue
// TODO: [expand-verify-restore-functionality] Currently providing base_pfn value in u64, this might panic
Contributor:
This tag is tracked by #818, but maybe we can just make it all part of this PR?

Contributor (author):

I was planning on getting everything into one PR to avoid breaking things up into very small parts, which is why I hadn't opened an issue for it yet.

let cq_verify = self.cq.verify_restore(saved_state.cq_state.clone());
let pending_cmds_verify = self.commands.verify_restore(saved_state.pending_cmds.clone());

if let Err(_) = sq_verify {
Contributor:

This is cleaner as:

verify_state.complete(sq_verify.and(cq_verify).and(pending_cmds_verify));

Or, even better (imo), though you'd need to check whether the order of evaluation matters to you:

verify_state.complete(
    self.sq.verify_restore(saved_state.sq_state.clone())
        .and(self.cq.verify_restore(saved_state.cq_state.clone()))
        .and(self.commands.verify_restore(saved_state.pending_cmds.clone())),
);

I'm particularly nervous about forgetting something in the future. See https://doc.rust-lang.org/std/result/#boolean-operators
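The `Result::and` chaining suggested above can be seen in a minimal, self-contained sketch; `check` is a hypothetical stand-in for the verify_restore calls:

```rust
// Stand-in for a verify_restore call that returns Ok on match, Err otherwise.
fn check(name: &str, ok: bool) -> Result<(), String> {
    if ok {
        Ok(())
    } else {
        Err(format!("{name} mismatch"))
    }
}

fn main() {
    // `and` returns the second Result if the first is Ok, otherwise the
    // first Err. Note all three checks are evaluated eagerly; only the
    // *result* short-circuits to the first Err encountered.
    let combined = check("sq", true)
        .and(check("cq", true))
        .and(check("pending_cmds", false));
    assert_eq!(combined, Err("pending_cmds mismatch".to_string()));
    println!("{combined:?}");
}
```

One design note: because the arguments to `and` are evaluated before the call, this pattern reports only the first failure but still runs every check.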

Contributor (author):

This has since been changed; verify_restore() now panics if there is an error, so we no longer need to check the return values.

@gurasinghMS gurasinghMS marked this pull request as ready for review February 20, 2025 00:06
@gurasinghMS gurasinghMS requested review from a team as code owners February 20, 2025 00:06
@gurasinghMS gurasinghMS marked this pull request as draft February 20, 2025 00:35
@gurasinghMS gurasinghMS force-pushed the user/gurasingh/verify-restore-for-reattach-test branch from 519df41 to 6e2bd10 on February 20, 2025 22:10
@@ -270,15 +289,24 @@ pub struct EmulatedDmaAllocator {
shared_mem: DeviceSharedMemory,
}

impl EmulatedDmaAllocator {
pub fn new(shared_mem: DeviceSharedMemory) -> Self {
Contributor (author):

Was a `new` function for this removed? It suddenly stopped working for me, so I had to write this instead!

Contributor:

There were a few changes in this area recently.

@gurasinghMS gurasinghMS marked this pull request as ready for review February 21, 2025 21:33
/// Given an input of the saved state from which the driver was constructed and the underlying
/// memory, this validates the current driver.
#[cfg(test)]
pub(crate) async fn verify_restore(&mut self, saved_state: &NvmeDriverSavedState, mem: MemoryBlock) {
Member:

I don't quite understand what the purpose of this is. Is this to make sure that restore restores the thing? It just seems like it's duplicating the logic in restore. Is that valuable?

Member:

To put it differently, I don't see the value of this kind of "invasive" testing: we're just checking to see if restore puts the internal state of our object in the state that we expect. But if we screwed up restore, isn't it just as likely that we screwed up verify_restore? Now we have to get it right in two places.

A more effective test would be to validate that after restore, the driver is in the state we expect by observing the behavior of the driver.

Contributor (author):

I see your point! Initially I approached this as something that would both test restore and prevent future regressions: a basic check that values were set correctly. For example, if the restore functionality ever regressed to restore every submission queue with one entry fewer than intended, that would theoretically not be an invalid state for the driver (I think), but it would definitely be incorrect, since repeated restores would shrink the queue significantly. Coming up with, and testing for, every such edge case by looking at behavior alone might prove hard.

What I am wondering is: does the type of testing I built out provide any value at all? Should I scrap it, or should I add more behavior-based tests on top of it? I am happy to go either way, but I'm interested in your thoughts on how to proceed.

Contributor:

As a first step we just need to re-enable restore() in unit tests. The reason it was disabled in the first place is that it used the attach_dma_buffer call in EmulatedDma, and the team's feedback was to remove that from EmulatedDma because it mocked non-existent functionality.

You re-added attach_dma_buffer, so I expect the team's feedback will be the same, but let's see...

We need to redesign some interfaces to allow unit tests to mock PFN-based access. Or, maybe, hide the PFN details in a separate module.

As a next step we would like to generate I/O traffic before/after servicing.

// TODO: Can this be an associated function instead?
#[cfg(test)]
pub(crate) fn verify_restore(&self, saved_state: &SubmissionQueueSavedState) {
assert_eq!(saved_state.sqid, self.sqid);
Contributor:

Two questions about these checks:

  1. Is there a reason to think that saved state value may be different from the restored state value? We literally create a new object from the saved values. If it checks if we did not forget to restore some fields then this can be guaranteed by "destructuring" fields which should be used in most cases, and compiler will give an error if something is missing.
  2. I would generally prefer to move these checks to tests.rs. We can add public functions to read private values from the object.
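The "destructuring" guarantee mentioned in point 1 can be sketched as follows; the struct fields here are illustrative, not the full saved-state definition:

```rust
struct SubmissionQueueSavedState {
    sqid: u16,
    head: u32,
    tail: u32,
}

fn verify_restore(restored: &SubmissionQueueSavedState, saved: &SubmissionQueueSavedState) {
    // Destructuring binds every field by name. If a new field is later added
    // to the struct, this pattern stops compiling, so the compiler itself
    // catches a forgotten field instead of a hand-written checklist.
    let SubmissionQueueSavedState { sqid, head, tail } = saved;
    assert_eq!(*sqid, restored.sqid);
    assert_eq!(*head, restored.head);
    assert_eq!(*tail, restored.tail);
}

fn main() {
    let saved = SubmissionQueueSavedState { sqid: 1, head: 0, tail: 4 };
    let restored = SubmissionQueueSavedState { sqid: 1, head: 0, tail: 4 };
    verify_restore(&restored, &saved);
    println!("verified");
}
```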

-fn attach_dma_buffer(&self, _len: usize, _base_pfn: u64) -> anyhow::Result<MemoryBlock> {
-    anyhow::bail!("restore is not supported for emulated DMA")
+fn attach_dma_buffer(&self, len: usize, base_pfn: u64) -> anyhow::Result<MemoryBlock> {
+    let memory = MemoryBlock::new(
+        self.shared_mem
+            .alloc_specific(len, base_pfn.try_into().unwrap())
+            .context("could not alloc specific. out of memory")?,
+    );
Contributor:

I guess @chris-oo should chime in on this one.
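As an aside on the TODO about base_pfn, the `try_into().unwrap()` in the snippet above can be made fallible instead of panicking. A minimal sketch using only std; `attach` is a hypothetical stand-in, not the real DmaClient API:

```rust
// Convert a u64 base_pfn to usize without unwrap(); on a 32-bit target a
// large PFN becomes a proper error instead of a panic.
fn attach(len: usize, base_pfn: u64) -> Result<(usize, usize), String> {
    let pfn: usize = base_pfn
        .try_into()
        .map_err(|_| format!("base_pfn {base_pfn} does not fit in usize"))?;
    Ok((len, pfn))
}

fn main() {
    assert_eq!(attach(4096, 42), Ok((4096, 42)));
    println!("{:?}", attach(4096, 42));
}
```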

// ===== COPY MEMORY =====
let host_allocator_original = EmulatedDmaAllocator::new(shared_mem_original.clone());
let mem_original = DmaClient::attach_dma_buffer(&host_allocator_original, base_len, 0).unwrap();
copy_mem_block(&mem_original, shared_mem_copy.clone());
Contributor:

Ideally we would swap the definition. Since the keepalive feature intends to restore data into the same memory block, see if we can implement this sequence instead:

  1. Use original block for 1st controller
  2. Save
  3. Copy original block contents to a backup copy
  4. Restore to original block
  5. Compare original block with the backup copy
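The five steps above can be sketched with a plain `Vec<u8>` as a stand-in for the DMA memory block; `restore_in_place` is hypothetical, not the real driver API:

```rust
// Hypothetical in-place restore: keepalive restores into the same block,
// so it must not disturb the buffer contents.
fn restore_in_place(_block: &mut Vec<u8>) {}

fn main() {
    // 1. Use the original block for the first controller.
    let mut original: Vec<u8> = vec![0u8; 16];
    original[0] = 0xAB; // the controller writes some state

    // 2. Save the driver state.
    // 3. Copy the original block contents to a backup copy.
    let backup = original.clone();

    // 4. Restore into the original block.
    restore_in_place(&mut original);

    // 5. Compare the original block with the backup copy.
    assert_eq!(original, backup);
    println!("blocks match");
}
```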

Successfully merging this pull request may close: nvme_driver: first save/restore test
4 participants