-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nvme_driver: test: hardening the save_restore testing for the nvme driver to verify keepalive functionality #815
base: main
Are you sure you want to change the base?
Conversation
let worker = task.task(); | ||
|
||
// Verify Admin Queue | ||
// TODO: [expand-verify-restore-functionality] Currrently providing base_pfn value in u64, this might panic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this tag tracked by #818 , but maybe we can just get it all part of this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was planning on just getting everything in 1 PR to avoid breaking up things in to super small parts tbh which is why I didn't have an issue opened for it yet.
let cq_verify = self.cq.verify_restore(saved_state.cq_state.clone()); | ||
let pending_cmds_verify = self.commands.verify_restore(saved_state.pending_cmds.clone()); | ||
|
||
if let Err(_) = sq_verify { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is cleaner as:
verify_state.complete(sq_verify.and(cq_verify).and(pending_cmds_verify));
Or even better (imo), but you'd need to check if order of operations is important to you ...:
verify_state.complete(
self.sq.verify_restore(saved_state.sq_state.clone()).and(
self.sq.verify_restore(saved_state.sq_state.clone())
).and(
self.commands.verify_restore(saved_state.pending_cmds.clone())
));
I'm particularly nervous about forgetting something in the future. See https://doc.rust-lang.org/std/result/#boolean-operators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been changed since, the verify_restore
function should panic if there is an error so we shouldn't need to check the output values anymore.
…not implemented error
…and pending commands. Calling this from an RPC call now
…and issuers in the queue_pair
519df41
to
6e2bd10
Compare
@@ -270,15 +289,24 @@ pub struct EmulatedDmaAllocator { | |||
shared_mem: DeviceSharedMemory, | |||
} | |||
|
|||
impl EmulatedDmaAllocator { | |||
pub fn new(shared_mem: DeviceSharedMemory) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was a new function for this removed? It suddenly stopped working for me for some reason so I had to write this instead!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were few changes in this area recently
… the underlying page is free
…e point we validate memory
/// Given an input of the saved state from which the driver was constructed and the underlying | ||
/// memory, this validates the current driver. | ||
#[cfg(test)] | ||
pub(crate) async fn verify_restore(&mut self, saved_state: &NvmeDriverSavedState, mem: MemoryBlock) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand what the purpose of this is. Is this to make sure that restore
restores the thing? It just seems like it's duplicating the logic in restore
. Is that valuable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To put it differently, I don't see the value of this kind of "invasive" testing--we're just checking to see if restore
puts the internal state of our object in the state that we expect. But if we screwed up restore
, isn't it just as likely that we screwed up verify_restore
? Now we have to get it right in two places.
A more effective test would be to validate that after restore
, the driver is in the state we expect by observing the behavior of the driver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point! Initially I was coming at this as something that would test but also prevent regression of restore in the future. Kind of a basic check to verify that values were set correctly. For example, if restore functionality was to ever regress to restore every submission queue with one entry less than intended. This theoretically would not be an invalid state for the driver (I think) but would definitely be incorrect since multiple restores would shrink the queue significantly. In my head, coming up with and testing for every such edge case by looking at behavior alone might prove to be hard.
What I am wondering is, does this type of testing that I built out provide any value at all? Do you think I should scrap this or should I be adding more tests to verify behavior on top of this? I am happy to go either way but interested to know your thoughts on how to proceed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a first step we just need to re-enable restore()
in unit tests. The reason why it was disabled in first place is that it used attach_dma_buffer
call in EmulatedDma, but the team's feedback was to remove it from EmulatedDma because it was mocking non-existing functionality.
You re-added attach_dma_buffer
so I expect that team's feedback will be same, but let's see...
We need to redesign some interfaces to allow unit tests to mock pfn-based access. Or, maybe, hide pfn details in a separate module.
As a next step we would like to generate I/O traffic before/after servicing.
// TODO: Can this be an associated function instead? | ||
#[cfg(test)] | ||
pub(crate) fn verify_restore(&self, saved_state: &SubmissionQueueSavedState) { | ||
assert_eq!(saved_state.sqid, self.sqid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions about these checks:
- Is there a reason to think that saved state value may be different from the restored state value? We literally create a new object from the saved values. If it checks if we did not forget to restore some fields then this can be guaranteed by "destructuring" fields which should be used in most cases, and compiler will give an error if something is missing.
- I would generally prefer to move these checks to the
tests.rs
. We can add public functions to read private values from the object.
fn attach_dma_buffer(&self, _len: usize, _base_pfn: u64) -> anyhow::Result<MemoryBlock> { | ||
anyhow::bail!("restore is not supported for emulated DMA") | ||
fn attach_dma_buffer(&self, len: usize, base_pfn: u64) -> anyhow::Result<MemoryBlock> { | ||
let memory = MemoryBlock::new( self.shared_mem.alloc_specific(len, base_pfn.try_into().unwrap()).context("could not alloc specific. out of memory")?); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess @chris-oo should chime in on this one.
// ===== COPY MEMORY ===== | ||
let host_allocator_original = EmulatedDmaAllocator::new(shared_mem_original.clone()); | ||
let mem_original= DmaClient::attach_dma_buffer(&host_allocator_original, base_len, 0).unwrap(); | ||
copy_mem_block(&mem_original, shared_mem_copy.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we would swap the definition. Since keepalive feature intends to restore data into same memory block, see if we can implement this sequence instead:
- Use original block for 1st controller
- Save
- Copy original block contents to a backup copy
- Restore to original block
- Compare original block with the backup copy
For now, testing the nvme_keepalive functionality for the nvme driver is limited to only running save() and then restore() to verify that restore() is able to run without panics. This does not test for values being reassigned properly after restore() and neither does it check the underlying memory to be the same; the underlying buffers and their contents must be the same after restore.
The following PR will add verify_restore functions to each of the components touched by nvme_keepalive and then call verify_restore() after running restore(). This will harden the testing for the nvme_driver:keepalive to verify current functionality and prevent future regression.
Fixes #817