
nvme_driver: test: hardening the save_restore testing for the nvme driver to verify keepalive functionality #815

Open · wants to merge 22 commits into base: main

Conversation

@gurasinghMS (Contributor) commented Feb 7, 2025

For now, testing of the nvme_keepalive functionality in the nvme driver is limited to running save() followed by restore() and verifying that restore() completes without panicking. This does not check that values are reassigned correctly after restore(), nor that the underlying memory is unchanged; the underlying buffers and their contents must be identical after restore.
This PR adds a verify_restore function to each component touched by nvme_keepalive and calls verify_restore() after restore(). This hardens the testing of the nvme_driver keepalive path to verify current functionality and prevent future regressions.

Fixes #817
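The hardened flow described above can be sketched with a toy driver; all names here (Driver, DriverState, verify_restore) are illustrative stand-ins, not the real nvme_driver API:

```rust
#[derive(Clone, PartialEq, Debug)]
struct DriverState {
    qid: u16,
    buffer: Vec<u8>,
}

struct Driver {
    state: DriverState,
}

impl Driver {
    fn save(&self) -> DriverState {
        self.state.clone()
    }

    fn restore(saved: &DriverState) -> Driver {
        Driver { state: saved.clone() }
    }

    // The hardening step: after restore(), check the rebuilt state against
    // the saved state instead of only checking that restore() did not panic.
    fn verify_restore(&self, saved: &DriverState) {
        assert_eq!(self.state, *saved);
    }
}

fn main() {
    let driver = Driver {
        state: DriverState { qid: 3, buffer: vec![1, 2, 3] },
    };
    let saved = driver.save();
    let restored = Driver::restore(&saved);
    restored.verify_restore(&saved);
    println!("restore verified");
}
```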

let worker = task.task();

// Verify Admin Queue
// TODO: [expand-verify-restore-functionality] Currently providing base_pfn value in u64, this might panic
Contributor:
This tag is tracked by #818, but maybe we can just make it all part of this PR?

Contributor (author):

I was planning on getting everything into one PR to avoid breaking things up into very small parts, which is why I hadn't opened an issue for it yet.

let cq_verify = self.cq.verify_restore(saved_state.cq_state.clone());
let pending_cmds_verify = self.commands.verify_restore(saved_state.pending_cmds.clone());

if let Err(_) = sq_verify {
Contributor:

This is cleaner as:

verify_state.complete(sq_verify.and(cq_verify).and(pending_cmds_verify));

Or, even better (imo), though you'd need to check whether the order of evaluation matters to you:

verify_state.complete(
    self.sq.verify_restore(saved_state.sq_state.clone())
        .and(self.cq.verify_restore(saved_state.cq_state.clone()))
        .and(self.commands.verify_restore(saved_state.pending_cmds.clone())),
);

I'm particularly nervous about forgetting something in the future. See https://doc.rust-lang.org/std/result/#boolean-operators
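The `Result::and` chaining suggested above can be seen in a minimal, self-contained sketch; `check` is a hypothetical stand-in for the verify_restore calls:

```rust
// Stand-in for a verify_restore call that returns Ok on match, Err otherwise.
fn check(name: &str, ok: bool) -> Result<(), String> {
    if ok {
        Ok(())
    } else {
        Err(format!("{name} mismatch"))
    }
}

fn main() {
    // `and` returns the second Result if the first is Ok, otherwise the
    // first Err. Note all three checks are evaluated eagerly; only the
    // *result* short-circuits to the first Err encountered.
    let combined = check("sq", true)
        .and(check("cq", true))
        .and(check("pending_cmds", false));
    assert_eq!(combined, Err("pending_cmds mismatch".to_string()));
    println!("{combined:?}");
}
```

One design note: because the arguments to `and` are evaluated before the call, this pattern reports only the first failure but still runs every check.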

Contributor (author):

This has since been changed; verify_restore() now panics if there is an error, so we no longer need to check the return values.

@gurasinghMS gurasinghMS marked this pull request as ready for review February 20, 2025 00:06
@gurasinghMS gurasinghMS requested review from a team as code owners February 20, 2025 00:06
@gurasinghMS gurasinghMS marked this pull request as draft February 20, 2025 00:35
@gurasinghMS gurasinghMS force-pushed the user/gurasingh/verify-restore-for-reattach-test branch from 519df41 to 6e2bd10 on February 20, 2025 22:10
@@ -270,15 +289,24 @@ pub struct EmulatedDmaAllocator {
shared_mem: DeviceSharedMemory,
}

impl EmulatedDmaAllocator {
pub fn new(shared_mem: DeviceSharedMemory) -> Self {
Contributor (author):

Was a `new` function for this removed? It suddenly stopped working for me, so I had to write this instead!

Contributor:

There were a few changes in this area recently.

@gurasinghMS gurasinghMS marked this pull request as ready for review February 21, 2025 21:33
/// Given an input of the saved state from which the driver was constructed and the underlying
/// memory, this validates the current driver.
#[cfg(test)]
pub(crate) async fn verify_restore(&mut self, saved_state: &NvmeDriverSavedState, mem: MemoryBlock) {
Member:

I don't quite understand what the purpose of this is. Is this to make sure that restore restores the thing? It just seems like it's duplicating the logic in restore. Is that valuable?

Member:

To put it differently, I don't see the value of this kind of "invasive" testing: we're just checking to see if restore puts the internal state of our object in the state that we expect. But if we screwed up restore, isn't it just as likely that we screwed up verify_restore? Now we have to get it right in two places.

A more effective test would be to validate that after restore, the driver is in the state we expect by observing the behavior of the driver.

Contributor (author):

I see your point! Initially I approached this as something that would both test restore and prevent future regressions: a basic check that values were set correctly. For example, if the restore functionality ever regressed to restore every submission queue with one entry fewer than intended, that would theoretically not be an invalid state for the driver (I think), but it would definitely be incorrect, since repeated restores would shrink the queue significantly. Coming up with, and testing for, every such edge case by looking at behavior alone might prove hard.

What I am wondering is: does the type of testing I built out provide any value at all? Should I scrap it, or should I add more behavior-based tests on top of it? I am happy to go either way, but I'm interested in your thoughts on how to proceed.

Contributor:

As a first step we just need to re-enable restore() in unit tests. The reason it was disabled in the first place is that it used the attach_dma_buffer call in EmulatedDma, and the team's feedback was to remove that from EmulatedDma because it mocked non-existent functionality.

You re-added attach_dma_buffer, so I expect the team's feedback will be the same, but let's see...

We need to redesign some interfaces to allow unit tests to mock PFN-based access. Or, maybe, hide the PFN details in a separate module.

As a next step we would like to generate I/O traffic before/after servicing.

// TODO: Can this be an associated function instead?
#[cfg(test)]
pub(crate) fn verify_restore(&self, saved_state: &SubmissionQueueSavedState) {
assert_eq!(saved_state.sqid, self.sqid);
Contributor:

Two questions about these checks:

  1. Is there a reason to think that saved state value may be different from the restored state value? We literally create a new object from the saved values. If it checks if we did not forget to restore some fields then this can be guaranteed by "destructuring" fields which should be used in most cases, and compiler will give an error if something is missing.
  2. I would generally prefer to move these checks to tests.rs. We can add public functions to read private values from the object.
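The "destructuring" guarantee mentioned in point 1 can be sketched as follows; the struct fields here are illustrative, not the full saved-state definition:

```rust
struct SubmissionQueueSavedState {
    sqid: u16,
    head: u32,
    tail: u32,
}

fn verify_restore(restored: &SubmissionQueueSavedState, saved: &SubmissionQueueSavedState) {
    // Destructuring binds every field by name. If a new field is later added
    // to the struct, this pattern stops compiling, so the compiler itself
    // catches a forgotten field instead of a hand-written checklist.
    let SubmissionQueueSavedState { sqid, head, tail } = saved;
    assert_eq!(*sqid, restored.sqid);
    assert_eq!(*head, restored.head);
    assert_eq!(*tail, restored.tail);
}

fn main() {
    let saved = SubmissionQueueSavedState { sqid: 1, head: 0, tail: 4 };
    let restored = SubmissionQueueSavedState { sqid: 1, head: 0, tail: 4 };
    verify_restore(&restored, &saved);
    println!("verified");
}
```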

-fn attach_dma_buffer(&self, _len: usize, _base_pfn: u64) -> anyhow::Result<MemoryBlock> {
-    anyhow::bail!("restore is not supported for emulated DMA")
+fn attach_dma_buffer(&self, len: usize, base_pfn: u64) -> anyhow::Result<MemoryBlock> {
+    let memory = MemoryBlock::new(
+        self.shared_mem
+            .alloc_specific(len, base_pfn.try_into().unwrap())
+            .context("could not alloc specific. out of memory")?,
+    );
Contributor:

I guess @chris-oo should chime in on this one.
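As an aside on the TODO about base_pfn, the `try_into().unwrap()` in the snippet above can be made fallible instead of panicking. A minimal sketch using only std; `attach` is a hypothetical stand-in, not the real DmaClient API:

```rust
// Convert a u64 base_pfn to usize without unwrap(); on a 32-bit target a
// large PFN becomes a proper error instead of a panic.
fn attach(len: usize, base_pfn: u64) -> Result<(usize, usize), String> {
    let pfn: usize = base_pfn
        .try_into()
        .map_err(|_| format!("base_pfn {base_pfn} does not fit in usize"))?;
    Ok((len, pfn))
}

fn main() {
    assert_eq!(attach(4096, 42), Ok((4096, 42)));
    println!("{:?}", attach(4096, 42));
}
```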

// ===== COPY MEMORY =====
let host_allocator_original = EmulatedDmaAllocator::new(shared_mem_original.clone());
let mem_original = DmaClient::attach_dma_buffer(&host_allocator_original, base_len, 0).unwrap();
copy_mem_block(&mem_original, shared_mem_copy.clone());
Contributor:

Ideally we would swap the definition. Since the keepalive feature intends to restore data into the same memory block, see if we can implement this sequence instead:

  1. Use original block for 1st controller
  2. Save
  3. Copy original block contents to a backup copy
  4. Restore to original block
  5. Compare original block with the backup copy
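The five steps above can be sketched with a plain `Vec<u8>` as a stand-in for the DMA memory block; `restore_in_place` is hypothetical, not the real driver API:

```rust
// Hypothetical in-place restore: keepalive restores into the same block,
// so it must not disturb the buffer contents.
fn restore_in_place(_block: &mut Vec<u8>) {}

fn main() {
    // 1. Use the original block for the first controller.
    let mut original: Vec<u8> = vec![0u8; 16];
    original[0] = 0xAB; // the controller writes some state

    // 2. Save the driver state.
    // 3. Copy the original block contents to a backup copy.
    let backup = original.clone();

    // 4. Restore into the original block.
    restore_in_place(&mut original);

    // 5. Compare the original block with the backup copy.
    assert_eq!(original, backup);
    println!("blocks match");
}
```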

Successfully merging this pull request may close: nvme_driver: first save/restore test
4 participants