nvme_driver: test: hardening the save_restore testing for the nvme driver to verify keepalive functionality #815

Open

wants to merge 22 commits into base: main

Commits: 22 (changes shown from 18 commits)
e117566
Fixing emulated.rs after rebase
gurasinghMS Feb 20, 2025
c640a00
Added skeleton for testing objects after restore
gurasinghMS Feb 6, 2025
d7ab79a
Now checking the device_id and the nvme_keepalive values upon restore
gurasinghMS Feb 7, 2025
5c16819
Added verify_restore for queue pair. incomplete
gurasinghMS Feb 7, 2025
63589d6
Checking the state of the QueueHandler after reset but currently has …
gurasinghMS Feb 7, 2025
d803a55
Added verify restore functions for submission queue completion queue …
gurasinghMS Feb 7, 2025
c8b6b3d
Added verify restore for Submission and Completion queues
gurasinghMS Feb 7, 2025
cd92495
Moved to using panics and asserts instead of the Result return types
gurasinghMS Feb 7, 2025
2d616b5
Added verify restor for PendingCommand too
gurasinghMS Feb 7, 2025
1396811
Added in the verify restore for I/O Queues
gurasinghMS Feb 7, 2025
1da4fbc
Commented out code to use the TestMapper instead of using AlignedHeap…
gurasinghMS Feb 11, 2025
ccb7c5f
Verified with Yuri that we don't need to check the values for cancel …
gurasinghMS Feb 11, 2025
5e4c82c
now checking the self.identify value as well
gurasinghMS Feb 11, 2025
58cdec0
Using reference to saved state instead of passing in the object itself
gurasinghMS Feb 11, 2025
c5f7d49
Comments cleanup
gurasinghMS Feb 11, 2025
cb72ac8
Commiting before doing a rebase on main
gurasinghMS Feb 20, 2025
6e2bd10
verify restore functionality is working again
gurasinghMS Feb 20, 2025
76f1a72
Removing some ununsed comments
gurasinghMS Feb 20, 2025
47079d6
I updated the alloc_specific functionality to only give out a page if…
gurasinghMS Feb 21, 2025
4c5974c
Fixed the mismatched memory by moving the Verify RPC call to after th…
gurasinghMS Feb 21, 2025
0d1423b
Pretty-fied the tests
gurasinghMS Feb 21, 2025
536ceed
More pretty-fication
gurasinghMS Feb 21, 2025
51 changes: 51 additions & 0 deletions vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs
@@ -138,6 +138,14 @@ impl IoQueue {
cpu: *cpu,
})
}

#[cfg(test)]
pub(crate) async fn verify_restore(&self, saved_state: &IoQueueSavedState, mem: MemoryBlock) {
self.queue.verify_restore(&saved_state.queue_data, mem);

assert_eq!(saved_state.iv, self.iv as u32);
assert_eq!(saved_state.cpu, self.cpu);
}
}

#[derive(Debug, Inspect)]
@@ -683,6 +691,49 @@ impl<T: DeviceBacking> NvmeDriver<T> {
pub fn update_servicing_flags(&mut self, nvme_keepalive: bool) {
self.nvme_keepalive = nvme_keepalive;
}

/// Given the saved state from which the driver was constructed and the underlying memory,
/// validates the state of the restored driver.
#[cfg(test)]
pub(crate) async fn verify_restore(&mut self, saved_state: &NvmeDriverSavedState, mem: MemoryBlock) {
Member:

I don't quite understand what the purpose of this is. Is this to make sure that restore restores the thing? It just seems like it's duplicating the logic in restore. Is that valuable?

Member:

To put it differently, I don't see the value of this kind of "invasive" testing--we're just checking to see if restore puts the internal state of our object in the state that we expect. But if we screwed up restore, isn't it just as likely that we screwed up verify_restore? Now we have to get it right in two places.

A more effective test would be to validate that after restore, the driver is in the state we expect by observing the behavior of the driver.

Contributor Author:

I see your point! Initially I was coming at this as something that would both test restore and help prevent it from regressing in the future, a basic check that values were set correctly. For example, the restore functionality could regress so that every submission queue is restored with one entry fewer than intended. That theoretically would not be an invalid state for the driver (I think), but it would definitely be incorrect, since repeated restores would shrink the queue significantly. Coming up with and testing for every such edge case by looking at behavior alone might prove hard.

What I am wondering is: does this type of testing that I built out provide any value at all? Do you think I should scrap it, or should I add more tests that verify behavior on top of it? I am happy to go either way but am interested in your thoughts on how to proceed.

Contributor:

As a first step we just need to re-enable restore() in unit tests. The reason it was disabled in the first place is that it used the attach_dma_buffer call in EmulatedDma, but the team's feedback was to remove it from EmulatedDma because it was mocking non-existent functionality.

You re-added attach_dma_buffer, so I expect the team's feedback will be the same, but let's see...

We need to redesign some interfaces to allow unit tests to mock pfn-based access. Or maybe hide the pfn details in a separate module.

As a next step we would like to generate I/O traffic before/after servicing.
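
A minimal sketch of the behavior-based check suggested here, as it might sit in tests.rs next to the existing save/restore test. Only namespace() is taken from the existing test; the generic bound and the read_block helper are hypothetical stand-ins for whatever namespace I/O API is actually exposed.

// Sketch only: exercise the restored driver instead of inspecting its internals.
// Assumes the imports already present in tests.rs; `read_block` is a
// hypothetical placeholder for the real namespace I/O call.
async fn verify_restore_by_behavior<T: DeviceBacking>(restored: &mut NvmeDriver<T>) {
    // The namespace that existed before servicing should still be reachable...
    let ns = restored
        .namespace(1)
        .await
        .expect("namespace usable after restore");

    // ...and should serve I/O without re-initializing the controller.
    let mut buf = vec![0u8; 512];
    ns.read_block(0, &mut buf)
        .await
        .expect("read served after restore"); // hypothetical helper
}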

if let Some(task) = self.task.as_mut() {
task.stop().await;
let worker = task.task();

// Verify Admin Queue
match (&saved_state.worker_data.admin, &worker.admin) {
(None, None) => (),
(Some(admin_saved_state), Some(admin)) => {
// TODO: [expand-verify-restore-functionality] Currently providing the base_pfn value as u64; the try_into conversion might panic
let admin_saved_mem = mem.subblock(admin_saved_state.base_pfn.try_into().unwrap(), admin_saved_state.mem_len);
admin.verify_restore(&admin_saved_state, admin_saved_mem).await;
},
_ => panic!("admin queue states do not match"),
};

// Verify I/O queues in strict ordering
assert_eq!(saved_state.worker_data.io.len(), worker.io.len());
for index in 0..saved_state.worker_data.io.len() {
let io_saved_state = saved_state.worker_data.io[index].clone();
let io_saved_mem = mem.subblock(io_saved_state.queue_data.base_pfn.try_into().unwrap(), io_saved_state.queue_data.mem_len);
worker.io[index].verify_restore(&io_saved_state, io_saved_mem);
}
task.start();
} else {
panic!("task cannot be None() after restore");
}

assert_eq!(saved_state.device_id, self.device_id);

if let Some(identify) = &self.identify {
assert_eq!(saved_state.identify_ctrl.as_bytes(), identify.as_bytes());
} else {
panic!("idenitfy value cannot be None after restore");
}

// TODO: [expand-verify-restore-functionality] Namespace save is currently not supported.
assert!(self.nvme_keepalive);
}
}

async fn handle_asynchronous_events(
58 changes: 58 additions & 0 deletions vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs
@@ -163,6 +163,19 @@ impl PendingCommands {
next_cid_high_bits: Wrapping(*next_cid_high_bits),
})
}

/// Verifies that the state of the PendingCommands matches the given saved state
#[cfg(test)]
pub(crate) fn verify_restore(&self, saved_state: &PendingCommandsSavedState) {
// TODO: [expand-verify-restore-functionality] cid_key_bits are currently unused during restore.
assert_eq!(saved_state.commands.len(), self.commands.len());

for (index, command) in &self.commands {
command.verify_restore(&saved_state.commands[index]);
}

assert_eq!(saved_state.next_cid_high_bits, self.next_cid_high_bits.0);
}
}

impl QueuePair {
@@ -324,6 +337,33 @@ impl QueuePair {
Some(handler_data),
)
}

/// Given the saved state of a queue pair, verifies the constructed queue pair.
/// The input memory block should already be constructed from the saved offsets.
#[cfg(test)]
pub(crate) async fn verify_restore(&self, saved_state: &QueuePairSavedState, saved_mem: MemoryBlock) {
// The entire memory region is checked below. No need for the handler to check it again.
let _ = self.issuer.send.call(Req::Verify, saved_state.handler_data.clone()).await;

// `cancel` and `issuers` params are runtime parameters so we don't check underlying values.
let mut saved_mem_data: [u8; PAGE_SIZE] = [0; PAGE_SIZE];
let mut self_mem_data: [u8; PAGE_SIZE] = [0; PAGE_SIZE];

assert_eq!(saved_mem.len(), self.mem.len());

for page in 0..(saved_mem.len() / PAGE_SIZE) {
saved_mem.read_at(page * PAGE_SIZE, &mut saved_mem_data);
self.mem.read_at(page * PAGE_SIZE, &mut self_mem_data);

for i in 0..PAGE_SIZE {
assert_eq!(saved_mem_data[i], self_mem_data[i]);
}
}

assert_eq!(saved_state.qid, self.qid);
assert_eq!(saved_state.sq_entries, self.sq_entries);
assert_eq!(saved_state.cq_entries, self.cq_entries);
}
}

/// An error issuing an NVMe request.
@@ -583,6 +623,8 @@ enum Req {
Command(Rpc<spec::Command, spec::Completion>),
Inspect(inspect::Deferred),
Save(Rpc<(), Result<QueueHandlerSavedState, anyhow::Error>>),
#[cfg(test)]
Verify(Rpc<QueueHandlerSavedState, ()>),
}

#[derive(Inspect)]
@@ -668,6 +710,14 @@ impl QueueHandler {
// Do not allow any more processing after save completed.
break;
}
#[cfg(test)]
Req::Verify(verify_state) => {
let saved_state = verify_state.input();

self.sq.verify_restore(&saved_state.sq_state);
self.cq.verify_restore(&saved_state.cq_state);
self.commands.verify_restore(&saved_state.pending_cmds);
}
},
Event::Completion(completion) => {
assert_eq!(completion.sqid, self.sq.id());
@@ -718,6 +768,14 @@ impl QueueHandler {
}
}

#[cfg(test)]
impl PendingCommand {
/// Verifies the value of the pending command against a saved state
pub(crate) fn verify_restore(&self, saved_state: &PendingCommandSavedState) {
assert_eq!(saved_state.command, self.command);
}
}

pub(crate) fn admin_cmd(opcode: spec::AdminOpcode) -> spec::Command {
spec::Command {
cdw0: spec::Cdw0::new().with_opcode(opcode.0),
21 changes: 21 additions & 0 deletions vm/devices/storage/disk_nvme/nvme_driver/src/queues.rs
@@ -106,6 +106,17 @@ impl SubmissionQueue {
mem,
})
}

/// Given the saved state, checks the state of the submission queue. Does not verify memory.
// TODO: Can this be an associated function instead?
#[cfg(test)]
pub(crate) fn verify_restore(&self, saved_state: &SubmissionQueueSavedState) {
assert_eq!(saved_state.sqid, self.sqid);
Contributor:

Two questions about these checks:

  1. Is there a reason to think that the saved state value may differ from the restored state value? We literally create a new object from the saved values. If the point is to check that we did not forget to restore some field, that can be guaranteed by "destructuring" the fields (which should be used in most cases), and the compiler will give an error if something is missing (see the sketch after this hunk).
  2. I would generally prefer to move these checks to tests.rs. We can add public functions to read private values from the object.

assert_eq!(saved_state.head, self.head);
assert_eq!(saved_state.tail, self.tail);
assert_eq!(saved_state.committed_tail, self.committed_tail);
assert_eq!(saved_state.len, self.len);
}
}
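
For reference, the destructuring approach mentioned in the comment above could look roughly like the following sketch. It assumes the fields shown are the complete set in SubmissionQueueSavedState, which may not match the real definition.

// Sketch: destructure the saved state so that a field added later but not
// handled here becomes a compile error instead of a silently missing check.
let SubmissionQueueSavedState {
    sqid,
    head,
    tail,
    committed_tail,
    len,
} = saved_state;

assert_eq!(*sqid, self.sqid);
assert_eq!(*head, self.head);
assert_eq!(*tail, self.tail);
assert_eq!(*committed_tail, self.committed_tail);
assert_eq!(*len, self.len);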

#[derive(Inspect)]
@@ -201,6 +212,16 @@ impl CompletionQueue {
mem,
})
}

/// Given the saved state, checks the state of the completion queue. Does not verify memory.
#[cfg(test)]
pub(crate) fn verify_restore(&self, saved_state: &CompletionQueueSavedState) {
assert_eq!(saved_state.cqid, self.cqid);
assert_eq!(saved_state.head, self.head);
assert_eq!(saved_state.committed_head, self.committed_head);
assert_eq!(saved_state.len, self.len);
assert_eq!(saved_state.phase, self.phase);
}
}

fn advance(n: u32, l: u32) -> u32 {
27 changes: 21 additions & 6 deletions vm/devices/storage/disk_nvme/nvme_driver/src/tests.rs
@@ -20,6 +20,7 @@ use std::sync::Arc;
use test_with_tracing::test;
use user_driver::emulated::DeviceSharedMemory;
use user_driver::emulated::EmulatedDevice;
use user_driver::emulated::EmulatedDmaAllocator;
use user_driver::emulated::Mapping;
use user_driver::interrupt::DeviceInterrupt;
use user_driver::DeviceBacking;
@@ -233,6 +234,7 @@ async fn test_nvme_driver(driver: DefaultDriver, allow_dma: bool) {
}

async fn test_nvme_save_restore_inner(driver: DefaultDriver) {
// ===== SHARED RESOURCES =====
const MSIX_COUNT: u16 = 2;
const IO_QUEUE_COUNT: u16 = 64;
const CPU_COUNT: u32 = 64;
@@ -261,12 +263,16 @@ async fn test_nvme_save_restore_inner(driver: DefaultDriver) {
.await
.unwrap();

let device = EmulatedDevice::new(nvme_ctrl, msi_x, mem);
// ===== FIRST DRIVER INIT =====
let device = EmulatedDevice::new(nvme_ctrl, msi_x, mem.clone());
let mut nvme_driver = NvmeDriver::new(&driver_source, CPU_COUNT, device)
.await
.unwrap();
let _ns1 = nvme_driver.namespace(1).await.unwrap();
let saved_state = nvme_driver.save().await.unwrap();

// Tear down the original driver to kill the underlying tasks.
nvme_driver.shutdown().await;
// As of today we do not save namespace data to avoid possible conflict
// when namespace has changed during servicing.
// TODO: Review and re-enable in future.
@@ -299,11 +305,20 @@ async fn test_nvme_save_restore_inner(driver: DefaultDriver) {
// Wait for CSTS.RDY to set.
backoff.back_off().await;

let _new_device = EmulatedDevice::new(new_nvme_ctrl, new_msi_x, new_emu_mem);
// TODO: Memory restore is disabled for emulated DMA, uncomment once fixed.
// let _new_nvme_driver = NvmeDriver::restore(&driver_source, CPU_COUNT, new_device, &saved_state)
// .await
// .unwrap();
// ===== SECOND DRIVER INIT =====
let mem_new = DeviceSharedMemory::new(base_len, payload_len);
let new_device = EmulatedDevice::new(new_nvme_ctrl, new_msi_x, mem_new.clone());
let mut new_nvme_driver = NvmeDriver::restore(&driver_source, CPU_COUNT, new_device, &saved_state)
.await
.unwrap();


// ===== VERIFY RESTORE =====
let host_allocator = EmulatedDmaAllocator::new(mem_new.clone());
let verify_mem = DmaClient::attach_dma_buffer(&host_allocator, base_len, 0).unwrap();

// Verify restore functions will panic if verification failed.
new_nvme_driver.verify_restore(&saved_state, verify_mem).await;
}

#[derive(Inspect)]
4 changes: 2 additions & 2 deletions vm/devices/storage/nvme_spec/src/lib.rs
@@ -127,7 +127,7 @@ pub struct Aqa {
}

#[repr(C)]
#[derive(Copy, Clone, Debug, IntoBytes, Immutable, KnownLayout, FromBytes, Inspect)]
#[derive(Copy, Clone, Debug, IntoBytes, Immutable, KnownLayout, FromBytes, Inspect, PartialEq)]
pub struct Command {
pub cdw0: Cdw0,
pub nsid: u32,
@@ -146,7 +146,7 @@ pub struct Command {

#[derive(Inspect)]
#[bitfield(u32)]
#[derive(IntoBytes, Immutable, KnownLayout, FromBytes)]
#[derive(IntoBytes, Immutable, KnownLayout, FromBytes, PartialEq)]
pub struct Cdw0 {
pub opcode: u8,
#[bits(2)]
28 changes: 26 additions & 2 deletions vm/devices/user_driver/src/emulated.rs
@@ -224,6 +224,21 @@ impl DeviceSharedMemory {
state: self.state.clone(),
})
}

// TODO: [nvme-keepalive-testing]
// This is only a stop-gap until we can swap out the back end of nvme tests to use real memory
pub fn alloc_specific(&self, len: usize, base_pfn: u64) -> Option<DmaBuffer> {
// Hands back the requested pfn range directly; nothing is reserved in the
// shared allocation state, so the caller is responsible for the range being valid.
assert!(len % PAGE_SIZE == 0);
let count = len / PAGE_SIZE;
let start_page = base_pfn as usize;

let pages = (start_page..start_page + count).map(|p| p as u64).collect();
Some(DmaBuffer {
mem: self.mem.clone(),
pfns: pages,
state: self.state.clone(),
})
}
}

pub struct DmaBuffer {
@@ -270,15 +285,24 @@ pub struct EmulatedDmaAllocator {
shared_mem: DeviceSharedMemory,
}

impl EmulatedDmaAllocator {
pub fn new(shared_mem: DeviceSharedMemory) -> Self {
Contributor Author:

Was a new function for this removed? It suddenly stopped working for me for some reason so I had to write this instead!

Contributor:

There were a few changes in this area recently.

EmulatedDmaAllocator {
shared_mem,
}
}
}

impl DmaClient for EmulatedDmaAllocator {
fn allocate_dma_buffer(&self, len: usize) -> anyhow::Result<MemoryBlock> {
let memory = MemoryBlock::new(self.shared_mem.alloc(len).context("out of memory")?);
memory.as_slice().atomic_fill(0);
Ok(memory)
}

fn attach_dma_buffer(&self, _len: usize, _base_pfn: u64) -> anyhow::Result<MemoryBlock> {
anyhow::bail!("restore is not supported for emulated DMA")
fn attach_dma_buffer(&self, len: usize, base_pfn: u64) -> anyhow::Result<MemoryBlock> {
let memory = MemoryBlock::new(self.shared_mem.alloc_specific(len, base_pfn).context("could not allocate the requested pages")?);
Ok(memory)
}
}
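
A rough usage sketch of how the two additions above are meant to work together in a keepalive-style test: one client allocates pages normally, and after a simulated servicing another client re-attaches the region by pfn instead of allocating. The sizes are arbitrary example values, and the DmaClient trait is assumed to be in scope.

// Sketch: allocate through the normal path, then re-attach the same shared
// memory by pfn (the save/restore test attaches the whole region at pfn 0).
let shared = DeviceSharedMemory::new(0x10000, 0);
let allocator = EmulatedDmaAllocator::new(shared.clone());

// Driver-side allocation before "servicing".
let _original = allocator.allocate_dma_buffer(0x2000).expect("alloc");

// Verification/restore path: attach without allocating anything new.
let reattached = allocator.attach_dma_buffer(0x10000, 0).expect("attach");
assert_eq!(reattached.len(), 0x10000);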
