
[DRAFT] nvme_driver: fix overflow computing queue size and don't wait for bad devices forever #682

Draft · wants to merge 12 commits into main
Conversation

@gurasinghMS (Contributor) commented Jan 16, 2025

While running the fuzzer locally I came across two bugs, causing a panic and a timeout respectively:

  • Computing the queue size for the I/O queues can panic on an addition overflow when the queue size is u16::MAX. To fix this, the code now uses saturating_add (a sketch follows this list).
  • A timeout-related bug shows up when the driver tries to reset the underlying device after failures: if the CFG bit is set, the driver never times out of the retry attempts and instead loops forever waiting for the controller to report NOT_READY.
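A minimal sketch of the overflow fix, assuming a hypothetical helper (queue_size, entries) rather than the driver's actual code; NVMe encodes queue sizes 0-based, so one plausible shape of the bug is a bare + 1 on the entry count:

    // Hypothetical illustration only: with a 0-based queue size of
    // u16::MAX, a plain `entries + 1` overflows and panics in debug
    // builds; saturating_add clamps the result instead.
    fn queue_size(entries: u16) -> u16 {
        entries.saturating_add(1)
    }

    fn main() {
        assert_eq!(queue_size(63), 64);
        assert_eq!(queue_size(u16::MAX), u16::MAX);
    }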

@gurasinghMS gurasinghMS requested review from a team as code owners January 16, 2025 22:05
@chris-oo (Member) commented:

please add a description for the change.

@mattkur (Contributor) commented Jan 16, 2025

> please add a description for the change.

I would also ask you to add a test here, but I know that you found these with changes to the nvme driver fuzzer you're working on. So I think getting those fuzzer changes in is a decent test and the best test ROI.

@@ -116,6 +126,9 @@ impl<T: DeviceRegisterIo + Inspect> Bar0<T> {
             if u32::from(csts) == !0 {
                 break false;
             }
+            if start.elapsed() >= timeout {
+                break false;
+            }
A reviewer (Member) commented on these lines:

The caller seems to rely on this actually resetting the device before it frees buffers. With this change, you might now return with the device still referencing these buffers. And that can lead to memory corruption.

A reviewer (Member) suggested:

            // Hold onto responses until the reset completes so that waiting IOs do
            // not think the memory is unaliased by the device.
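To make the buffer-lifetime concern concrete, here is a hedged sketch of the pattern being asked for; the DmaBuffers type and finish_reset function are illustrative only and do not appear in the driver:

    // Hypothetical illustration: if the reset path can time out, the error
    // path must not hand the memory back to the allocator, because the
    // device may still be writing into it.
    struct DmaBuffers(Vec<u8>);

    fn finish_reset(timed_out: bool, buffers: DmaBuffers) {
        if timed_out {
            // Device state is unknown; park the buffers (here, leak them)
            // rather than freeing memory the device might still reference.
            std::mem::forget(buffers);
        } else {
            // Reset confirmed; the device no longer aliases the memory.
            drop(buffers);
        }
    }

    fn main() {
        finish_reset(true, DmaBuffers(vec![0u8; 4096]));
    }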

@jstarks (Member) commented Jan 16, 2025

The PR title is misleading--it implies that this is primarily a change to the fuzzer. But the meaningful changes are to the driver. It should be prefixed with "nvme_driver", not "fuzz_nvme_driver".

@mattkur mattkur changed the title fuzz_nvme_driver: Minor bugs found from local testing on the fuzzer nvme_driver: fix overflow computing queue size and don't wait for bad devices forever Jan 16, 2025
@mattkur (Contributor) commented Jan 16, 2025

> The PR title is misleading--it implies that this is primarily a change to the fuzzer. But the meaningful changes are to the driver. It should be prefixed with "nvme_driver", not "fuzz_nvme_driver".

I was already changing it, but @gurasinghMS: I want to echo this feedback for future reference (see my suggested new title as an example).

@jstarks (Member) commented Jan 16, 2025

Are we worried about the timeout causing additional issues in a virtualized environment under load? What's the CAP.TO value in practice for the devices we care about?

@mattkur (Contributor) commented Jan 16, 2025

> Are we worried about the timeout causing additional issues in a virtualized environment under load? What's the CAP.TO value in practice for the devices we care about?

Yeah, I am concerned about this, but mostly in the sense that we'd read some huge value and hang for a very long time. That reduces to the current behavior if you shrink your observation time window (we currently hang forever...).

Are you concerned about a CAP.TO that's too low such that we time out before the device has actually had a chance to get ready?

@jstarks (Member) commented Jan 16, 2025

> Are you concerned about a CAP.TO that's too low such that we time out before the device has actually had a chance to get ready?

Yes.
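For context on the values under discussion, here is a hedged sketch of how a reset wait budget could be derived from CAP.TO; the helper name is hypothetical and this is not the driver's actual code, but the 500 ms unit comes from the NVMe specification:

    use std::time::Duration;

    // CAP.TO is an 8-bit field giving the worst-case time, in 500 ms units,
    // that the host should wait for CSTS.RDY to change state. The maximum
    // value (0xFF) therefore implies roughly a 127.5 s budget, which is why
    // both a too-small CAP.TO (timing out early) and a huge CAP.TO (hanging
    // for minutes) are concerns in this thread.
    fn reset_timeout(cap_to: u8) -> Duration {
        Duration::from_millis(u64::from(cap_to) * 500)
    }

    fn main() {
        assert_eq!(reset_timeout(0xFF), Duration::from_millis(127_500));
    }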

@gurasinghMS gurasinghMS marked this pull request as draft January 22, 2025 19:00
@gurasinghMS gurasinghMS changed the title nvme_driver: fix overflow computing queue size and don't wait for bad devices forever [DRAFT] nvme_driver: fix overflow computing queue size and don't wait for bad devices forever Jan 22, 2025
@gurasinghMS (Contributor, Author) commented:

Pausing work on the PCIe interface fuzzing and marking this PR as a Draft for the time being. Will pick this work back up after working on the NVMe interface fuzzing.
