
[DRAFT] nvme_driver: fix overflow computing queue size and don't wait for bad devices forever #682

Draft · wants to merge 12 commits into main
Conversation

@gurasinghMS (Contributor) commented Jan 16, 2025

While running the fuzzer locally I came across two bugs, causing a panic and a timeout respectively:

  • Computing the queue size for the I/O queues can panic on an addition overflow when the queue size is u16::MAX. To fix this, the code now uses saturating_add (a sketch follows this list).
  • A timeout-related bug shows up when the driver tries to reset the underlying device after failures: if the CFG bit is set, the driver never times out of the retry attempts and instead loops forever waiting for the controller to report NOT_READY.
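A minimal sketch of the overflow fix, assuming a hypothetical helper (queue_size, entries) rather than the driver's actual code; NVMe encodes queue sizes 0-based, so one plausible shape of the bug is a bare + 1 on the entry count:

    // Hypothetical illustration only: with a 0-based queue size of
    // u16::MAX, a plain `entries + 1` overflows and panics in debug
    // builds; saturating_add clamps the result instead.
    fn queue_size(entries: u16) -> u16 {
        entries.saturating_add(1)
    }

    fn main() {
        assert_eq!(queue_size(63), 64);
        assert_eq!(queue_size(u16::MAX), u16::MAX);
    }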

@gurasinghMS gurasinghMS requested review from a team as code owners January 16, 2025 22:05
@chris-oo (Member) commented:

please add a description for the change.

@mattkur (Contributor) commented Jan 16, 2025

> please add a description for the change.

I would also ask you to add a test here, but I know that you found these with changes to the nvme driver fuzzer you're working on. So I think getting those fuzzer changes in is a decent test and the best test ROI.

@@ -116,6 +126,9 @@ impl<T: DeviceRegisterIo + Inspect> Bar0<T> {
             if u32::from(csts) == !0 {
                 break false;
             }
+            if start.elapsed() >= timeout {
+                break false;
+            }
A reviewer (Member) commented on these lines:

The caller seems to rely on this actually resetting the device before it frees buffers. With this change, you might now return with the device still referencing these buffers. And that can lead to memory corruption.

A reviewer (Member) suggested:

            // Hold onto responses until the reset completes so that waiting IOs do
            // not think the memory is unaliased by the device.
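To make the buffer-lifetime concern concrete, here is a hedged sketch of the pattern being asked for; the DmaBuffers type and finish_reset function are illustrative only and do not appear in the driver:

    // Hypothetical illustration: if the reset path can time out, the error
    // path must not hand the memory back to the allocator, because the
    // device may still be writing into it.
    struct DmaBuffers(Vec<u8>);

    fn finish_reset(timed_out: bool, buffers: DmaBuffers) {
        if timed_out {
            // Device state is unknown; park the buffers (here, leak them)
            // rather than freeing memory the device might still reference.
            std::mem::forget(buffers);
        } else {
            // Reset confirmed; the device no longer aliases the memory.
            drop(buffers);
        }
    }

    fn main() {
        finish_reset(true, DmaBuffers(vec![0u8; 4096]));
    }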

@jstarks (Member) commented Jan 16, 2025

The PR title is misleading--it implies that this is primarily a change to the fuzzer. But the meaningful changes are to the driver. It should be prefixed with "nvme_driver", not "fuzz_nvme_driver".

@mattkur mattkur changed the title fuzz_nvme_driver: Minor bugs found from local testing on the fuzzer nvme_driver: fix overflow computing queue size and don't wait for bad devices forever Jan 16, 2025
@mattkur (Contributor) commented Jan 16, 2025

> The PR title is misleading--it implies that this is primarily a change to the fuzzer. But the meaningful changes are to the driver. It should be prefixed with "nvme_driver", not "fuzz_nvme_driver".

I was already changing it, but @gurasinghMS: I want to echo this feedback for future reference (see my suggested new title as an example).

@jstarks (Member) commented Jan 16, 2025

Are we worried about the timeout causing additional issues in a virtualized environment under load? What's the CAP.TO value in practice for the devices we care about?

@mattkur (Contributor) commented Jan 16, 2025

> Are we worried about the timeout causing additional issues in a virtualized environment under load? What's the CAP.TO value in practice for the devices we care about?

Yeah, I am concerned about this, but mostly in the sense that we'd read some huge value and hang for a very long time. That reduces to the current behavior if you shrink your observation time window (we currently hang forever...).

Are you concerned about a CAP.TO that's too low such that we time out before the device has actually had a chance to get ready?

@jstarks (Member) commented Jan 16, 2025

> Are you concerned about a CAP.TO that's too low such that we time out before the device has actually had a chance to get ready?

Yes.
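For context on the values under discussion, here is a hedged sketch of how a reset wait budget could be derived from CAP.TO; the helper name is hypothetical and this is not the driver's actual code, but the 500 ms unit comes from the NVMe specification:

    use std::time::Duration;

    // CAP.TO is an 8-bit field giving the worst-case time, in 500 ms units,
    // that the host should wait for CSTS.RDY to change state. The maximum
    // value (0xFF) therefore implies roughly a 127.5 s budget, which is why
    // both a too-small CAP.TO (timing out early) and a huge CAP.TO (hanging
    // for minutes) are concerns in this thread.
    fn reset_timeout(cap_to: u8) -> Duration {
        Duration::from_millis(u64::from(cap_to) * 500)
    }

    fn main() {
        assert_eq!(reset_timeout(0xFF), Duration::from_millis(127_500));
    }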

@gurasinghMS gurasinghMS marked this pull request as draft January 22, 2025 19:00
@gurasinghMS gurasinghMS changed the title nvme_driver: fix overflow computing queue size and don't wait for bad devices forever [DRAFT] nvme_driver: fix overflow computing queue size and don't wait for bad devices forever Jan 22, 2025
@gurasinghMS (Contributor, Author) commented:

Pausing work on the PCIe interface fuzzing and marking this PR as a Draft for the time being. Will pick this work back up after working on the NVMe interface fuzzing.
