Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: extend API request queue's memory of prior requests #880

Merged
merged 8 commits into from
Mar 11, 2025

Conversation

gjcolombo
Copy link
Contributor

Today the server's API request queue decides whether to accept or reject incoming requests by storing a fixed disposition for each kind of request and adjusting those dispositions as new requests are queued or as a VM's state changes. This is not always enough information to maintain the proper dispositions. Consider the following example:

  1. A VM begins to start. Reboot requests are denied in this state (since a VM that hasn't booted yet can't be rebooted).
  2. A caller queues a request to stop the VM after it starts. All subsequent reboot requests should now be denied because the VM will halt before they can be serviced.
  3. The VM successfully starts. The correct reboot disposition is still "deny," because there's a pending stop request, but there's no way to discern this from the prior reboot disposition (it was also "deny" before the stop request was queued!).

To address this sort of problem, and to pave the way for better handling of Crucible VCR replacements during instance start, refactor the queue as follows:

  • Remember what kinds of outstanding requests have not been processed and dispose of requests based on this state and the queue's notion of the instance's overall state.
  • Break external API requests into "state change" and "component change" requests. This will allow them to be dequeued independently when the state driver wants to handle component changes without changing the order of state change requests.
  • Tweak the language the state driver uses to communicate with the queue: instead of saying "the VM is now in state X," the driver says "I handled (or failed to handle) a request of type Y."

Tests: cargo test, PHD.

Fixes #879.

Today the server's API request queue decides whether to accept or reject
incoming requests by storing a fixed disposition for each kind of
request and adjusting those dispositions as new requests are queued or
as a VM's state changes. This is not always enough information to
maintain the proper dispositions. Consider the following example:

1. A VM begins to start. Reboot requests are denied in this state (since
   a VM that hasn't booted yet can't be rebooted).
2. A caller queues a request to stop the VM after it starts. All
   subsequent reboot requests should now be denied because the VM will
   halt before they can be serviced.
3. The VM successfully starts. The correct reboot disposition is still
   "deny," because there's a pending stop request, but there's no way to
   discern this from the prior reboot disposition (it was also "deny"
   before the stop request was queued!).

To address this sort of problem, and to pave the way for better handling
of Crucible VCR replacements during instance start, refactor the queue
as follows:

- Remember what kinds of outstanding requests have not been processed
  and dispose of requests based on this state and the queue's notion
  of the instance's overall state.
- Break external API requests into "state change" and "component change"
  requests. This will allow them to be dequeued independently when the
  state driver wants to handle component changes without changing the
  order of state change requests.
- Tweak the language the state driver uses to communicate with the
  queue: instead of saying "the VM is now in state X," the driver says
  "I handled (or failed to handle) a request of type Y."

Tests: cargo test, PHD.
Comment on lines +318 to +321
/// Pops the next request off of the queue. If the queue contains both state
/// change and component change requests, the next state change request is
/// popped first (even if it arrived later in time than the next component
/// change request).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also drafted a version of this code that pops the oldest request regardless of its kind, but maintains a separate index of component change requests so that the state driver can ask to dequeue the oldest such request while leaving the state change requests alone. (The point of the exercise is to give the driver a way to pop VCR replacement requests while starting the VM without having to buffer any state change requests; see #873 for more discussion.)

I ended up not going with this because (a) it's simpler not to have the separate change request index, and (b) it also seemed to me to make some sense to privilege external API requests (like requests to stop) over internal housekeeping operations like VCR replacement. But I'd like to hear if others feel strongly here (it's not that hard to bring that code back).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with your rationale here. I don't know if the property of having one total ordering across state-change and component-change requests is actually important, given that the whole point of this change is to allow component change requests to be serviced whilst a state change is in progress, and the disposition to subsequent component change requests based on in-flight state changes is now tracked outside of the actual queue, rather than based on the order of requests in the queue. So I think the simpler solution of just having two queues is fine.

Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks great to me! I left some fairly minor notes, but I'm down with the overall design here.

This was the request queue's previous behavior. It ensures that a
request to stop drives the current Propolis to a halted state, but does
not guarantee that the "logical" VM isn't running somewhere else (i.e.
in a migration target).
@gjcolombo gjcolombo merged commit 7cd7c88 into master Mar 11, 2025
11 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/refactor-server-request-queue branch March 11, 2025 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

server: state driver request queue dispositions are difficult to manage
2 participants