-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add debug logging for all Oxide API requests #1053
base: main
Are you sure you want to change the base?
Conversation
0b5b0b3
to
3060f3c
Compare
Currently only the `auth` CLI subcommand will log any request details when debug logging is enabled. Adding debug logging for all requests will enable users to troubleshoot issues on any API endpoint. To do this, we add add pre and post-request hooks via Progenitor. However, this does not lend itself to tracking the duration of the request, or creating a single log event with the details from the the `Request` and `Response`. We add a `LogCtx` inner type to the client to act as a store while requests are in flight, but need something to associate the start and end. We could inject a header into the request, but on failure this will not be available to us. Instead, we use the tokio `task::Id` as an identifier, which is unique within a given runtime. This raises the possibility of a client shared between runtimes associating the wrong start with a finished request. We do not have a way to detect which runtime a task is associated with, `runtime::Id` is an unstable feature, but we will log a warning if we find a response without a corresponding request, along with the response details. When a request is run outside of a task, we fall back to logging separate events for start and end. We also create a `Span` that includes the log event. This is ignored by the CLI, but may be useful for other consumers of the SDK that want to integrate it with their monitoring infrastructure. An example event: { "timestamp": "2025-03-12T15:25:58.487790Z", "level": "DEBUG", "message": "request succeeded", "url": "https://oxide.sys.r3.oxide-preview.com/v1/disks?project=will", "path": "/v1/disks", "remote_addr": "45.154.216.35:443", "http.request.method": "GET", "http.response.content_length": 998, "http.response.status_code": 200, "start_time": "2025-03-12T15:25:58.048984Z", "duration_ms": 438, "oxide.request_id": "5d738e7a-ad78-4938-b682-c43f7b2a84be", "target": "oxide::tracing" } Closes #1014
3060f3c
to
c612eea
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments. I'll be interested to discuss the path forward for this idea with you. I'd be interested to understand the approaches you've considered and how we landed on this one.
Err(e) => e.into(), | ||
Ok(Err(e)) => e, | ||
Ok(Ok(())) => return, | ||
}; | ||
|
||
if let Some(io_err) = e.downcast_ref::<io::Error>() { | ||
if io_err.kind() == io::ErrorKind::BrokenPipe { | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you prefer this to what you had written previously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen the unwrap
we had on the JoinHandle panic every now and then when cancelling a command with Ctrl+C. This gives us a graceful exit.
.stderr(str::contains("DEBUG")); | ||
.success(); | ||
|
||
let stderr_str = std::str::from_utf8(&cmd.get_output().stderr).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would expectorate make this simpler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the same amount of code, since we'd need to scrub all the fields that don't have a stable response value. This method is agnostic on field ordering in the message, which I don't think is something we need to control for a machine-readable format like JSON.
.unwrap() | ||
.arg("--debug") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're not using --debug becuase it gets other stuff? Do we want to make --debug equivalent to RUST_LOG=oxide=debug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good point, filtering to just oxide should be the default for --debug. In the rare event we need the hyper logs we can set RUST_LOG=debug
to override.
@@ -54136,7 +54137,7 @@ impl Client { | |||
/// `baseurl` is the base URL provided to the internal | |||
/// `reqwest::Client`, and should include a scheme and hostname, | |||
/// as well as port and a path stem if applicable. | |||
pub fn new(baseurl: &str) -> Self { | |||
pub fn new(baseurl: &str, inner: crate::LogCtx) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a pretty significant burr that we're imposing on all consumers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how this is resolvable with the current approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, adding an inner type is necessary to retain context with this approach.
// Try to store the request details for retrieval when the requests finishes. | ||
// If we're not in a tokio Task this is not possible, instead log separate | ||
// events for start and finish. | ||
let Some(task_id) = tokio::task::try_id() else { | ||
tracing::debug!( | ||
url = %url, | ||
path = url.path(), | ||
http.request.method = %request.method(), | ||
http.request.body = details.body, | ||
); | ||
return; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems limiting for SDK consumers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not ideal, agreed. I think it will be confusing why some requests have more useful logs than others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be interested to understand the approaches you've considered and how we landed on this one.
Initially I attempted to use reqwest::middleware
for logging oxidecomputer/progenitor#1079 as part of Progenitor, but this had the downside of making that a dependency of all consumers of that library and we decided against that approach.
Moving over to using the pre/post hooks, I have a hard requirement that we be able to track request latency and accurately track which request has failed. This means we need some way of relating the start and end of a request from separate hooks.
Initially I considered adding a header to the request, but this was untenable because we will not be able to access the headers if a request fails. This has the obvious downside of not being able to log a complete event on failure, but also we cannot reliably remove the request start details from LogCtx
. On a long-lived process this could lead to significant memory leaks.
I looked into adding a task local identifier, but that wasn't feasible as we would need a larger scope to feed that into the hooks.
This led me to task ID, which this PR landed on, but this has the downside of not being in all circumstances. If the task ID is not present, we avoid leaking events when it isn't available by logging separate start and end events, but we lose duration and cannot associate a request body with the failure, just the URL. For requests like OxQL queries, we need to know the body to understand the failure.
One possibility that occurred to me after sending this PR is to add yet another hook to Progenitor, the exec_hook
, which will allow users wrap the actual execution of the request. The hook would take the request, client, and an FnMut
that executes the actual request. This can replicate all the functionality of the pre/post hooks, while providing more control to users.
Beyond logging, you could implement retries, perform a dry-run, or any arbitrary action.
async fn exec_hook<F>(
request: reqwest::Request,
client: &reqwest::Client,
mut f: F, // Executes the request
) -> Result<reqwest::Response, reqwest::Error>
where
F: FnMut(
reqwest::Request,
&reqwest::Client,
) -> Pin<
Box<dyn Future<Output = Result<reqwest::Response, reqwest::Error>> + Send + 'static>,
>,
{
// Do stuff before exec
f(request, client).await // Maybe add tracing::instrument to the future.
// Do stuff after exec, retry, etc.
}
Err(e) => e.into(), | ||
Ok(Err(e)) => e, | ||
Ok(Ok(())) => return, | ||
}; | ||
|
||
if let Some(io_err) = e.downcast_ref::<io::Error>() { | ||
if io_err.kind() == io::ErrorKind::BrokenPipe { | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen the unwrap
we had on the JoinHandle panic every now and then when cancelling a command with Ctrl+C. This gives us a graceful exit.
.unwrap() | ||
.arg("--debug") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good point, filtering to just oxide should be the default for --debug. In the rare event we need the hyper logs we can set RUST_LOG=debug
to override.
// Try to store the request details for retrieval when the requests finishes. | ||
// If we're not in a tokio Task this is not possible, instead log separate | ||
// events for start and finish. | ||
let Some(task_id) = tokio::task::try_id() else { | ||
tracing::debug!( | ||
url = %url, | ||
path = url.path(), | ||
http.request.method = %request.method(), | ||
http.request.body = details.body, | ||
); | ||
return; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not ideal, agreed. I think it will be confusing why some requests have more useful logs than others.
.stderr(str::contains("DEBUG")); | ||
.success(); | ||
|
||
let stderr_str = std::str::from_utf8(&cmd.get_output().stderr).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the same amount of code, since we'd need to scrub all the fields that don't have a stable response value. This method is agnostic on field ordering in the message, which I don't think is something we need to control for a machine-readable format like JSON.
The common case for users using `--debug` will be to observe the logs generated by the SDK, not the events from hyper and reqwest. If all logs are needed, we can advise them to use `RUST_LOG=debug` to enable debug logs for all components. Set `--debug` to set the log env filter to `oxide=debug`.
Co-authored-by: Adam Leventhal <ahl@oxide.computer>
I don't disagree with the desire to have this information, but I would observe that these kind of assertions aren't helpful for understanding the rationale or broader utility.
I'm not sure I follow.
This seems pretty tricky for customer consumers of the SDK to understand. Re: |
I thought I might be able to use a task_local! to create an object that would carry over from the pre_hook to the post_hook. However, the value populated into tokio::task_local! {
static LOG: LogDetails;
}
fn pre_hook() {
let details = LogDetails::new()
LOG.scope(details, async move {
// inner value of LOG only accessible here.
});
}
fn post_hook() {
let details = LOG.get(); // panic, inner value of LOG was dropped in pre_hook.
} I don't think the |
Currently only the
auth
CLI subcommand will log any request details when debug logging is enabled. Adding debug logging for all requests will enable users to troubleshoot issues on any API endpoint.To do this, we add add pre and post-request hooks via Progenitor. However, this does not lend itself to tracking the duration of the request, or creating a single log event with the details from the the
Request
andResponse
.We add a
LogCtx
inner type to the client to act as a store while requests are in flight, but need something to associate the start and end. We could inject a header into the request, but on failure this will not be available to us.Instead, we use the tokio
task::Id
as an identifier, which is unique within a given runtime. This raises the possibility of a client shared between runtimes associating the wrong start with a finished request. We do not have a way to detect which runtime a task is associated with,runtime::Id
is an unstable feature, but we will log a warning if we find a response without a corresponding request, along with the response details. Should a request be run outside of a task, we fall back to logging separate events for start and end.We also create a
Span
that includes the log event. This is ignored by the CLI, but may be useful for other consumers of the SDK that want to integrate it with their monitoring infrastructure.An example event:
Closes #1014