Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add debug logging for all Oxide API requests #1053

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

wfchandler
Copy link
Contributor

Currently only the auth CLI subcommand will log any request details when debug logging is enabled. Adding debug logging for all requests will enable users to troubleshoot issues on any API endpoint.

To do this, we add add pre and post-request hooks via Progenitor. However, this does not lend itself to tracking the duration of the request, or creating a single log event with the details from the the Request and Response.

We add a LogCtx inner type to the client to act as a store while requests are in flight, but need something to associate the start and end. We could inject a header into the request, but on failure this will not be available to us.

Instead, we use the tokio task::Id as an identifier, which is unique within a given runtime. This raises the possibility of a client shared between runtimes associating the wrong start with a finished request. We do not have a way to detect which runtime a task is associated with, runtime::Id is an unstable feature, but we will log a warning if we find a response without a corresponding request, along with the response details. Should a request be run outside of a task, we fall back to logging separate events for start and end.

We also create a Span that includes the log event. This is ignored by the CLI, but may be useful for other consumers of the SDK that want to integrate it with their monitoring infrastructure.

An example event:

  {
    "timestamp": "2025-03-12T15:25:58.487790Z",
    "level": "DEBUG",
    "message": "request succeeded",
    "url": "https://oxide.sys.r3.oxide-preview.com/v1/disks?project=will",
    "path": "/v1/disks",
    "remote_addr": "45.154.216.35:443",
    "http.request.method": "GET",
    "http.response.content_length": 998,
    "http.response.status_code": 200,
    "start_time": "2025-03-12T15:25:58.048984Z",
    "duration_ms": 438,
    "oxide.request_id": "5d738e7a-ad78-4938-b682-c43f7b2a84be",
    "target": "oxide::tracing"
  }

Closes #1014

@wfchandler wfchandler force-pushed the wc/event-logging branch 2 times, most recently from 0b5b0b3 to 3060f3c Compare March 12, 2025 20:13
Currently only the `auth` CLI subcommand will log any request details
when debug logging is enabled. Adding debug logging for all requests
will enable users to troubleshoot issues on any API endpoint.

To do this, we add add pre and post-request hooks via Progenitor.
However, this does not lend itself to tracking the duration of the
request, or creating a single log event with the details from the the
`Request` and `Response`.

We add a `LogCtx` inner type to the client to act as a store while
requests are in flight, but need something to associate the start and
end. We could inject a header into the request, but on failure this will
not be available to us.

Instead, we use the tokio `task::Id` as an identifier, which is unique
within a given runtime. This raises the possibility of a client shared
between runtimes associating the wrong start with a finished request.
We do not have a way to detect which runtime a task is associated with,
`runtime::Id` is an unstable feature, but we will log a warning if we
find a response without a corresponding request, along with the response
details. When a request is run outside of a task, we fall back to
logging separate events for start and end.

We also create a `Span` that includes the log event. This is ignored by
the CLI, but may be useful for other consumers of the SDK that want to
integrate it with their monitoring infrastructure.

An example event:

  {
    "timestamp": "2025-03-12T15:25:58.487790Z",
    "level": "DEBUG",
    "message": "request succeeded",
    "url": "https://oxide.sys.r3.oxide-preview.com/v1/disks?project=will",
    "path": "/v1/disks",
    "remote_addr": "45.154.216.35:443",
    "http.request.method": "GET",
    "http.response.content_length": 998,
    "http.response.status_code": 200,
    "start_time": "2025-03-12T15:25:58.048984Z",
    "duration_ms": 438,
    "oxide.request_id": "5d738e7a-ad78-4938-b682-c43f7b2a84be",
    "target": "oxide::tracing"
  }

Closes #1014
@wfchandler wfchandler marked this pull request as ready for review March 13, 2025 14:26
@wfchandler wfchandler requested a review from ahl March 13, 2025 14:26
Copy link
Collaborator

@ahl ahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments. I'll be interested to discuss the path forward for this idea with you. I'd be interested to understand the approaches you've considered and how we landed on this one.

Comment on lines +105 to 113
Err(e) => e.into(),
Ok(Err(e)) => e,
Ok(Ok(())) => return,
};

if let Some(io_err) = e.downcast_ref::<io::Error>() {
if io_err.kind() == io::ErrorKind::BrokenPipe {
return;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you prefer this to what you had written previously?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen the unwrap we had on the JoinHandle panic every now and then when cancelling a command with Ctrl+C. This gives us a graceful exit.

.stderr(str::contains("DEBUG"));
.success();

let stderr_str = std::str::from_utf8(&cmd.get_output().stderr).unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would expectorate make this simpler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the same amount of code, since we'd need to scrub all the fields that don't have a stable response value. This method is agnostic on field ordering in the message, which I don't think is something we need to control for a machine-readable format like JSON.

.unwrap()
.arg("--debug")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're not using --debug becuase it gets other stuff? Do we want to make --debug equivalent to RUST_LOG=oxide=debug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, filtering to just oxide should be the default for --debug. In the rare event we need the hyper logs we can set RUST_LOG=debug to override.

@@ -54136,7 +54137,7 @@ impl Client {
/// `baseurl` is the base URL provided to the internal
/// `reqwest::Client`, and should include a scheme and hostname,
/// as well as port and a path stem if applicable.
pub fn new(baseurl: &str) -> Self {
pub fn new(baseurl: &str, inner: crate::LogCtx) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a pretty significant burr that we're imposing on all consumers.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this is resolvable with the current approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, adding an inner type is necessary to retain context with this approach.

Comment on lines +73 to +84
// Try to store the request details for retrieval when the requests finishes.
// If we're not in a tokio Task this is not possible, instead log separate
// events for start and finish.
let Some(task_id) = tokio::task::try_id() else {
tracing::debug!(
url = %url,
path = url.path(),
http.request.method = %request.method(),
http.request.body = details.body,
);
return;
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems limiting for SDK consumers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not ideal, agreed. I think it will be confusing why some requests have more useful logs than others.

Copy link
Contributor Author

@wfchandler wfchandler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be interested to understand the approaches you've considered and how we landed on this one.

Initially I attempted to use reqwest::middleware for logging oxidecomputer/progenitor#1079 as part of Progenitor, but this had the downside of making that a dependency of all consumers of that library and we decided against that approach.

Moving over to using the pre/post hooks, I have a hard requirement that we be able to track request latency and accurately track which request has failed. This means we need some way of relating the start and end of a request from separate hooks.

Initially I considered adding a header to the request, but this was untenable because we will not be able to access the headers if a request fails. This has the obvious downside of not being able to log a complete event on failure, but also we cannot reliably remove the request start details from LogCtx. On a long-lived process this could lead to significant memory leaks.

I looked into adding a task local identifier, but that wasn't feasible as we would need a larger scope to feed that into the hooks.

This led me to task ID, which this PR landed on, but this has the downside of not being in all circumstances. If the task ID is not present, we avoid leaking events when it isn't available by logging separate start and end events, but we lose duration and cannot associate a request body with the failure, just the URL. For requests like OxQL queries, we need to know the body to understand the failure.

One possibility that occurred to me after sending this PR is to add yet another hook to Progenitor, the exec_hook, which will allow users wrap the actual execution of the request. The hook would take the request, client, and an FnMut that executes the actual request. This can replicate all the functionality of the pre/post hooks, while providing more control to users.

Beyond logging, you could implement retries, perform a dry-run, or any arbitrary action.

async fn exec_hook<F>(
    request: reqwest::Request,
    client: &reqwest::Client,
    mut f: F, // Executes the request
) -> Result<reqwest::Response, reqwest::Error>
where
    F: FnMut(
        reqwest::Request,
        &reqwest::Client,
    ) -> Pin<
        Box<dyn Future<Output = Result<reqwest::Response, reqwest::Error>> + Send + 'static>,
    >,
{
    // Do stuff before exec
    f(request, client).await // Maybe add tracing::instrument to the future.
    // Do stuff after exec, retry, etc.
}

Comment on lines +105 to 113
Err(e) => e.into(),
Ok(Err(e)) => e,
Ok(Ok(())) => return,
};

if let Some(io_err) = e.downcast_ref::<io::Error>() {
if io_err.kind() == io::ErrorKind::BrokenPipe {
return;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen the unwrap we had on the JoinHandle panic every now and then when cancelling a command with Ctrl+C. This gives us a graceful exit.

.unwrap()
.arg("--debug")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, filtering to just oxide should be the default for --debug. In the rare event we need the hyper logs we can set RUST_LOG=debug to override.

Comment on lines +73 to +84
// Try to store the request details for retrieval when the requests finishes.
// If we're not in a tokio Task this is not possible, instead log separate
// events for start and finish.
let Some(task_id) = tokio::task::try_id() else {
tracing::debug!(
url = %url,
path = url.path(),
http.request.method = %request.method(),
http.request.body = details.body,
);
return;
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not ideal, agreed. I think it will be confusing why some requests have more useful logs than others.

.stderr(str::contains("DEBUG"));
.success();

let stderr_str = std::str::from_utf8(&cmd.get_output().stderr).unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the same amount of code, since we'd need to scrub all the fields that don't have a stable response value. This method is agnostic on field ordering in the message, which I don't think is something we need to control for a machine-readable format like JSON.

wfchandler and others added 2 commits March 18, 2025 11:04
The common case for users using `--debug` will be to observe the logs
generated by the SDK, not the events from hyper and reqwest. If all logs
are needed, we can advise them to use `RUST_LOG=debug` to enable debug
logs for all components.

Set `--debug` to set the log env filter to `oxide=debug`.
Co-authored-by: Adam Leventhal <ahl@oxide.computer>
@wfchandler wfchandler requested a review from ahl March 18, 2025 15:25
@ahl
Copy link
Collaborator

ahl commented Mar 18, 2025

I have a hard requirement that we be able to track request latency and accurately track which request has failed

I don't disagree with the desire to have this information, but I would observe that these kind of assertions aren't helpful for understanding the rationale or broader utility.

I looked into adding a task local identifier, but that wasn't feasible as we would need a larger scope to feed that into the hooks.

I'm not sure I follow.

This led me to task ID, which this PR landed on, but this has the downside of not being in all circumstances.

This seems pretty tricky for customer consumers of the SDK to understand.

Re: exec_hook -- neat idea! Let me think about it?

@wfchandler
Copy link
Contributor Author

I looked into adding a task local identifier, but that wasn't feasible as we would need a larger scope to feed that into the hooks.

I'm not sure I follow.

I thought I might be able to use a task_local! to create an object that would carry over from the pre_hook to the post_hook. However, the value populated into LocalKey needs to be accessed within that scope like the example below, so there's no way to pass it between the hooks without modifying Progenitor.

tokio::task_local! {
    static LOG: LogDetails;
}

fn pre_hook() {
    let details = LogDetails::new()
    LOG.scope(details, async move {
        // inner value of LOG only accessible here.
    });

}

fn post_hook() {
    let details = LOG.get(); // panic, inner value of LOG was dropped in pre_hook.
}

I don't think the task_local approach is worth pursuing further, to be clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

debug could make the CLI more chatty about API requests
2 participants