Test oximeter metrics end to end #855

Merged: 15 commits, Mar 14, 2025
Changes from 11 commits
13 changes: 13 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

3 changes: 3 additions & 0 deletions phd-tests/framework/Cargo.toml
@@ -16,13 +16,16 @@ bhyve_api.workspace = true
camino = { workspace = true, features = ["serde1"] }
cfg-if.workspace = true
cpuid_utils.workspace = true
dropshot.workspace = true
errno.workspace = true
fatfs.workspace = true
futures.workspace = true
flate2.workspace = true
hex.workspace = true
libc.workspace = true
newtype_derive.workspace = true
omicron-common.workspace = true
oximeter.workspace = true
propolis-client.workspace = true
reqwest = { workspace = true, features = ["blocking"] }
ring.workspace = true
36 changes: 35 additions & 1 deletion phd-tests/framework/src/test_vm/environment.rs
@@ -16,15 +16,31 @@ pub enum VmLocation {
// TODO: Support remote VMs.
}

/// Specifies where test VMs should report metrics to, if anywhere.
#[derive(Clone, Copy, Debug)]
pub enum MetricsLocation {
/// Oximeter metrics should be reported to a server colocated with the test
/// VM to be started.
Local,
// When the time comes to support remote VMs, it will presumably be useful
// to have local and (perhaps multiple) remote VMs report metrics to the
// same server. But we don't support remote VMs yet.
}
Contributor commented on lines +25 to +28:
Multi-host PHD is obviously not very well fleshed-out yet, but I think the way I would develop it might make this enum obsolete. My 30,000-foot sketch of this is

  • of the several machines conducting a multi-host PHD run, one is primary and the others are secondary
  • the primary machine runs phd-runner and executes the actual test cases
  • the secondary machines run a phd-agent that exposes a Dropshot API that handles the tasks that need to be performed on remote machines (e.g. starting/stopping Propolis servers)
  • when invoking the runner on the primary, one passes the addresses of the agent servers

IIUC, even with an architecture like this, the fake Oximeter server still needs to be "local," even if the metrics are coming from a secondary host: submitted samples need to show up in the runner's Framework so that they can be queried by the tests running in that process.

This is obviated somewhat by switching to Clickhouse, I think, since then (I presume) there would be a single Clickhouse server instance that all PHD processes (both runner and agent) could insert data into, and the framework on the runner would just query that server. Even so, it still seems possible in that case for remote Propolis processes just to submit their samples to the collector in the primary runner.

WDYT?

Member Author replied:
i was figuring in a case like that we'd still use Local to send localhost:{oximeter_port} instead of something fancier, but overall that would boil it down to "wants oximeter: yes/no". i do think the plumbing to express that here today is a bit more awkward.. we don't really have a clean "set up dependent services" -> "start propolis-server" split in PHD and that does get to be more reorganizing than i want to do here 😅


#[derive(Clone, Debug)]
pub struct EnvironmentSpec {
pub(crate) location: VmLocation,
pub(crate) propolis_artifact: String,
pub(crate) metrics: Option<MetricsLocation>,
}

impl EnvironmentSpec {
pub(crate) fn new(location: VmLocation, propolis_artifact: &str) -> Self {
Self { location, propolis_artifact: propolis_artifact.to_owned() }
Self {
location,
propolis_artifact: propolis_artifact.to_owned(),
metrics: None,
}
}

pub fn location(&mut self, location: VmLocation) -> &mut Self {
@@ -37,6 +53,11 @@ impl EnvironmentSpec {
self
}

pub fn metrics(&mut self, metrics: Option<MetricsLocation>) -> &mut Self {
self.metrics = metrics;
self
}

pub(crate) async fn build<'a>(
&self,
framework: &'a Framework,
@@ -75,13 +96,26 @@ impl<'a> Environment<'a> {
.port_allocator
.next()
.context("getting VNC server port")?;
let metrics_addr = builder.metrics.and_then(|m| match m {
MetricsLocation::Local => {
// If the test requests local metrics, we'll start
// an Oximeter stand-in for this VM when setting up this
// environment later. `start_local_vm` will patch in the
// actual server address when it has created one.
//
// If the VM is to be started remotely but requests
// "Local" metrics, that's probably an error.
None
}
});
let params = ServerProcessParameters {
server_path: propolis_server,
data_dir: framework.tmp_directory.as_path(),
server_addr: SocketAddrV4::new(
Ipv4Addr::new(127, 0, 0, 1),
server_port,
),
metrics_addr,
vnc_addr: SocketAddrV4::new(
Ipv4Addr::new(127, 0, 0, 1),
vnc_port,
200 changes: 200 additions & 0 deletions phd-tests/framework/src/test_vm/metrics.rs
@@ -0,0 +1,200 @@
// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

use std::net::SocketAddr;
use std::time::Duration;

use dropshot::{
endpoint, ApiDescription, ConfigDropshot, HttpError, HttpResponseCreated,
HttpServer, HttpServerStarter, RequestContext, TypedBody,
};
use omicron_common::api::internal::nexus::{
ProducerEndpoint, ProducerKind, ProducerRegistrationResponse,
};
use oximeter::types::ProducerResults;
use slog::{Drain, Logger};
use tokio::sync::watch;
use tracing::trace;
use uuid::Uuid;

// Re-registration interval for tests. A long value here helps avoid log spew
// from Oximeter, which will re-register after about 1/6th of this interval
// elapses.
const INTERVAL: Duration = Duration::from_secs(300);

fn test_logger() -> Logger {
let dec = slog_term::PlainSyncDecorator::new(slog_term::TestStdoutWriter);
let drain = slog_term::FullFormat::new(dec).build().fuse();
Logger::root(drain, slog::o!("component" => "fake-cleanup-task"))
}
Contributor commented:
In CI runs this causes non-Bunyan-formatted output to be interleaved with the regular PHD logs (see e.g. lines 892-894 of this log).

It'd be nice to have this obey the logging rules from the PHD runner. It might be a bit of a headache to plumb the emit_bunyan setting all the way down here, but I think it's probably better than the alternatives (e.g. putting framework Oximeter logs into a separate log file). The way I did this for the propolis-server logging params is to pass them from the test runner to the Framework, which packs them into a ServerProcessParameters that get passed to start_local_vm.

Member Author replied:
i hadn't realized that we take emit_bunyan or plumb it anywhere in the first place! e88d07c does the plumbing here, and it's.. not all that bad really!

the crux of it is that it feels like the format for log output probably should flow from what the caller of phd-runner wanted, where we take --emit-bunyan. as-is propolis-server and Crucible guess the right output formats based on atty::is(), but the plumbing for this preference through PHD itself isn't too bad.

after looking at the logs from phd-run output i think it'd be interesting and useful to have separate log files for the fake Oximeter server, but that should also come with reorganizing the logs into per-test directories so we could have phd/migration_smoke/propolis.{stderr,stdout}.log and phd/migration_smoke/oximeter.log and ... etc. bonus being that Crucible disks are just a top-level mash of crucible-{uuid}-{port} and have fun cross-referencing that with tests... and hey, we could plumb that per-test log root in LogConfig!


struct OximeterProducerInfo {
addr: std::net::SocketAddr,
uuid: Uuid,
}

pub(crate) struct FakeOximeterServer {
server: HttpServer<FakeOximeterServerState>,
}

pub(crate) struct FakeOximeterServerState {
sampler_sender: watch::Sender<Option<OximeterProducerInfo>>,
sampler: watch::Receiver<Option<OximeterProducerInfo>>,
}

impl FakeOximeterServer {
pub fn local_addr(&self) -> SocketAddr {
self.server.local_addr()
}

pub fn sampler(&self) -> FakeOximeterSampler {
FakeOximeterSampler {
sampler: self.server.app_private().sampler.clone(),
}
}
}

pub struct FakeOximeterSampler {
sampler: watch::Receiver<Option<OximeterProducerInfo>>,
}

impl FakeOximeterServerState {
fn new() -> Self {
let (tx, rx) = watch::channel(None);

Self { sampler_sender: tx, sampler: rx }
}

async fn set_producer_info(&self, info: ProducerEndpoint) {
// We just don't know what to do with other ProducerKinds yet, if or
// when we'll see them here.
assert_eq!(info.kind, ProducerKind::Instance);

let new_sampler =
OximeterProducerInfo { addr: info.address, uuid: info.id };

// There should always be at least one Receiver on the channel since we
// hold one in `self`.
self.sampler_sender
.send(Some(new_sampler))
.expect("channel is subscribed");
}
}

impl FakeOximeterSampler {
/// Sample Propolis' Oximeter metrics, taking some function that determines
/// if a sample is satisfactory for the caller to proceed with.
///
/// `wait_for_propolis_stats` will poll the corresponding Oximeter producer
/// and call `f` with each returned set of results.
///
/// Panics if `f` does not return `Some` after some number of retries and
/// `ProducerResults` updates.
pub async fn wait_for_propolis_stats<U>(
&self,
f: impl Fn(ProducerResults) -> Option<U>,
) -> U {
let result = backoff::future::retry(
backoff::ExponentialBackoff {
max_interval: Duration::from_secs(1),
max_elapsed_time: Some(Duration::from_secs(10)),
..Default::default()
},
|| async {
let producer_results = self.sample_propolis_stats().await
.map_err(backoff::Error::transient)?;

if let Some(metrics) = f(producer_results) {
Ok(metrics)
} else {
Err(backoff::Error::transient(anyhow::anyhow!(
"full metrics sample not available or fresh enough (yet?)"
)))
}
},
)
.await;

result.expect("propolis-server Oximeter stats should become available")
}

/// Sample Propolis' Oximeter metrics from its registered producer.
///
/// Waits for a producer to register, then fetches and decodes its current
/// `ProducerResults`, returning an error if the response cannot be decoded.
async fn sample_propolis_stats(
&self,
) -> Result<ProducerResults, anyhow::Error> {
let metrics_url = {
self.sampler
.clone()
.wait_for(Option::is_some)
.await
.expect("can recv");
let sampler = self.sampler.borrow();
let stats = sampler.as_ref().expect("sampler does not become None");
format!("http://{}/{}", stats.addr, stats.uuid)
};
let res = reqwest::Client::new()
.get(metrics_url)
.send()
.await
.expect("can send oximeter stats request");
assert!(
res.status().is_success(),
"failed to fetch stats from propolis-server"
);
trace!(?res, "got stats response");
Ok(res.json::<ProducerResults>().await?)
}
}

// Stub functionality for our fake Nexus that test Oximeter producers
// (`propolis-server`) will register with.
#[endpoint {
method = POST,
path = "/metrics/producers",
}]
async fn register_producer(
rqctx: RequestContext<FakeOximeterServerState>,
producer_info: TypedBody<ProducerEndpoint>,
) -> Result<HttpResponseCreated<ProducerRegistrationResponse>, HttpError> {
let info = producer_info.into_inner();
trace!(?info, "producer registration");
rqctx.context().set_producer_info(info).await;

Ok(HttpResponseCreated(ProducerRegistrationResponse {
lease_duration: INTERVAL,
}))
}

// Start a Dropshot server mocking the Oximeter registration endpoint we would
// expect from Nexus.
pub fn spawn_fake_oximeter_server() -> FakeOximeterServer {
let log = test_logger();

let mut api = ApiDescription::new();
api.register(register_producer).expect("Expected to register endpoint");
let server = HttpServerStarter::new(
&ConfigDropshot {
bind_address: "[::1]:0".parse().unwrap(),
request_body_max_bytes: 2048,
..Default::default()
},
api,
FakeOximeterServerState::new(),
&log,
)
.expect("Expected to start Dropshot server")
.start();

slog::info!(
log,
"fake nexus test server listening";
"address" => ?server.local_addr(),
);

FakeOximeterServer { server }
}