-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test oximeter metrics end to end #855
Changes from 11 commits
9de65a2
2c50316
998f5a3
cea54ec
5319622
fe6ab88
f8e5901
20e0d3f
c9c348a
be7f609
61734a1
32a359a
e88d07c
e42be60
7bbbee3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,200 @@ | ||
// This Source Code Form is subject to the terms of the Mozilla Public | ||
// License, v. 2.0. If a copy of the MPL was not distributed with this | ||
// file, You can obtain one at https://mozilla.org/MPL/2.0/. | ||
|
||
use std::net::SocketAddr; | ||
iximeow marked this conversation as resolved.
Show resolved
Hide resolved
|
||
use std::time::Duration; | ||
|
||
use dropshot::{ | ||
endpoint, ApiDescription, ConfigDropshot, HttpError, HttpResponseCreated, | ||
HttpServer, HttpServerStarter, RequestContext, TypedBody, | ||
}; | ||
use omicron_common::api::internal::nexus::{ | ||
ProducerEndpoint, ProducerKind, ProducerRegistrationResponse, | ||
}; | ||
use oximeter::types::ProducerResults; | ||
use slog::{Drain, Logger}; | ||
use tokio::sync::watch; | ||
use tracing::trace; | ||
use uuid::Uuid; | ||
|
||
// Re-registration interval for tests. A long value here helps avoid log spew | ||
// from Oximeter, which will re-register after about 1/6th of this interval | ||
// elapses. | ||
const INTERVAL: Duration = Duration::from_secs(300); | ||
|
||
fn test_logger() -> Logger { | ||
let dec = slog_term::PlainSyncDecorator::new(slog_term::TestStdoutWriter); | ||
let drain = slog_term::FullFormat::new(dec).build().fuse(); | ||
Logger::root(drain, slog::o!("component" => "fake-cleanup-task")) | ||
iximeow marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In CI runs this causes non-Bunyan-formatted output to be interleaved with the regular PHD logs (see e.g. lines 892-894 of this log). It'd be nice to have this obey the logging rules from the PHD runner. It might be a bit of a headache to plumb the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i hadn't realized that we take the crux of it is that it feels like the format for log output probably should flow from what the caller of after looking at the logs from |
||
|
||
struct OximeterProducerInfo { | ||
addr: std::net::SocketAddr, | ||
uuid: Uuid, | ||
} | ||
|
||
pub(crate) struct FakeOximeterServer { | ||
server: HttpServer<FakeOximeterServerState>, | ||
} | ||
|
||
pub(crate) struct FakeOximeterServerState { | ||
sampler_sender: watch::Sender<Option<OximeterProducerInfo>>, | ||
sampler: watch::Receiver<Option<OximeterProducerInfo>>, | ||
} | ||
|
||
impl FakeOximeterServer { | ||
pub fn local_addr(&self) -> SocketAddr { | ||
self.server.local_addr() | ||
} | ||
|
||
pub fn sampler(&self) -> FakeOximeterSampler { | ||
FakeOximeterSampler { | ||
sampler: self.server.app_private().sampler.clone(), | ||
} | ||
} | ||
} | ||
|
||
pub struct FakeOximeterSampler { | ||
sampler: watch::Receiver<Option<OximeterProducerInfo>>, | ||
} | ||
|
||
impl FakeOximeterServerState { | ||
fn new() -> Self { | ||
let (tx, rx) = watch::channel(None); | ||
|
||
Self { sampler_sender: tx, sampler: rx } | ||
} | ||
|
||
async fn set_producer_info(&self, info: ProducerEndpoint) { | ||
// Just don't know what to do with other ProducerKinds, if or when we'll | ||
// see them here.. | ||
assert_eq!(info.kind, ProducerKind::Instance); | ||
|
||
let new_sampler = | ||
OximeterProducerInfo { addr: info.address, uuid: info.id }; | ||
|
||
// There should always be at least one Receiver on the channel since we | ||
// hold one in `self`. | ||
self.sampler_sender | ||
.send(Some(new_sampler)) | ||
.expect("channel is subscribed"); | ||
} | ||
} | ||
|
||
impl FakeOximeterSampler { | ||
/// Sample Propolis' Oximeter metrics, taking some function that determines | ||
/// if a sample is satisfactory for the caller to proceed with. | ||
/// | ||
/// `wait_for_propolis_stats` will poll the corresponding Oximeter producer | ||
/// and call `f` with each returned set of results. | ||
/// | ||
/// Panics if `f` does not return `Some` after some number of retries and | ||
/// `ProducerResults` updates. | ||
pub async fn wait_for_propolis_stats<U>( | ||
&self, | ||
f: impl Fn(ProducerResults) -> Option<U>, | ||
) -> U { | ||
let result = backoff::future::retry( | ||
backoff::ExponentialBackoff { | ||
max_interval: Duration::from_secs(1), | ||
max_elapsed_time: Some(Duration::from_secs(10)), | ||
..Default::default() | ||
}, | ||
|| async { | ||
let producer_results = self.sample_propolis_stats().await | ||
.map_err(backoff::Error::transient)?; | ||
|
||
if let Some(metrics) = f(producer_results) { | ||
Ok(metrics) | ||
} else { | ||
Err(backoff::Error::transient(anyhow::anyhow!( | ||
"full metrics sample not available or fresh enough (yet?)" | ||
))) | ||
} | ||
}, | ||
) | ||
.await; | ||
|
||
result.expect("propolis-server Oximeter stats should become available") | ||
} | ||
|
||
/// Sample Propolis' Oximeter metrics, including the timestamp of the oldest | ||
/// metric reflected in the sample. | ||
/// | ||
/// Returns `None` for some kinds of incomplete stats or when no stats are | ||
/// available at all. | ||
async fn sample_propolis_stats( | ||
&self, | ||
) -> Result<ProducerResults, anyhow::Error> { | ||
let metrics_url = { | ||
self.sampler | ||
.clone() | ||
.wait_for(Option::is_some) | ||
.await | ||
.expect("can recv"); | ||
let sampler = self.sampler.borrow(); | ||
let stats = sampler.as_ref().expect("sampler does not become None"); | ||
format!("http://{}/{}", stats.addr, stats.uuid) | ||
}; | ||
let res = reqwest::Client::new() | ||
.get(metrics_url) | ||
.send() | ||
.await | ||
.expect("can send oximeter stats request"); | ||
assert!( | ||
res.status().is_success(), | ||
"failed to fetch stats from propolis-server" | ||
); | ||
trace!(?res, "got stats response"); | ||
Ok(res.json::<ProducerResults>().await?) | ||
} | ||
} | ||
|
||
// Stub functionality for our fake Nexus that test Oximeter produces | ||
iximeow marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// (`propolis-server`) will register with. | ||
#[endpoint { | ||
method = POST, | ||
path = "/metrics/producers", | ||
}] | ||
async fn register_producer( | ||
rqctx: RequestContext<FakeOximeterServerState>, | ||
producer_info: TypedBody<ProducerEndpoint>, | ||
) -> Result<HttpResponseCreated<ProducerRegistrationResponse>, HttpError> { | ||
let info = producer_info.into_inner(); | ||
trace!(?info, "producer registration"); | ||
rqctx.context().set_producer_info(info).await; | ||
|
||
Ok(HttpResponseCreated(ProducerRegistrationResponse { | ||
lease_duration: INTERVAL, | ||
})) | ||
} | ||
|
||
// Start a Dropshot server mocking the Oximeter registration endpoint we would | ||
// expect from Nexus. | ||
pub fn spawn_fake_oximeter_server() -> FakeOximeterServer { | ||
let log = test_logger(); | ||
|
||
let mut api = ApiDescription::new(); | ||
api.register(register_producer).expect("Expected to register endpoint"); | ||
let server = HttpServerStarter::new( | ||
&ConfigDropshot { | ||
bind_address: "[::1]:0".parse().unwrap(), | ||
request_body_max_bytes: 2048, | ||
..Default::default() | ||
}, | ||
api, | ||
FakeOximeterServerState::new(), | ||
&log, | ||
) | ||
.expect("Expected to start Dropshot server") | ||
.start(); | ||
|
||
slog::info!( | ||
log, | ||
"fake nexus test server listening"; | ||
"address" => ?server.local_addr(), | ||
); | ||
|
||
FakeOximeterServer { server } | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multi-host PHD is obviously not very well fleshed-out yet, but I think the way I would develop it might make this enum obsolete. My 30,000-foot sketch of this is
phd-runner
and executes the actual test casesphd-agent
that exposes a Dropshot API that handles the tasks that need to be performed on remote machines (e.g. starting/stopping Propolis servers)IIUC, even with an architecture like this, the fake Oximeter server still needs to be "local," even if the metrics are coming from a secondary host: submitted samples need to show up in the runner's
Framework
so that they can be queried by the tests running in that process.This is obviated somewhat by switching to Clickhouse, I think, since then (I presume) there would be a single Clickhouse server instance that all PHD processes (both runner and agent) could insert data into, and the framework on the runner would just query that server. Even so, it still seems possible in that case for remote Propolis processes just to submit their samples to the collector in the primary runner.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was figuring in a case like that we'd still use
Local
to sendlocalhost:{oximeter_port}
instead of something fancier, but overall that would boil it down to "wants oximeter: yes/no". i do think the plumbing to express that here today is a bit more awkward.. we don't really have a clean "set up dependent services" -> "startpropolis-server
" split in PHD and that does get to be more reorganizing than i want to do here 😅