Skip to content

Commit 910c81e

Browse files
authored
PHD: add tests for migration of running processes (#623)
In real life, it's likely that VM migration may occur while one or more user processes are running on the guest. We don't have a lot of testing for what happens to userspace processes during migration. Therefore, I've added some tests that create a process on the guest, write a bunch of data into variables in memory, suspend the process, migrate the guest, and attempt to read the data back out of memory on the new host and check that it's still correct. In addition to a simple test that performs one migration, I've also added tests for correctness in the face of a _failed_ migration attempt. These tests force a migration attempt to fail while a process is running, and then check that the process hasn't been messed up after a successful second migration attempt. In order facilitate testing migration after a failed migration, this PR introduces a test-only virtual device which can be configured to fail during migration, either while exporting the device or while importing it. The device can be configured to fail a set number of times, to allow tests that force a migration failure and then perform a subsequent migration that does not fail. Since a virtual device whose only purpose is to break Propolis seems like something that we don't want to create in production, `propolis-server` will only permit the migration-failure test device to be configured in the TOML config file, and it is *not* added to the `InstanceSpec` public API. `propolis-server` also ignores any migration-failure devices configured in the TOML if it was built with the "omicron-build" feature, so we won't accidentally inject failures into release builds. The primary intention behind these tests is to ensure that all dirty pages of the guest's memory are migrated correctly (see #387). With the current approach to RAM transfer, we don't run into issues with this, even in the case of a failed migration, because the `RamPushPrePause` phase always offers *all* guest RAM, regardless of whether it's dirty. However, if we change the RAM transfer behavior to only offer dirty pages both in the `RamPushPrePause` and `RamPushPostPause` migration phases, the migration-failure tests will now fail, because the failed migration will have cleared dirty bits on any dirty guest pages, and we now don't transfer the guest's memory correctly on the second migration. These tests will allow us to verify that a fix for #387, where we no longer offer all RAM in the `RamPushPrePause` phase, is working correctly.
1 parent bc2df86 commit 910c81e

File tree

10 files changed

+436
-14
lines changed

10 files changed

+436
-14
lines changed

bin/propolis-server/src/lib/initializer.rs

+52-4
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,10 @@ use std::os::unix::fs::FileTypeExt;
1010
use std::sync::Arc;
1111
use std::time::{SystemTime, UNIX_EPOCH};
1212

13+
use crate::serial::Serial;
14+
use crate::server::CrucibleBackendMap;
1315
use crucible_client_types::VolumeConstructionRequest;
16+
pub use nexus_client::Client as NexusClient;
1417
use oximeter::types::ProducerRegistry;
1518
use propolis::block;
1619
use propolis::chardev::{self, BlockingSource, Source};
@@ -31,10 +34,6 @@ use propolis::vmm::{self, Builder, Machine};
3134
use propolis_api_types::instance_spec::{self, v0::InstanceSpecV0};
3235
use slog::info;
3336

34-
use crate::serial::Serial;
35-
use crate::server::CrucibleBackendMap;
36-
pub use nexus_client::Client as NexusClient;
37-
3837
use anyhow::{Context, Result};
3938

4039
// Arbitrary ROM limit for now
@@ -575,6 +574,55 @@ impl<'a> MachineInitializer<'a> {
575574
Ok(())
576575
}
577576

577+
#[cfg(not(feature = "omicron-build"))]
578+
pub fn initialize_test_devices(
579+
&self,
580+
toml_cfg: &std::collections::BTreeMap<
581+
String,
582+
propolis_server_config::Device,
583+
>,
584+
) -> Result<(), Error> {
585+
use propolis::hw::testdev::{
586+
MigrationFailureDevice, MigrationFailures,
587+
};
588+
589+
if let Some(dev) = toml_cfg.get(MigrationFailureDevice::NAME) {
590+
const FAIL_EXPORTS: &str = "fail_exports";
591+
const FAIL_IMPORTS: &str = "fail_imports";
592+
let fail_exports = dev
593+
.options
594+
.get(FAIL_EXPORTS)
595+
.and_then(|val| val.as_integer())
596+
.unwrap_or(0);
597+
let fail_imports = dev
598+
.options
599+
.get(FAIL_IMPORTS)
600+
.and_then(|val| val.as_integer())
601+
.unwrap_or(0);
602+
603+
if fail_exports <= 0 && fail_imports <= 0 {
604+
info!(
605+
self.log,
606+
"migration failure device will not fail, as both
607+
`{FAIL_EXPORTS}` and `{FAIL_IMPORTS}` are 0";
608+
FAIL_EXPORTS => ?fail_exports,
609+
FAIL_IMPORTS => ?fail_imports,
610+
);
611+
}
612+
613+
let dev = MigrationFailureDevice::create(
614+
&self.log,
615+
MigrationFailures {
616+
exports: fail_exports as usize,
617+
imports: fail_imports as usize,
618+
},
619+
);
620+
self.inv.register(&dev)?;
621+
}
622+
623+
Ok(())
624+
}
625+
578626
#[cfg(feature = "falcon")]
579627
pub fn initialize_softnpu_ports(
580628
&self,

bin/propolis-server/src/lib/server.rs

+2-5
Original file line numberDiff line numberDiff line change
@@ -497,18 +497,15 @@ async fn instance_ensure_common(
497497
// admittedly a big kludge until this can be better refactored.
498498
let vm = {
499499
let properties = properties.clone();
500-
let use_reservoir = server_context.static_config.use_reservoir;
501-
let bootrom = server_context.static_config.vm.bootrom.clone();
500+
let server_context = server_context.clone();
502501
let log = server_context.log.clone();
503502
let hdl = tokio::runtime::Handle::current();
504503
let ctrl_hdl = hdl.clone();
505-
506504
let vm_hdl = hdl.spawn_blocking(move || {
507505
VmController::new(
508506
instance_spec,
509507
properties,
510-
use_reservoir,
511-
bootrom,
508+
&server_context.static_config,
512509
producer_registry,
513510
nexus_client,
514511
log,

bin/propolis-server/src/lib/vm/mod.rs

+12-3
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,6 @@ use std::{
3535
collections::{BTreeMap, VecDeque},
3636
fmt::Debug,
3737
net::SocketAddr,
38-
path::PathBuf,
3938
pin::Pin,
4039
sync::{Arc, Condvar, Mutex, Weak},
4140
task::{Context, Poll},
@@ -66,6 +65,7 @@ use crate::{
6665
initializer::{build_instance, MachineInitializer},
6766
migrate::MigrateError,
6867
serial::Serial,
68+
server::StaticConfig,
6969
vm::request_queue::ExternalRequest,
7070
};
7171

@@ -398,14 +398,14 @@ impl VmController {
398398
pub fn new(
399399
instance_spec: VersionedInstanceSpec,
400400
properties: InstanceProperties,
401-
use_reservoir: bool,
402-
bootrom: PathBuf,
401+
&StaticConfig { vm: ref toml_config, use_reservoir, .. }: &StaticConfig,
403402
oximeter_registry: Option<ProducerRegistry>,
404403
nexus_client: Option<NexusClient>,
405404
log: Logger,
406405
runtime_hdl: tokio::runtime::Handle,
407406
stop_ch: oneshot::Sender<()>,
408407
) -> anyhow::Result<Arc<Self>> {
408+
let bootrom = &toml_config.bootrom;
409409
info!(log, "initializing new VM";
410410
"spec" => #?instance_spec,
411411
"properties" => #?properties,
@@ -460,6 +460,15 @@ impl VmController {
460460
init.initialize_qemu_debug_port()?;
461461
init.initialize_qemu_pvpanic(properties.id)?;
462462
init.initialize_network_devices(&chipset)?;
463+
464+
#[cfg(not(feature = "omicron-build"))]
465+
init.initialize_test_devices(&toml_config.devices)?;
466+
#[cfg(feature = "omicron-build")]
467+
info!(
468+
log,
469+
"`omicron-build` feature enabled, ignoring any test devices"
470+
);
471+
463472
#[cfg(feature = "falcon")]
464473
init.initialize_softnpu_ports(&chipset)?;
465474
#[cfg(feature = "falcon")]

crates/propolis-server-config/src/lib.rs

+1-1
Original file line numberDiff line numberDiff line change
@@ -82,7 +82,7 @@ pub struct PciBridge {
8282

8383
/// A hard-coded device, either enabled by default or accessible locally
8484
/// on a machine.
85-
#[derive(Serialize, Deserialize, Debug, PartialEq)]
85+
#[derive(Clone, Serialize, Deserialize, Debug, PartialEq)]
8686
pub struct Device {
8787
pub driver: String,
8888

lib/propolis/src/hw/mod.rs

+1
Original file line numberDiff line numberDiff line change
@@ -10,5 +10,6 @@ pub mod nvme;
1010
pub mod pci;
1111
pub mod ps2;
1212
pub mod qemu;
13+
pub mod testdev;
1314
pub mod uart;
1415
pub mod virtio;

lib/propolis/src/hw/testdev.rs

+121
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,121 @@
1+
// This Source Code Form is subject to the terms of the Mozilla Public
2+
// License, v. 2.0. If a copy of the MPL was not distributed with this
3+
// file, You can obtain one at https://mozilla.org/MPL/2.0/.
4+
5+
//! Devices intended for testing purposes.
6+
//!
7+
//! These devices do things which are generally unwanted in real life, such as
8+
//! "intentionally breaking Propolis", "intentionally breaking the guest OS", or
9+
//! some combination of the two.
10+
use std::sync::{
11+
atomic::{AtomicUsize, Ordering},
12+
Arc,
13+
};
14+
15+
use crate::inventory::Entity;
16+
use crate::migrate::*;
17+
18+
use serde::{Deserialize, Serialize};
19+
use slog::info;
20+
21+
/// A test device for simulating migration failures.
22+
pub struct MigrationFailureDevice {
23+
log: slog::Logger,
24+
exports: AtomicUsize,
25+
imports: AtomicUsize,
26+
fail: MigrationFailures,
27+
}
28+
29+
pub struct MigrationFailures {
30+
pub exports: usize,
31+
pub imports: usize,
32+
}
33+
34+
#[derive(Clone, Default, Deserialize, Serialize)]
35+
struct MigrationFailurePayloadV1 {}
36+
37+
impl MigrationFailureDevice {
38+
pub const NAME: &'static str = "test-migration-failure";
39+
40+
pub fn create(log: &slog::Logger, fail: MigrationFailures) -> Arc<Self> {
41+
let log =
42+
log.new(slog::o!("component" => "testdev", "dev" => Self::NAME));
43+
info!(log,
44+
"Injecting simulated migration failures";
45+
"fail_exports" => %fail.exports,
46+
"fail_imports" => %fail.imports,
47+
);
48+
Arc::new(Self {
49+
log,
50+
exports: AtomicUsize::new(0),
51+
imports: AtomicUsize::new(0),
52+
fail,
53+
})
54+
}
55+
}
56+
57+
impl Entity for MigrationFailureDevice {
58+
fn type_name(&self) -> &'static str {
59+
MigrationFailureDevice::NAME
60+
}
61+
fn migrate(&self) -> Migrator {
62+
Migrator::Single(self)
63+
}
64+
}
65+
66+
impl MigrateSingle for MigrationFailureDevice {
67+
fn export(
68+
&self,
69+
_ctx: &MigrateCtx,
70+
) -> Result<PayloadOutput, MigrateStateError> {
71+
let export_num = self.exports.fetch_add(1, Ordering::Relaxed);
72+
if export_num < self.fail.exports {
73+
info!(
74+
self.log,
75+
"failing export";
76+
"export_num" => %export_num,
77+
"fail_exports" => %self.fail.exports
78+
);
79+
return Err(MigrateStateError::Io(std::io::Error::new(
80+
std::io::ErrorKind::Other,
81+
"somebody set up us the bomb",
82+
)));
83+
}
84+
85+
info!(
86+
self.log,
87+
"exporting device";
88+
"export_num" => %export_num,
89+
);
90+
Ok(MigrationFailurePayloadV1 {}.into())
91+
}
92+
93+
fn import(
94+
&self,
95+
mut offer: PayloadOffer,
96+
_ctx: &MigrateCtx,
97+
) -> Result<(), MigrateStateError> {
98+
let import_num = self.imports.fetch_add(1, Ordering::Relaxed);
99+
let fail = import_num < self.fail.imports;
100+
info!(
101+
self.log,
102+
"importing device";
103+
"import_num" => %import_num,
104+
"will_fail" => %fail,
105+
);
106+
let MigrationFailurePayloadV1 {} = offer.parse()?;
107+
if fail {
108+
info!(self.log, "failing import");
109+
return Err(MigrateStateError::ImportFailed(
110+
"you have no chance to survive, make your time".to_string(),
111+
));
112+
}
113+
Ok(())
114+
}
115+
}
116+
117+
impl Schema<'_> for MigrationFailurePayloadV1 {
118+
fn id() -> SchemaId {
119+
("testdev-migration-failure", 1)
120+
}
121+
}

phd-tests/framework/src/test_vm/config.rs

+36
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
// License, v. 2.0. If a copy of the MPL was not distributed with this
33
// file, You can obtain one at https://mozilla.org/MPL/2.0/.
44

5+
use std::collections::BTreeMap;
56
use std::sync::Arc;
67

78
use anyhow::Context;
@@ -44,8 +45,11 @@ pub struct VmConfig {
4445
bootrom_artifact: String,
4546
boot_disk: DiskRequest,
4647
data_disks: Vec<DiskRequest>,
48+
devices: BTreeMap<String, propolis_server_config::Device>,
4749
}
4850

51+
const MIGRATION_FAILURE_DEVICE: &str = "test-migration-failure";
52+
4953
impl VmConfig {
5054
pub(crate) fn new(
5155
vm_name: &str,
@@ -68,6 +72,7 @@ impl VmConfig {
6872
bootrom_artifact: bootrom.to_owned(),
6973
boot_disk,
7074
data_disks: Vec::new(),
75+
devices: BTreeMap::new(),
7176
}
7277
}
7378

@@ -86,6 +91,29 @@ impl VmConfig {
8691
self
8792
}
8893

94+
pub fn named(&mut self, name: impl ToString) -> &mut Self {
95+
self.vm_name = name.to_string();
96+
self
97+
}
98+
99+
pub fn fail_migration_exports(&mut self, exports: u32) -> &mut Self {
100+
self.devices
101+
.entry(MIGRATION_FAILURE_DEVICE.to_owned())
102+
.or_insert_with(default_migration_failure_device)
103+
.options
104+
.insert("fail_exports".to_string(), exports.into());
105+
self
106+
}
107+
108+
pub fn fail_migration_imports(&mut self, imports: u32) -> &mut Self {
109+
self.devices
110+
.entry(MIGRATION_FAILURE_DEVICE.to_owned())
111+
.or_insert_with(default_migration_failure_device)
112+
.options
113+
.insert("fail_imports".to_string(), imports.into());
114+
self
115+
}
116+
89117
pub fn boot_disk(
90118
&mut self,
91119
artifact: &str,
@@ -132,6 +160,7 @@ impl VmConfig {
132160
let config_toml_contents =
133161
toml::ser::to_string(&propolis_server_config::Config {
134162
bootrom: bootrom.clone().into(),
163+
devices: self.devices.clone(),
135164
..Default::default()
136165
})
137166
.context("serializing Propolis server config")?;
@@ -241,3 +270,10 @@ impl VmConfig {
241270
})
242271
}
243272
}
273+
274+
fn default_migration_failure_device() -> propolis_server_config::Device {
275+
propolis_server_config::Device {
276+
driver: MIGRATION_FAILURE_DEVICE.to_owned(),
277+
options: Default::default(),
278+
}
279+
}

phd-tests/framework/src/test_vm/mod.rs

+1-1
Original file line numberDiff line numberDiff line change
@@ -231,7 +231,7 @@ impl TestVm {
231231
self.spec.clone()
232232
}
233233

234-
pub(crate) fn environment_spec(&self) -> EnvironmentSpec {
234+
pub fn environment_spec(&self) -> EnvironmentSpec {
235235
self.environment_spec.clone()
236236
}
237237

0 commit comments

Comments
 (0)