Use FIOFFS on illumos instead of fsync #1148

Closed · wants to merge 2 commits
1 change: 1 addition & 0 deletions downstairs/Cargo.toml
@@ -71,3 +71,4 @@ asm = ["usdt/asm"]
default = []
zfs_snapshot = []
integration-tests = [] # Enables creating SQLite volumes
omicron-build = []
21 changes: 21 additions & 0 deletions downstairs/src/extent.rs
@@ -42,6 +42,15 @@ pub(crate) trait ExtentInner: Send + Debug {
job_id: JobOrReconciliationId,
) -> Result<(), CrucibleError>;

/// After both user data and any extent metadata have been flushed to
/// durable storage, perform accounting or clean-up.
fn post_flush(
&mut self,
new_flush: u64,
new_gen: u64,
job_id: JobOrReconciliationId,
) -> Result<(), CrucibleError>;

fn read(
&mut self,
job_id: JobId,
@@ -590,6 +599,18 @@ impl Extent {
inner.flush(new_flush, new_gen, job_id)
}

#[instrument]
pub(crate) async fn post_flush<I: Into<JobOrReconciliationId> + Debug>(
&self,
new_flush: u64,
new_gen: u64,
id: I, // only used for logging
) -> Result<(), CrucibleError> {
let job_id = id.into();
let mut inner = self.inner.lock().await;
inner.post_flush(new_flush, new_gen, job_id)
}

pub async fn get_meta_info(&self) -> ExtentMeta {
let inner = self.inner.lock().await;
ExtentMeta {
68 changes: 46 additions & 22 deletions downstairs/src/extent_inner_raw.rs
@@ -325,24 +325,42 @@ impl ExtentInner for RawInner {
// operation atomic.
self.set_flush_number(new_flush, new_gen)?;

if cfg!(feature = "omicron-build") {
// no-op, FIOFFS will be called in region_flush
} else {
// Now, we fsync to ensure data is flushed to disk. It's okay to crash
// before this point, because setting the flush number is atomic.
cdt::extent__flush__file__start!(|| {
(job_id.get(), self.extent_number, 0)
});

if let Err(e) = self.file.sync_all() {
/*
* XXX Retry? Mark extent as broken?
*/
return Err(CrucibleError::IoError(format!(
"extent {}: fsync 1 failure: {e:?}",
self.extent_number,
)));
}

cdt::extent__flush__file__done!(|| {
(job_id.get(), self.extent_number, 0)
});
}

cdt::extent__flush__done!(|| { (job_id.get(), self.extent_number, 0) });

@mkeeter (Contributor) commented on Feb 6, 2024:

I weakly feel like this should be at the beginning of post_flush, so that it accounts for the actual flush time (even though that's amortized across many extents)


Ok(())
}

fn post_flush(

A reviewer (Contributor) commented:

It weirds me out that Extent::flush doesn't actually flush, depending on omicron-build. I understand why that's the case, but would like to propose an alternative API. I think we should break flushing behavior into three functions:

  • pre_flush contains everything through self.set_flush_number(..) in the current implementation
  • flush_inner implements the else branch, i.e. calling self.file.sync_all (unconditionally!)
  • post_flush is the same as in this PR

Then, we can have a default implementation of flush on trait ExtentInner which calls these functions one after the other, meaning unit tests that want to flush a single extent can just call flush(). Right now, those unit tests are no longer actually syncing the file to disk!

More importantly, the decision to skip calling flush_inner (i.e. just calling pre_flush + post_flush) is moved into the Region::region_flush implementation, which is right next to the FIOFFS call, which makes it more obvious what's going on.

&mut self,
_new_flush: u64,
_new_gen: u64,
_job_id: JobOrReconciliationId,
) -> Result<(), CrucibleError> {
self.context_slot_dirty.fill(0);

// Check for fragmentation in the context slots leading to worse
// performance, and defragment if that's the case.
@@ -352,15 +370,11 @@
.unwrap_or(0);
self.extra_syscall_count = 0;
self.extra_syscall_denominator = 0;
if extra_syscalls_per_rw > DEFRAGMENT_THRESHOLD {
self.defragment()
} else {
Ok(())
}
}
}
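
To make the refactor proposed in the review comment above concrete, here is a minimal sketch of the suggested trait shape (an illustration of the reviewer's idea, not code from this PR; it assumes JobOrReconciliationId is Copy):

pub(crate) trait ExtentInner: Send + Debug {
    /// Everything up to and including set_flush_number(..).
    fn pre_flush(
        &mut self,
        new_flush: u64,
        new_gen: u64,
        job_id: JobOrReconciliationId,
    ) -> Result<(), CrucibleError>;

    /// Unconditionally syncs the extent file (self.file.sync_all()).
    fn flush_inner(
        &mut self,
        job_id: JobOrReconciliationId,
    ) -> Result<(), CrucibleError>;

    /// Post-sync accounting and clean-up, as added in this PR.
    fn post_flush(
        &mut self,
        new_flush: u64,
        new_gen: u64,
        job_id: JobOrReconciliationId,
    ) -> Result<(), CrucibleError>;

    /// Default implementation: a full flush is the three steps in order,
    /// so unit tests can keep calling flush() and still sync to disk.
    /// Region::region_flush would instead call pre_flush + post_flush and
    /// issue FIOFFS itself when omicron-build is enabled.
    fn flush(
        &mut self,
        new_flush: u64,
        new_gen: u64,
        job_id: JobOrReconciliationId,
    ) -> Result<(), CrucibleError> {
        self.pre_flush(new_flush, new_gen, job_id)?;
        self.flush_inner(job_id)?;
        self.post_flush(new_flush, new_gen, job_id)
    }
}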

#[cfg(test)]
@@ -1852,6 +1866,7 @@ mod test {

// Flush! This will mark all slots as synched
inner.flush(12, 12, JobId(11).into())?;
inner.post_flush(12, 12, JobId(11).into())?;
assert_eq!(inner.active_context[0], ContextSlot::B);
assert_eq!(inner.context_slot_dirty[0], 0b00);

@@ -1904,6 +1919,7 @@

// Flush! This will mark all slots as synched
inner.flush(12, 12, JobId(11).into())?;
inner.post_flush(12, 12, JobId(11).into())?;
assert_eq!(inner.active_context[0], ContextSlot::A);
assert_eq!(inner.context_slot_dirty[0], 0b00);

@@ -2029,6 +2045,7 @@
assert_eq!(inner.extra_syscall_count, 0);
assert_eq!(inner.extra_syscall_denominator, 5);
inner.flush(10, 10, JobId(10).into())?;
inner.post_flush(10, 10, JobId(10).into())?;
assert!(inner.context_slot_dirty.iter().all(|v| *v == 0));

// This should not have changed active context slots!
@@ -2080,6 +2097,7 @@
// trigger defragmentation; after the flush, every context slot should
// be in array A.
inner.flush(11, 11, JobId(11).into())?;
inner.post_flush(11, 11, JobId(11).into())?;

for i in 0..10 {
assert_eq!(
@@ -2132,6 +2150,7 @@
assert_eq!(inner.extra_syscall_count, 0);
assert_eq!(inner.extra_syscall_denominator, 3);
inner.flush(10, 10, JobId(10).into())?;
inner.post_flush(10, 10, JobId(10).into())?;

// This should not have changed active context slots!
for i in 0..10 {
@@ -2185,6 +2204,7 @@
// trigger defragmentation; after the flush, every context slot should
// be in array B (which minimizes copies)
inner.flush(11, 11, JobId(11).into())?;
inner.post_flush(11, 11, JobId(11).into())?;

for i in 0..10 {
assert_eq!(
@@ -2239,6 +2259,7 @@
assert_eq!(inner.extra_syscall_count, 0);
assert_eq!(inner.extra_syscall_denominator, 3);
inner.flush(10, 10, JobId(10).into())?;
inner.post_flush(10, 10, JobId(10).into())?;

// This should not have changed active context slots!
for i in 0..10 {
@@ -2293,6 +2314,7 @@
// trigger defragmentation; after the flush, every context slot should
// be in array A (which minimizes copies)
inner.flush(11, 11, JobId(11).into())?;
inner.post_flush(11, 11, JobId(11).into())?;

for i in 0..10 {
assert_eq!(
@@ -2347,6 +2369,7 @@
assert_eq!(inner.extra_syscall_count, 0);
assert_eq!(inner.extra_syscall_denominator, 2);
inner.flush(10, 10, JobId(10).into())?;
inner.post_flush(10, 10, JobId(10).into())?;

// This should not have changed active context slots!
for i in 0..10 {
@@ -2400,6 +2423,7 @@
// This write didn't add enough extra syscalls to trigger
// defragmentation.
inner.flush(11, 11, JobId(11).into())?;
inner.post_flush(11, 11, JobId(11).into())?;

// These should be the same!
for i in 0..10 {
64 changes: 47 additions & 17 deletions downstairs/src/extent_inner_sqlite.rs
@@ -75,33 +75,63 @@ impl ExtentInner for SqliteInner {
new_gen: u64,
job_id: JobOrReconciliationId,
) -> Result<(), CrucibleError> {
if !self.dirty()? {
/*
* If we have made no writes to this extent since the last flush,
* we do not need to update the extent on disk
*/
return Ok(());
}

// We put all of our metadata updates into a single write to make this
// operation atomic.
self.set_flush_number(new_flush, new_gen)?;

A reviewer (Contributor) commented:

I'm suspicious about this change.

If we lose power right after this line, the metadata DB will tell us that we're at a particular flush (new_flush) and dirty = 0, but the data in the extent file has not necessarily been persisted to disk!

I think the existing implementation – which updates flush number in the DB after syncing the extent file – is correct.
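
A minimal illustration of the ordering invariant the reviewer is defending (hypothetical helper, not code from this PR):

use std::fs::File;

/// Data must be durable before metadata claims it is. If we crash between
/// the two steps, the metadata still says "dirty", which is safe: the
/// extent is simply flushed (or reconciled) again on restart.
fn durable_then_record(
    data_file: &File,
    record_flush_number: impl FnOnce() -> std::io::Result<()>,
) -> std::io::Result<()> {
    data_file.sync_all()?; // 1. persist the user data
    record_flush_number() // 2. only then persist "flushed" metadata
}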


// Used for profiling
let n_dirty_blocks = self.dirty_blocks.len() as u64;

cdt::extent__flush__start!(|| {
(job_id.get(), self.extent_number, n_dirty_blocks)
});

if cfg!(feature = "omicron-build") {
// no-op, FIOFFS will be called in region_flush
} else {
/*
* We must first fsync to get any outstanding data written to disk.
* This must be done before we update the flush number.
*/
cdt::extent__flush__file__start!(|| {
(job_id.get(), self.extent_number, n_dirty_blocks)
});

if let Err(e) = self.file.sync_all() {
/*
* XXX Retry? Mark extent as broken?
*/
crucible_bail!(
IoError,
"extent {}: fsync 1 failure: {e:?}",
self.extent_number,
);
}

cdt::extent__flush__file__done!(|| {
(job_id.get(), self.extent_number, n_dirty_blocks)
});
}

A reviewer (Contributor) commented here:

(Same vibes here about changing to pre_flush / flush_inner / post_flush; you'll obviously have to fix both extent formats if you change the trait)

Ok(())
}

fn post_flush(
&mut self,
new_flush: u64,
new_gen: u64,
job_id: JobOrReconciliationId,
) -> Result<(), CrucibleError> {
// Used for profiling
let n_dirty_blocks = self.dirty_blocks.len() as u64;

// Clear old block contexts. In order to be crash consistent, only
// perform this after the extent fsync is done. For each block
42 changes: 41 additions & 1 deletion downstairs/src/region.rs
@@ -807,7 +807,7 @@ impl Region {
*/
#[instrument]
pub(crate) async fn region_flush_extent<
I: Into<JobOrReconciliationId> + Debug + Copy,
>(
&self,
eid: usize,
@@ -824,10 +824,13 @@
);

let extent = self.get_opened_extent(eid).await;

extent
.flush(flush_number, gen_number, job_id, &self.log)
.await?;

extent.post_flush(flush_number, gen_number, job_id).await?;

Ok(())
}

@@ -905,6 +908,43 @@
result??;
}

if cfg!(feature = "omicron-build") {

A reviewer (Contributor) commented:

Using if cfg!(..) weirds me out because the code is still compiled even if the feature is disabled; I guess the ioctl code builds just fine on MacOS, but I'd rather not!

What about something like this?

#[cfg(feature = "omicron-build")]
{
    #[cfg(not(target_os = "illumos"))]
    compile_error!("cannot use FIOFFS on non-illumos systems");
    
    // .. normal code continues below

// If using `omicron-build`, then we're running on illumos and the
// region is backed by a ZFS dataset. Issue a _FIOFFS call, which
// will result in a `zfs_sync` to the entire region dataset. If this
// feature is enabled then the `extent.flush` calls above will _not_
// sync their data, but will update the flush and gen numbers and
// clear the dirty bit.
use std::io;
use std::os::fd::AsRawFd;
use std::ptr;

// "file system flush", defined in illumos' sys/filio.h
const FIOFFS: libc::c_int = 0x20000000 | (0x66 /*'f'*/ << 8) | 0x42 /*66*/;
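// (That is illumos' _IO('f', 66): IOC_VOID (0x20000000) | ('f' = 0x66) << 8
// | (66 = 0x42), i.e. 0x20006642.)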

// Open the region's mountpoint
let file = File::open(&self.dir)?;

let rc = unsafe {
libc::ioctl(
file.as_raw_fd(),
FIOFFS as _,
ptr::null_mut::<i32>(),
)
};

if rc != 0 {
return Err(io::Error::last_os_error().into());
}

A reviewer (Contributor) commented on lines +918 to +938:

Using nix::ioctl lets this be a little cleaner:

Suggested change (replacing the block above):
use std::os::fd::AsRawFd;
// Open the region's mountpoint
let file = File::open(&self.dir)?;
// "file system flush", defined in illumos' sys/filio.h
const FIOFFS_MAGIC: u8 = b'f';
const FIOFFS_TYPE_MODE: u8 = 66;
nix::ioctl_none!(zfs_fioffs, FIOFFS_MAGIC, FIOFFS_TYPE_MODE);
let rc = unsafe { zfs_fioffs(file.as_raw_fd()) };
if let Err(e) = rc {
let e: std::io::Error = e.into();
return Err(CrucibleError::from(e));
}
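
(For context: nix's ioctl_none! generates roughly unsafe fn zfs_fioffs(fd: c_int) -> nix::Result<c_int>, so the call site gets a Result to match on rather than a raw return code to compare against 0.)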

(we should double-check with DTrace that this actually hits the right function!)

}

// After the bits have been committed to durable storage, execute any
// post-flush routine that needs to happen
for eid in &dirty_extents {
let extent = self.get_opened_extent(*eid).await;
extent.post_flush(flush_number, gen_number, job_id).await?;
}

// Now everything has succeeded, we can remove these extents from the
// flush candidates
match extent_limit {