feat(s2n-quic-dc): implement recv path packet pool #2483

Merged: 6 commits into main on Mar 5, 2025

Conversation


@camshaft (Contributor) commented Feb 24, 2025

Description of changes:

The current recv buffer code relies on the s2n_quic_dc::msg::recv::Message module. This works well for an implementation that owns all of the messages that come in on a single socket. However, once we start getting into multiplexing multiple receivers on a single socket, the Message struct doesn't really enable that use case, at least not efficiently.

We also run into issues if we ever want to support AF_XDP (since the XDP UMEM wants a contiguous region of packets), GRO (since we can't easily split Messages across packet boundaries without a copy), or dedicated recv tasks/threads that dispatch to the registered channels.

This implementation adds a new socket::recv::pool module, which enables all of these use cases. It works by passing around Descriptor pointers that point back to a region of memory (which can easily support AF_XDP). On drop, the descriptors are put back into a free list to make sure we don't leak segments. Additionally, the implementation returns a Segments iterator, which cheaply splits received packets into GRO segments; those segments can be sent to distinct tasks/threads without worrying about synchronizing the underlying regions.
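To illustrate the drop-to-free-list behavior, here is a minimal sketch; the `FreeList` and `Descriptor` shapes and the id-based indexing are illustrative assumptions, not the actual API:

```rust
use std::sync::{Arc, Mutex};

// Illustrative stand-in for the pool's free list.
struct FreeList {
    // Indices into the contiguous packet region.
    ids: Mutex<Vec<u32>>,
}

// Illustrative stand-in for the pool's descriptor handle.
struct Descriptor {
    id: u32,
    free_list: Arc<FreeList>,
}

impl Drop for Descriptor {
    fn drop(&mut self) {
        // Returning the id to the free list (rather than deallocating)
        // is what keeps segments from leaking out of the packet region.
        self.free_list.ids.lock().unwrap().push(self.id);
    }
}
```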

I've also included an example implementation of a worker that loops, receives packets from a blocking UDP socket, and dispatches them using the new Router trait. The final implementation will probably be a bit more complicated (it needs to support shutting down, busy polling, etc.), but it's a starting point to run tests.
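As a rough sketch of that worker shape (the `Router` method name and signature here are assumptions for illustration, not the crate's actual trait):

```rust
use std::net::{SocketAddr, UdpSocket};

trait Router {
    // Dispatch one received datagram to whichever receiver is
    // registered for it.
    fn route(&mut self, src: SocketAddr, payload: &[u8]);
}

fn worker<R: Router>(socket: UdpSocket, mut router: R) -> std::io::Result<()> {
    let mut buf = [0u8; 65535];
    loop {
        // Blocks until a datagram arrives; a production worker would
        // also need shutdown signaling and possibly busy polling.
        let (len, src) = socket.recv_from(&mut buf)?;
        router.route(src, &buf[..len]);
    }
}
```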

Speaking of tests, I've got a test in a branch that sends and receives packets in a loop using this new allocator. Over localhost, I was able to do about 450 Gbps with GSO/GRO. The descriptor freeing code accounts for about 1% of CPU in this example, so it could possibly be improved slightly, but I think this is an OK start.

(flamegraph image)

Testing:

I've included the model_test, which was run under miri and under bolero with ASAN to try to catch any safety violations. I did initially run into a few issues with leaks and stacked-borrows violations, but have addressed all of them with this change. The model test is quite comprehensive over all of the operations that can happen with the pool, though it does not currently check for atomic-ordering issues, since we don't have loom set up in the s2n-quic-dc crate. That being said, everything that deals with atomics links to the std::sync::Arc code to justify why it's there.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@camshaft force-pushed the camshaft/dc-packet-pool branch 2 times, most recently from 32464a7 to e8932f5 on February 24, 2025 22:59
@camshaft marked this pull request as ready for review February 24, 2025 23:29
@camshaft force-pushed the camshaft/dc-packet-pool branch from e8932f5 to a1e1b25 on February 24, 2025 23:32
@Mark-Simulacrum (Collaborator):

Over localhost, I was able to do about 450 Gbps with GSO/GRO.

That bandwidth feels very close to (main) memory bandwidth (probably ~100 GB/second theoretical) -- and you're hitting 57 GB/second. So that seems good, though a little surprising to me... historically a single UDP socket has peaked at ~300k packets/second in my testing, which is around 300_000*(2**16)*8/1000/1000/1000 = 157.2864 Gbps assuming "perfect" 65k packets.

@camshaft (Contributor, Author):

historically a single UDP socket has peaked at ~300k packets/second

That's probably the difference here - I had a socket per core and reuse port on the receiver. I should have mentioned that in my description.

@Mark-Simulacrum (Collaborator) left a comment:

Will come back to this in a bit, but some initial questions/thoughts...

debug_assert_ne!(mem_refs, 0, "reference count underflow");

// if the free_list is still active (the allocator hasn't dropped) then just push the id
// TODO Weak::upgrade is a bit expensive since it clones the `Arc`, only to drop it again
@Mark-Simulacrum (Collaborator):

This seems unavoidable with a reference count and no separate sync mechanism (e.g., crossbeam-epoch/RCU, global lock, etc.) -- you're fundamentally acquiring and then freeing a lock here on the memory.

@camshaft (Contributor, Author):

Yeah that makes sense. I guess considering the alternatives, it's probably as cheap as it'll get.
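For context, a minimal sketch of the pattern under discussion (names illustrative):

```rust
use std::sync::{Mutex, Weak};

struct FreeList {
    ids: Mutex<Vec<u32>>,
}

fn release(id: u32, free_list: &Weak<FreeList>) {
    // `upgrade` clones the Arc (an atomic increment), and the clone is
    // dropped at the end of the scope (an atomic decrement) -- the cost
    // noted in the TODO above.
    if let Some(list) = free_list.upgrade() {
        list.ids.lock().unwrap().push(id);
    }
    // If the allocator has already been dropped, the id is discarded.
}
```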

let inner = desc.inner();
let addr = unsafe { &mut *inner.address.as_ptr() };
let capacity = inner.capacity() as usize;
let data = unsafe { core::slice::from_raw_parts_mut(inner.data.as_ptr(), capacity) };
@Mark-Simulacrum (Collaborator):

Are we assuming the buffer is always fully initialized upfront?

@camshaft (Contributor, Author):

Yes. I will call that out in implementation requirements.
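One way to satisfy that requirement (a sketch, assuming the region is plain heap memory): zero-initialize the whole region up front so every descriptor's `data` pointer always refers to initialized bytes.

```rust
fn alloc_region(packet_count: usize, packet_size: usize) -> Box<[u8]> {
    // `vec![0; n]` yields fully initialized memory, which makes later
    // `slice::from_raw_parts_mut` calls over sub-ranges sound.
    vec![0u8; packet_count * packet_size].into_boxed_slice()
}
```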

@camshaft force-pushed the camshaft/dc-packet-pool branch 3 times, most recently from eb171ad to 19a5fe3 on February 27, 2025 17:13
@camshaft force-pushed the camshaft/dc-packet-pool branch 2 times, most recently from f50a0ea to 7e6b694 on February 28, 2025 22:11
let packets = {
// TODO use `packet.repeat(packet_count)` once stable
// https://doc.rust-lang.org/stable/core/alloc/struct.Layout.html#method.repeat
Layout::from_size_align(packet.size() * packet_count, packet.align()).unwrap()
@Mark-Simulacrum (Collaborator):

nit: .checked_mul(...)?

@camshaft (Contributor, Author):

yep i can fix that
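The overflow-checked form being suggested would look something like this (a sketch; the helper name is illustrative):

```rust
use core::alloc::Layout;

fn packets_layout(packet: Layout, packet_count: usize) -> Option<Layout> {
    // `checked_mul` turns a size overflow into `None` instead of a
    // panic or a silently wrapped allocation size.
    let size = packet.size().checked_mul(packet_count)?;
    // `Layout::repeat`, once stable, would also account for
    // inter-element padding.
    Layout::from_size_align(size, packet.align()).ok()
}
```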

references: AtomicUsize,
free_list: Weak<dyn FreeList>,
#[allow(dead_code)]
region: Box<dyn 'static + Send + Sync>,
@Mark-Simulacrum (Collaborator):

This is just Region, right? Why are we putting it in a Box?

@camshaft (Contributor, Author):

This is to support both the pool implementation as well as a future AF_XDP UMEM allocation. So I just needed something that is droppable once Memory is freed.
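A sketch of that type erasure (the `Memory` shape here is illustrative): the field exists purely for its Drop side effect, so any owner works.

```rust
struct Memory {
    // Could be a `Box<[u8]>` for the pool, or an AF_XDP UMEM handle
    // later; it only needs to keep the backing region alive until drop.
    _region: Box<dyn Send + Sync + 'static>,
}

fn pool_memory(bytes: Box<[u8]>) -> Memory {
    Memory { _region: Box::new(bytes) }
}
```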

.memory
.as_ref()
.references
.fetch_add(1, Ordering::Relaxed);
@Mark-Simulacrum (Collaborator):

IMO, this points at something being wrong with our ownership model. I think this is because the "inert" Descriptors in the free list currently conceptually don't own the memory, but I think that's the wrong way to do this, and introduces extra cost (including reference count contention on this shared counter).

Instead I would suggest we shape our ownership like this:

  • Free list strongly owns Vec<Unallocated>
    • Each Unallocated contains Arc<Memory>.
    • On Drop of the free list, Unallocated's Drop drops the Arc, letting the backing Memory get released
  • On allocation from the free list, we:
    • create an Unfilled which (1) continues to strongly own the Unallocated's Arc reference to memory (no updates there) and (2) has its own internal reference count for when it needs to move back into the free list (descriptor.references)

Drop impls:

  • Free: no custom Drop
  • Unallocated: no custom Drop, will drop its inner memory Arc, which may eventually release Memory. We ideally would structure the struct such that the drop of the Memory Arc is last in drop order, and probably put an UnsafeCell around the non-Arc parts [1].
  • Unfilled: decrement local reference count, if zero utilize strong memory reference to add back to free list -- probably stick the free list in the Memory block for simplicity?
  • Filled: same as Unfilled

(IIUC, Unfilled is actually always unique, so we could just unilaterally move to the free list in its Drop impl)

If I'm following the structure right, that should eliminate all reference count updates to the Memory Arc -- which makes sense, since we have a fixed quantity of things that own it and they're immediately created alongside it.

Footnotes

  1. per UnsafeCell docs: "given an &T, any part of it that is inside an UnsafeCell<_> may be deallocated during the lifetime of the reference, after the last time the reference is used (dereferenced or reborrowed). Since you cannot deallocate a part of what a reference points to, this means the memory an &T points to can be deallocated only if every part of it (including padding) is inside an UnsafeCell." We need this because otherwise deallocating the Arc eats memory out from under ourselves.

@camshaft (Contributor, Author):

Hmm, I think I'm following. Let me try to refactor and see how it comes out :)

@camshaft (Contributor, Author):

Ok, I think I got it pretty close to what you're describing. The free list now owns all of the memory, and all of the descriptors have strong references to Arc<dyn FreeList>, which eliminates the Weak upgrade while putting the responsibility for releasing everything on the FreeList implementation.
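A minimal sketch of that final shape (names illustrative): descriptors hold a strong Arc<dyn FreeList>, so releasing touches only the descriptor-local counter.

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc, Mutex,
};

trait FreeList: Send + Sync {
    fn release(&self, id: u32);
}

struct VecFreeList {
    ids: Mutex<Vec<u32>>,
}

impl FreeList for VecFreeList {
    fn release(&self, id: u32) {
        self.ids.lock().unwrap().push(id);
    }
}

struct Descriptor {
    id: u32,
    references: AtomicUsize,
    free_list: Arc<dyn FreeList>,
}

impl Descriptor {
    fn release_ref(&self) {
        // Hot path: one atomic on the descriptor itself; the Arc to the
        // free list is never cloned (no Weak::upgrade).
        if self.references.fetch_sub(1, Ordering::AcqRel) == 1 {
            self.free_list.release(self.id);
        }
    }
}
```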


self.queue.rotate_left(1);
rotate_count += 1;
}
@Mark-Simulacrum (Collaborator):

IIUC, this is looking for the first pool which returns an entry on alloc(), and then moving it to the front of the list to (theoretically) speed up subsequent searches. That rests on an assumption that the most recently touched list is the one that's going to get the entry returned soonest, which doesn't feel all that plausible to me? I'd expect sort of the reverse property to be true.

In either case, it seems wasteful to be rotating the queue incrementally -- why not just swap the first and the empty element? Is there some goal with preserving the order of initial allocation of the pools?

@camshaft (Contributor, Author):

Yeah it's a good question. I think it might be possible to have a single free list backed by multiple memory regions, which would avoid probing for a free descriptor.
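For reference, the O(1) swap alternative would look roughly like this (a sketch; `Pool` is a stand-in for the real per-region pool type):

```rust
struct Pool; // stand-in for the real per-region pool

impl Pool {
    fn alloc(&mut self) -> Option<u32> {
        None // placeholder for the real descriptor allocation
    }
}

fn alloc(queue: &mut [Pool]) -> Option<u32> {
    for idx in 0..queue.len() {
        if let Some(entry) = queue[idx].alloc() {
            // O(1), avoiding the incremental rotate_left calls, at the
            // cost of not preserving the pools' original order.
            queue.swap(0, idx);
            return Some(entry);
        }
    }
    None
}
```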

@camshaft force-pushed the camshaft/dc-packet-pool branch from 7e6b694 to 27bed3f on March 5, 2025 01:47
Comment on lines 30 to 33
/// Fundamentally, this is similar to something like `Arc<DescriptorInner>`. However,
/// it doesn't use its own allocation for the Arc layout, and instead embeds the reference
/// counts in the descriptor data. This avoids allocating a new `Arc` every time a packet
/// is received and instead allows the descriptor to be reused.
@Mark-Simulacrum (Collaborator):

Suggested change
/// Fundamentally, this is similar to something like `Arc<DescriptorInner>`. However,
/// it doesn't use its own allocation for the Arc layout, and instead embeds the reference
/// counts in the descriptor data. This avoids allocating a new `Arc` every time a packet
/// is received and instead allows the descriptor to be reused.
/// Fundamentally, this is similar to something like `Arc<DescriptorInner>`. However,
/// unlike Arc which frees back to the global allocator, a Descriptor deallocates into
/// the backing `FreeList`.

Or something like that? I debated mentioning that in the future we could probably use custom allocators (Arc<DescriptorInner, Pool>) but probably not worth doing that.

#[inline]
pub(super) unsafe fn drop_in_place(&self) {
let inner = self.inner();
Arc::decrement_strong_count(Arc::as_ptr(&inner.free_list));
@Mark-Simulacrum (Collaborator):

Hm, I was imagining this would do something like ptr::drop_in_place(self.ptr) -- i.e., we'd just Drop the DescriptorInner itself.

@camshaft (Contributor, Author):

Ah, that's definitely more future-proof. I'll change it.
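That is, roughly (a sketch; `DescriptorInner` is simplified here):

```rust
use core::ptr::NonNull;
use std::sync::Arc;

struct DescriptorInner {
    free_list: Arc<dyn Send + Sync>, // plus address, data pointer, etc.
}

// Dropping the whole DescriptorInner in place runs every field's
// destructor, not just the free_list Arc's, so fields added later
// can't silently leak.
unsafe fn drop_descriptor(ptr: NonNull<DescriptorInner>) {
    // SAFETY: caller guarantees `ptr` is valid and uniquely owned.
    unsafe { core::ptr::drop_in_place(ptr.as_ptr()) }
}
```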

}

#[inline]
fn upgrade(&self) {
@Mark-Simulacrum (Collaborator):

Suggested change
fn upgrade(&self) {
/// SAFETY: Must be called on a currently deallocated (unfilled?) descriptor, this moves the type state into "filled" state.
unsafe fn set_filled(&self) {

Perhaps? I was thinking we'd actually change the type here (fn to_filled(self: Unfilled) -> Filled) to encode this into the types... essentially make this Filled::new and consume the descriptor there.

@camshaft (Contributor, Author):

Agreed that's a better interface. I'll fix it
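The typestate version would look something like this (a sketch; field contents and the method name are illustrative):

```rust
struct Unfilled; // descriptor handle before a packet is written into it
struct Filled;   // descriptor handle holding a received packet

impl Unfilled {
    // Consuming `self` encodes the state transition in the types: no
    // caller can observe the same descriptor as both Unfilled and
    // Filled, so no SAFETY comment is needed.
    fn into_filled(self) -> Filled {
        Filled
    }
}
```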

@camshaft force-pushed the camshaft/dc-packet-pool branch from 27bed3f to f03773e on March 5, 2025 17:25
@camshaft requested a review from Mark-Simulacrum March 5, 2025 17:25
Co-authored-by: Mark Rousskov <thismark@amazon.com>
Mark-Simulacrum previously approved these changes Mar 5, 2025
@camshaft enabled auto-merge (squash) March 5, 2025 17:34
@camshaft merged commit a9e7673 into main Mar 5, 2025
129 of 130 checks passed
@camshaft deleted the camshaft/dc-packet-pool branch March 5, 2025 18:03