feat(s2n-quic-dc): implement recv path packet pool #2483

Merged: 6 commits into main on Mar 5, 2025

Conversation


@camshaft (Contributor) commented Feb 24, 2025

Description of changes:

The current recv buffer code relies on the s2n_quic_dc::msg::recv::Message module. This works well for an implementation that owns all of the messages that come in on a single socket. However, once we start getting into multiplexing multiple receivers on a single socket, the Message struct doesn't really enable that use case, at least not efficiently.

We also run into issues if we ever want to support AF_XDP (since the XDP UMEM wants a contiguous region of packets), GRO (since we can't easily split Messages across packet boundaries without a copy), or dedicated recv tasks/threads that dispatch to the registered channels.

This implementation adds a new socket::recv::pool module, which enables all of these use cases. It works by passing around Descriptor pointers that point back to a region of memory (which can easily support AF_XDP). On drop, the descriptors are put back into a free list to make sure we don't leak segments. Additionally, the implementation returns a Segments iterator, which cheaply splits received packets into GRO segments; those segments can be sent to distinct tasks/threads without worrying about synchronizing the underlying regions.
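To illustrate the drop-to-free-list behavior, here is a minimal sketch; the `FreeList` and `Descriptor` shapes and the id-based indexing are illustrative assumptions, not the actual API:

```rust
use std::sync::{Arc, Mutex};

// Illustrative stand-in for the pool's free list.
struct FreeList {
    // Indices into the contiguous packet region.
    ids: Mutex<Vec<u32>>,
}

// Illustrative stand-in for the pool's descriptor handle.
struct Descriptor {
    id: u32,
    free_list: Arc<FreeList>,
}

impl Drop for Descriptor {
    fn drop(&mut self) {
        // Returning the id to the free list (rather than deallocating)
        // is what keeps segments from leaking out of the packet region.
        self.free_list.ids.lock().unwrap().push(self.id);
    }
}
```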

I've also included an example implementation of a worker that loops, receives packets from a blocking UDP socket, and dispatches them using the new Router trait. The final implementation will probably be a bit more complicated (it needs to support shutting down, busy polling, etc.), but it's a starting point to run tests.
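As a rough sketch of that worker shape (the `Router` method name and signature here are assumptions for illustration, not the crate's actual trait):

```rust
use std::net::{SocketAddr, UdpSocket};

trait Router {
    // Dispatch one received datagram to whichever receiver is
    // registered for it.
    fn route(&mut self, src: SocketAddr, payload: &[u8]);
}

fn worker<R: Router>(socket: UdpSocket, mut router: R) -> std::io::Result<()> {
    let mut buf = [0u8; 65535];
    loop {
        // Blocks until a datagram arrives; a production worker would
        // also need shutdown signaling and possibly busy polling.
        let (len, src) = socket.recv_from(&mut buf)?;
        router.route(src, &buf[..len]);
    }
}
```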

Speaking of tests, I've got a test in a branch that sends and receives packets in a loop using this new allocator. Over localhost, I was able to do about 450 Gbps with GSO/GRO. The descriptor freeing code accounts for about 1% of CPU in this example, so it could possibly be improved slightly, but I think this is an OK start.

(flamegraph image)

Testing:

I've included the model_test, which was run under miri and under bolero with ASAN to try to catch any safety violations. I did initially run into a few issues with leaks and stacked-borrows violations, but have addressed all of them with this change. The model test is quite comprehensive over all of the operations that can happen with the pool, though it does not currently check for atomic-ordering issues, since we don't have loom set up in the s2n-quic-dc crate. That being said, everything that deals with atomics links to the std::sync::Arc code to justify why it's there.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@camshaft force-pushed the camshaft/dc-packet-pool branch 2 times, most recently from 32464a7 to e8932f5 on February 24, 2025 22:59
@camshaft marked this pull request as ready for review February 24, 2025 23:29
@camshaft force-pushed the camshaft/dc-packet-pool branch from e8932f5 to a1e1b25 on February 24, 2025 23:32
@Mark-Simulacrum (Collaborator):

Over localhost, I was able to do about 450 Gbps with GSO/GRO.

That bandwidth feels very close to (main) memory bandwidth (probably ~100 GB/second theoretical) -- and you're hitting 57 GB/second. So that seems good, though a little surprising to me... historically a single UDP socket has peaked at ~300k packets/second in my testing, which is around 300_000*(2**16)*8/1000/1000/1000 = 157.2864 Gbps assuming "perfect" 65k packets.

@camshaft (Contributor, Author):

historically a single UDP socket has peaked at ~300k packets/second

That's probably the difference here - I had a socket per core and reuse port on the receiver. I should have mentioned that in my description.

@Mark-Simulacrum (Collaborator) left a comment:

Will come back to this in a bit, but some initial questions/thoughts...

debug_assert_ne!(mem_refs, 0, "reference count underflow");

// if the free_list is still active (the allocator hasn't dropped) then just push the id
// TODO Weak::upgrade is a bit expensive since it clones the `Arc`, only to drop it again
@Mark-Simulacrum (Collaborator):

This seems unavoidable with a reference count and no separate sync mechanism (e.g., crossbeam-epoch/RCU, global lock, etc.) -- you're fundamentally acquiring and then freeing a lock here on the memory.

@camshaft (Contributor, Author):

Yeah that makes sense. I guess considering the alternatives, it's probably as cheap as it'll get.
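For context, a minimal sketch of the pattern under discussion (names illustrative):

```rust
use std::sync::{Mutex, Weak};

struct FreeList {
    ids: Mutex<Vec<u32>>,
}

fn release(id: u32, free_list: &Weak<FreeList>) {
    // `upgrade` clones the Arc (an atomic increment), and the clone is
    // dropped at the end of the scope (an atomic decrement) -- the cost
    // noted in the TODO above.
    if let Some(list) = free_list.upgrade() {
        list.ids.lock().unwrap().push(id);
    }
    // If the allocator has already been dropped, the id is discarded.
}
```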

let inner = desc.inner();
let addr = unsafe { &mut *inner.address.as_ptr() };
let capacity = inner.capacity() as usize;
let data = unsafe { core::slice::from_raw_parts_mut(inner.data.as_ptr(), capacity) };
@Mark-Simulacrum (Collaborator):

Are we assuming the buffer is always fully initialized upfront?

@camshaft (Contributor, Author):

Yes. I will call that out in implementation requirements.
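One way to satisfy that requirement (a sketch, assuming the region is plain heap memory): zero-initialize the whole region up front so every descriptor's `data` pointer always refers to initialized bytes.

```rust
fn alloc_region(packet_count: usize, packet_size: usize) -> Box<[u8]> {
    // `vec![0; n]` yields fully initialized memory, which makes later
    // `slice::from_raw_parts_mut` calls over sub-ranges sound.
    vec![0u8; packet_count * packet_size].into_boxed_slice()
}
```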

@camshaft force-pushed the camshaft/dc-packet-pool branch 3 times, most recently from eb171ad to 19a5fe3 on February 27, 2025 17:13
@camshaft force-pushed the camshaft/dc-packet-pool branch 2 times, most recently from f50a0ea to 7e6b694 on February 28, 2025 22:11
let packets = {
// TODO use `packet.repeat(packet_count)` once stable
// https://doc.rust-lang.org/stable/core/alloc/struct.Layout.html#method.repeat
Layout::from_size_align(packet.size() * packet_count, packet.align()).unwrap()
@Mark-Simulacrum (Collaborator):

nit: .checked_mul(...)?

@camshaft (Contributor, Author):

yep i can fix that
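The overflow-checked form being suggested would look something like this (a sketch; the helper name is illustrative):

```rust
use core::alloc::Layout;

fn packets_layout(packet: Layout, packet_count: usize) -> Option<Layout> {
    // `checked_mul` turns a size overflow into `None` instead of a
    // panic or a silently wrapped allocation size.
    let size = packet.size().checked_mul(packet_count)?;
    // `Layout::repeat`, once stable, would also account for
    // inter-element padding.
    Layout::from_size_align(size, packet.align()).ok()
}
```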

references: AtomicUsize,
free_list: Weak<dyn FreeList>,
#[allow(dead_code)]
region: Box<dyn 'static + Send + Sync>,
@Mark-Simulacrum (Collaborator):

This is just Region, right? Why are we putting it in a Box?

@camshaft (Contributor, Author):

This is to support both the pool implementation as well as a future AF_XDP UMEM allocation. So I just needed something that is droppable once Memory is freed.
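A sketch of that type erasure (the `Memory` shape here is illustrative): the field exists purely for its Drop side effect, so any owner works.

```rust
struct Memory {
    // Could be a `Box<[u8]>` for the pool, or an AF_XDP UMEM handle
    // later; it only needs to keep the backing region alive until drop.
    _region: Box<dyn Send + Sync + 'static>,
}

fn pool_memory(bytes: Box<[u8]>) -> Memory {
    Memory { _region: Box::new(bytes) }
}
```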

.memory
.as_ref()
.references
.fetch_add(1, Ordering::Relaxed);
@Mark-Simulacrum (Collaborator):

IMO, this points at something being wrong with our ownership model. I think this is because the "inert" Descriptors in the free list currently conceptually don't own the memory, but I think that's the wrong way to do this, and introduces extra cost (including reference count contention on this shared counter).

Instead I would suggest we shape our ownership like this:

  • Free list strongly owns Vec<Unallocated>
    • Each Unallocated contains Arc<Memory>.
    • On Drop of the free list, Unallocated's Drop drops the Arc, letting the backing Memory get released
  • On allocation from the free list, we:
    • create an Unfilled which (1) continues to strongly own the Unallocated's Arc reference to memory (no updates there) and (2) has its own internal reference count for when it needs to move back into the free list (descriptor.references)

Drop impls:

  • Free: no custom Drop
  • Unallocated: no custom Drop, will drop its inner memory Arc, which may eventually release Memory. We ideally would structure the struct such that the drop of the Memory Arc is last in drop order, and probably put an UnsafeCell around the non-Arc parts [1].
  • Unfilled: decrement local reference count, if zero utilize strong memory reference to add back to free list -- probably stick the free list in the Memory block for simplicity?
  • Filled: same as Unfilled

(IIUC, Unfilled is actually always unique, so we could just unilaterally move to the free list in its Drop impl)

If I'm following the structure right, that should eliminate all reference count updates to the Memory Arc -- which makes sense, since we have a fixed quantity of things that own it and they're immediately created alongside it.

Footnotes

  1. per UnsafeCell docs: "given an &T, any part of it that is inside an UnsafeCell<_> may be deallocated during the lifetime of the reference, after the last time the reference is used (dereferenced or reborrowed). Since you cannot deallocate a part of what a reference points to, this means the memory an &T points to can be deallocated only if every part of it (including padding) is inside an UnsafeCell." We need this because otherwise deallocating the Arc eats memory out from under ourselves.

@camshaft (Contributor, Author):

Hmm, I think I'm following. Let me try to refactor and see how it comes out :)

@camshaft (Contributor, Author):

Ok, I think I got it pretty close to what you're describing. The free list now owns all of the memory, and all of the descriptors have strong references to Arc<dyn FreeList>, which eliminates the Weak upgrade while putting the responsibility for releasing everything on the FreeList implementation.
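A minimal sketch of that final shape (names illustrative): descriptors hold a strong Arc<dyn FreeList>, so releasing touches only the descriptor-local counter.

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc, Mutex,
};

trait FreeList: Send + Sync {
    fn release(&self, id: u32);
}

struct VecFreeList {
    ids: Mutex<Vec<u32>>,
}

impl FreeList for VecFreeList {
    fn release(&self, id: u32) {
        self.ids.lock().unwrap().push(id);
    }
}

struct Descriptor {
    id: u32,
    references: AtomicUsize,
    free_list: Arc<dyn FreeList>,
}

impl Descriptor {
    fn release_ref(&self) {
        // Hot path: one atomic on the descriptor itself; the Arc to the
        // free list is never cloned (no Weak::upgrade).
        if self.references.fetch_sub(1, Ordering::AcqRel) == 1 {
            self.free_list.release(self.id);
        }
    }
}
```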


self.queue.rotate_left(1);
rotate_count += 1;
}
@Mark-Simulacrum (Collaborator):

IIUC, this is looking for the first pool which returns an entry on alloc(), and then moving it to the front of the list to (theoretically) speed up subsequent searches. That rests on an assumption that the most recently touched list is the one that's going to get the entry returned soonest, which doesn't feel all that plausible to me? I'd expect sort of the reverse property to be true.

In either case, it seems wasteful to be rotating the queue incrementally -- why not just swap the first and the empty element? Is there some goal with preserving the order of initial allocation of the pools?

@camshaft (Contributor, Author):

Yeah it's a good question. I think it might be possible to have a single free list backed by multiple memory regions, which would avoid probing for a free descriptor.
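For reference, the O(1) swap alternative would look roughly like this (a sketch; `Pool` is a stand-in for the real per-region pool type):

```rust
struct Pool; // stand-in for the real per-region pool

impl Pool {
    fn alloc(&mut self) -> Option<u32> {
        None // placeholder for the real descriptor allocation
    }
}

fn alloc(queue: &mut [Pool]) -> Option<u32> {
    for idx in 0..queue.len() {
        if let Some(entry) = queue[idx].alloc() {
            // O(1), avoiding the incremental rotate_left calls, at the
            // cost of not preserving the pools' original order.
            queue.swap(0, idx);
            return Some(entry);
        }
    }
    None
}
```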

@camshaft force-pushed the camshaft/dc-packet-pool branch from 7e6b694 to 27bed3f on March 5, 2025 01:47
Comment on lines 30 to 33
/// Fundamentally, this is similar to something like `Arc<DescriptorInner>`. However,
/// it doesn't use its own allocation for the Arc layout, and instead embeds the reference
/// counts in the descriptor data. This avoids allocating a new `Arc` every time a packet
/// is received and instead allows the descriptor to be reused.
@Mark-Simulacrum (Collaborator):

Suggested change
/// Fundamentally, this is similar to something like `Arc<DescriptorInner>`. However,
/// it doesn't use its own allocation for the Arc layout, and instead embeds the reference
/// counts in the descriptor data. This avoids allocating a new `Arc` every time a packet
/// is received and instead allows the descriptor to be reused.
/// Fundamentally, this is similar to something like `Arc<DescriptorInner>`. However,
/// unlike Arc which frees back to the global allocator, a Descriptor deallocates into
/// the backing `FreeList`.

Or something like that? I debated mentioning that in the future we could probably use custom allocators (Arc<DescriptorInner, Pool>) but probably not worth doing that.

#[inline]
pub(super) unsafe fn drop_in_place(&self) {
let inner = self.inner();
Arc::decrement_strong_count(Arc::as_ptr(&inner.free_list));
@Mark-Simulacrum (Collaborator):

Hm, I was imagining this would do something like ptr::drop_in_place(self.ptr) -- i.e., we'd just Drop the DescriptorInner itself.

@camshaft (Contributor, Author):

Ah, that's definitely more future-proof. I'll change it.
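That is, roughly (a sketch; `DescriptorInner` is simplified here):

```rust
use core::ptr::NonNull;
use std::sync::Arc;

struct DescriptorInner {
    free_list: Arc<dyn Send + Sync>, // plus address, data pointer, etc.
}

// Dropping the whole DescriptorInner in place runs every field's
// destructor, not just the free_list Arc's, so fields added later
// can't silently leak.
unsafe fn drop_descriptor(ptr: NonNull<DescriptorInner>) {
    // SAFETY: caller guarantees `ptr` is valid and uniquely owned.
    unsafe { core::ptr::drop_in_place(ptr.as_ptr()) }
}
```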

}

#[inline]
fn upgrade(&self) {
@Mark-Simulacrum (Collaborator):

Suggested change
fn upgrade(&self) {
/// SAFETY: Must be called on a currently deallocated (unfilled?) descriptor, this moves the type state into "filled" state.
unsafe fn set_filled(&self) {

Perhaps? I was thinking we'd actually change the type here (fn to_filled(self: Unfilled) -> Filled) to encode this into the types... essentially make this Filled::new and consume the descriptor there.

@camshaft (Contributor, Author):

Agreed that's a better interface. I'll fix it
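The typestate version would look something like this (a sketch; field contents and the method name are illustrative):

```rust
struct Unfilled; // descriptor handle before a packet is written into it
struct Filled;   // descriptor handle holding a received packet

impl Unfilled {
    // Consuming `self` encodes the state transition in the types: no
    // caller can observe the same descriptor as both Unfilled and
    // Filled, so no SAFETY comment is needed.
    fn into_filled(self) -> Filled {
        Filled
    }
}
```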

@camshaft force-pushed the camshaft/dc-packet-pool branch from 27bed3f to f03773e on March 5, 2025 17:25
@camshaft requested a review from Mark-Simulacrum March 5, 2025 17:25
Co-authored-by: Mark Rousskov <thismark@amazon.com>
Mark-Simulacrum previously approved these changes Mar 5, 2025
@camshaft enabled auto-merge (squash) March 5, 2025 17:34
@camshaft merged commit a9e7673 into main Mar 5, 2025
129 of 130 checks passed
@camshaft deleted the camshaft/dc-packet-pool branch March 5, 2025 18:03