feat(s2n-quic-dc): implement recv path packet pool #2483
Conversation
Force-pushed from 32464a7 to e8932f5
Force-pushed from e8932f5 to a1e1b25
That bandwidth feels very close to (main) memory bandwidth (probably ~100 GB/second theoretical) -- and you're hitting 57 GB/second. So that seems good, though a little surprising to me... historically a single UDP socket has peaked at ~300k packets/second in my testing, which is around …

That's probably the difference here - I had a socket per core and reuse port on the receiver. I should have mentioned that in my description.
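For reference, a minimal sketch of that kind of per-core receiver setup using the `socket2` crate; the crate choice, port number, and exact calls are assumptions for illustration, not what the benchmark actually used:

```rust
use socket2::{Domain, Protocol, Socket, Type};
use std::net::SocketAddr;

fn bind_reuseport(addr: SocketAddr) -> std::io::Result<std::net::UdpSocket> {
    let socket = Socket::new(Domain::IPV4, Type::DGRAM, Some(Protocol::UDP))?;
    // SO_REUSEPORT lets one socket per core bind the same address, and the
    // kernel spreads incoming packets across them.
    socket.set_reuse_port(true)?;
    socket.bind(&addr.into())?;
    Ok(socket.into())
}

fn main() -> std::io::Result<()> {
    // Hypothetical address; one socket per core, each polled by its own recv thread.
    let addr: SocketAddr = "0.0.0.0:4433".parse().unwrap();
    let sockets: Vec<_> = (0..std::thread::available_parallelism()?.get())
        .map(|_| bind_reuseport(addr))
        .collect::<Result<_, _>>()?;
    println!("bound {} reuseport sockets", sockets.len());
    Ok(())
}
```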
Will come back to this in a bit, but some initial questions/thoughts...
debug_assert_ne!(mem_refs, 0, "reference count underflow");

// if the free_list is still active (the allocator hasn't dropped) then just push the id
// TODO Weak::upgrade is a bit expensive since it clones the `Arc`, only to drop it again
This seems unavoidable with a reference count and no separate sync mechanism (e.g., crossbeam-epoch/RCU, global lock, etc.) -- you're fundamentally acquiring and then freeing a lock here on the memory.
Yeah that makes sense. I guess considering the alternatives, it's probably as cheap as it'll get.
let inner = desc.inner();
let addr = unsafe { &mut *inner.address.as_ptr() };
let capacity = inner.capacity() as usize;
let data = unsafe { core::slice::from_raw_parts_mut(inner.data.as_ptr(), capacity) };
Are we assuming the buffer is always fully initialized upfront?
Yes. I will call that out in the implementation requirements.
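As a tiny illustration of that requirement (nothing here is the actual implementation), the backing region could simply be allocated zero-filled so the raw sub-slices handed out later never expose uninitialized bytes:

```rust
/// Hypothetical helper: allocate the packet region fully initialized up
/// front, so later `core::slice::from_raw_parts_mut` calls over sub-ranges
/// never observe uninitialized memory.
fn alloc_region(packet_count: usize, packet_size: usize) -> Box<[u8]> {
    vec![0u8; packet_count * packet_size].into_boxed_slice()
}
```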
Force-pushed from eb171ad to 19a5fe3
Force-pushed from f50a0ea to 7e6b694
let packets = {
    // TODO use `packet.repeat(packet_count)` once stable
    // https://doc.rust-lang.org/stable/core/alloc/struct.Layout.html#method.repeat
    Layout::from_size_align(packet.size() * packet_count, packet.align()).unwrap()
nit: `.checked_mul(...)`?
Yep, I can fix that.
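A quick sketch of what the checked version might look like (the function name and error messages are just for illustration):

```rust
use core::alloc::Layout;

/// Hypothetical helper showing the overflow-checked multiplication suggested
/// above, in place of `packet.size() * packet_count`.
fn packets_layout(packet: Layout, packet_count: usize) -> Layout {
    let size = packet
        .size()
        .checked_mul(packet_count)
        .expect("packet region size overflows usize");
    Layout::from_size_align(size, packet.align()).expect("invalid packet region layout")
}
```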
references: AtomicUsize,
free_list: Weak<dyn FreeList>,
#[allow(dead_code)]
region: Box<dyn 'static + Send + Sync>,
This is just `Region`, right? Why are we putting it in a Box?
This is to support both the `pool` implementation as well as a future AF_XDP UMEM allocation. So I just needed something that is droppable once `Memory` is freed.
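A small sketch of that idea (the types here are illustrative stand-ins, not the actual crate types): the concrete region only has to be droppable, so it can be type-erased behind the boxed trait object.

```rust
/// Stand-in for the heap-backed region the `pool` implementation uses.
struct PoolRegion {
    _buf: Vec<u8>,
}

/// Stand-in for a future AF_XDP UMEM mapping that must be released on drop.
struct UmemRegion;

impl Drop for UmemRegion {
    fn drop(&mut self) {
        // the real type would unmap the UMEM here
    }
}

/// `Memory` only needs "something droppable once it is freed", so the field
/// erases the concrete region type.
struct Memory {
    region: Box<dyn Send + Sync + 'static>,
}

fn pool_memory() -> Memory {
    Memory { region: Box::new(PoolRegion { _buf: vec![0u8; 1 << 20] }) }
}

fn umem_memory() -> Memory {
    Memory { region: Box::new(UmemRegion) }
}
```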
.memory
    .as_ref()
    .references
    .fetch_add(1, Ordering::Relaxed);
IMO, this points at something being wrong with our ownership model. I think this is because the "inert" Descriptors in the free list currently conceptually don't own the memory, but I think that's the wrong way to do this, and introduces extra cost (including reference count contention on this shared counter).
Instead I would suggest we shape our ownership like this:
- Free list strongly owns `Vec<Unallocated>`
  - Each `Unfilled` contains `Arc<Memory>`.
  - On Drop of the free list, `Unallocated`'s Drop drops the Arc, letting the backing `Memory` get released.
- On allocation from the free list, we:
  - create an `Unfilled` which (1) continues to strongly own the `Unallocated`'s Arc reference to memory (no updates there) and (2) has its own internal reference count for when it needs to move back into the free list (`descriptor.references`)

Drop impls:

- `Free`: no custom Drop
- `Unallocated`: no custom Drop, will drop its inner memory Arc, which may eventually release `Memory`. We ideally would structure the struct such that the drop of the Memory Arc is last in drop order, and probably put an `UnsafeCell` around the non-Arc parts[^1].
- `Unfilled`: decrement local reference count; if zero, utilize the strong memory reference to add back to the free list -- probably stick the free list in the `Memory` block for simplicity?
- `Filled`: same as `Unfilled`

(IIUC, `Unfilled` is actually always unique, so we could just unilaterally move to the free list in its Drop impl)
If I'm following the structure right, that should eliminate all reference count updates to the Memory Arc -- which makes sense, since we have a fixed quantity of things that own it and they're immediately created alongside it.
[^1]: Per the `UnsafeCell` docs: "given an &T, any part of it that is inside an UnsafeCell<_> may be deallocated during the lifetime of the reference, after the last time the reference is used (dereferenced or reborrowed). Since you cannot deallocate a part of what a reference points to, this means the memory an &T points to can be deallocated only if every part of it (including padding) is inside an UnsafeCell." We need this because otherwise deallocating the Arc eats memory out from under ourselves.
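A minimal Rust sketch of the ownership shape described above. Every name is illustrative, and the `Mutex<Vec<_>>` free list plus the separate `Arc<FreeList>` handle (rather than embedding the free list in the `Memory` block) are simplifications to keep the sketch acyclic and self-contained:

```rust
use std::sync::{Arc, Mutex};

/// Backing packet region; released once the free list and all outstanding
/// descriptors have dropped their `Arc`s.
struct Memory {
    // raw region pointer / layout would live here
}

/// Inert descriptor parked in the free list; its `Arc` keeps `Memory` alive.
struct Unallocated {
    memory: Arc<Memory>,
}

/// The pool side: strongly owns the parked descriptors, so dropping it lets
/// the backing `Memory` go away once outstanding descriptors come home.
struct FreeList {
    slots: Mutex<Vec<Unallocated>>,
}

/// Descriptor handed out for receiving; it keeps the `Arc<Memory>` it was
/// created with, so the shared count on `Memory` is untouched while packets
/// are processed.
struct Unfilled {
    memory: Arc<Memory>,
    free_list: Arc<FreeList>,
}

impl FreeList {
    fn alloc(list: &Arc<FreeList>) -> Option<Unfilled> {
        let unallocated = list.slots.lock().unwrap().pop()?;
        Some(Unfilled { memory: unallocated.memory, free_list: list.clone() })
    }
}

impl Drop for Unfilled {
    fn drop(&mut self) {
        // `Unfilled` is unique (see the note above), so its Drop can
        // unconditionally park the slot back in the free list. A real
        // implementation would move `self.memory` out (e.g. via
        // `ManuallyDrop`) instead of cloning it here.
        let memory = self.memory.clone();
        self.free_list.slots.lock().unwrap().push(Unallocated { memory });
    }
}
```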
Hmm, I think I'm following. Let me try to refactor and see how it comes out :)
Ok, I think I got it pretty close to what you're describing. The free list now owns all of the memory, and all of the descriptors have strong references to `Arc<dyn FreeList>`, which eliminates the `Weak` upgrade while putting the ownership of releasing everything on the `FreeList` implementation.
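Roughly, the shape after that refactor looks something like this sketch (the trait and method names are illustrative, not the exact code in the PR):

```rust
use std::sync::Arc;

/// Illustrative free-list interface; descriptors hold a strong handle to it,
/// so releasing never needs a `Weak::upgrade`.
trait FreeList: Send + Sync {
    fn release(&self, descriptor_id: usize);
}

struct Descriptor {
    id: usize,
    free_list: Arc<dyn FreeList>,
}

impl Drop for Descriptor {
    fn drop(&mut self) {
        // Returning the id is all the descriptor does; the `FreeList`
        // implementation owns the memory and decides when to free it.
        self.free_list.release(self.id);
    }
}
```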
self.queue.rotate_left(1);
rotate_count += 1;
}
IIUC, this is looking for the first pool which returns an entry on alloc(), and then moving it to the front of the list to (theoretically) speed up subsequent searches. That rests on an assumption that the most recently touched list is the one that's going to get the entry returned soonest, which doesn't feel all that plausible to me? I'd expect sort of the reverse property to be true.
In either case, it seems wasteful to be rotating the queue incrementally -- why not just swap the first and the empty element? Is there some goal with preserving the order of initial allocation of the pools?
Yeah it's a good question. I think it might be possible to have a single free list backed by multiple memory regions, which would avoid probing for a free descriptor.
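For illustration, the swap-based alternative mentioned above could look roughly like this (the pool and allocation types are placeholders):

```rust
/// Probe the pools in order and move the first one that yields a descriptor
/// to the front with a single swap, instead of rotating the whole queue.
fn alloc_from<P, T>(
    pools: &mut [P],
    mut try_alloc: impl FnMut(&mut P) -> Option<T>,
) -> Option<T> {
    for index in 0..pools.len() {
        if let Some(descriptor) = try_alloc(&mut pools[index]) {
            // O(1) move-to-front; the relative order of the other pools
            // changes, which is the trade-off being discussed above.
            pools.swap(0, index);
            return Some(descriptor);
        }
    }
    None
}
```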
Force-pushed from 7e6b694 to 27bed3f
/// Fundamentally, this is similar to something like `Arc<DescriptorInner>`. However,
/// it doesn't use its own allocation for the Arc layout, and instead embeds the reference
/// counts in the descriptor data. This avoids allocating a new `Arc` every time a packet
/// is received and instead allows the descriptor to be reused.
Suggested change:

/// Fundamentally, this is similar to something like `Arc<DescriptorInner>`. However,
/// unlike Arc which frees back to the global allocator, a Descriptor deallocates into
/// the backing `FreeList`.
Or something like that? I debated mentioning that in the future we could probably use custom allocators (`Arc<DescriptorInner, Pool>`) but probably not worth doing that.
#[inline]
pub(super) unsafe fn drop_in_place(&self) {
    let inner = self.inner();
    Arc::decrement_strong_count(Arc::as_ptr(&inner.free_list));
Hm, I was imagining this would do something like `ptr::drop_in_place(self.ptr)` -- i.e., we'd just Drop the `DescriptorInner` itself.
Ah, that's definitely more future-proof. I'll change it.
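Sketching the suggested direction (the `Descriptor`/`DescriptorInner` layout here is assumed for illustration, not copied from the PR):

```rust
use core::ptr::{self, NonNull};
use std::sync::Arc;

/// Assumed inner layout: whatever owns the free-list handle and metadata.
struct DescriptorInner {
    free_list: Arc<dyn Send + Sync>,
}

struct Descriptor {
    inner: NonNull<DescriptorInner>,
}

impl Descriptor {
    /// Dropping the whole inner value in place means any future fields get
    /// cleaned up automatically, instead of hand-decrementing strong counts.
    ///
    /// # Safety
    /// Must be called exactly once, with no outstanding references to the
    /// inner value.
    unsafe fn drop_in_place(&self) {
        unsafe { ptr::drop_in_place(self.inner.as_ptr()) }
    }
}
```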
}

#[inline]
fn upgrade(&self) {
Suggested change:

/// SAFETY: Must be called on a currently deallocated (unfilled?) descriptor, this moves the type state into the "filled" state.
unsafe fn set_filled(&self) {
Perhaps? I was thinking we'd actually change the type here (`fn to_filled(self: Unfilled) -> Filled`) to encode this into the types... essentially make this `Filled::new` and consume the descriptor there.
Agreed that's a better interface. I'll fix it
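A rough sketch of the typestate approach being agreed on here (the fields and the fill callback are invented for illustration):

```rust
/// Descriptor waiting to be filled by a recv call.
struct Unfilled {
    data: Vec<u8>,
}

/// Descriptor that has received `len` bytes; only reachable by consuming an
/// `Unfilled`, so "must be unfilled first" is enforced at compile time.
struct Filled {
    data: Vec<u8>,
    len: usize,
}

impl Unfilled {
    /// Fill the buffer and transition to the `Filled` state.
    fn fill(mut self, recv: impl FnOnce(&mut [u8]) -> usize) -> Filled {
        let len = recv(&mut self.data);
        Filled { data: self.data, len }
    }
}

impl Filled {
    fn payload(&self) -> &[u8] {
        &self.data[..self.len]
    }
}
```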
Force-pushed from 27bed3f to f03773e
Co-authored-by: Mark Rousskov <thismark@amazon.com>
Description of changes:
The current recv buffer code relies on the `s2n_quic_dc::msg::recv::Message` module. This works well for an implementation that owns all of the messages that come in on a single socket. However, once we start getting into multiplexing multiple receivers on a single socket, the `Message` struct doesn't really enable that use case, at least not efficiently.

We also run into issues if we ever want to support AF_XDP (since the XDP UMEM wants a contiguous region of packets), GRO (since we can't easily split `Message`s across packet boundaries without a copy), or dedicated `recv` tasks/threads that dispatch to the registered channels.

This implementation adds a new `socket::recv::pool` module, which enables all of these use cases. It works by passing around `Descriptor` pointers that point back to a region of memory (which can easily support AF_XDP). On drop, the descriptors get put back into a free list to make sure we don't leak segments. Additionally, the implementation will return a `Segments` iterator, which cheaply splits received packets up into GRO segments that can be sent to distinct tasks/threads without worrying about synchronizing those regions.

I've also included an example implementation of a worker that loops and receives packets from a blocking UDP socket and dispatches them using the new `Router` trait. The final implementation will probably be a bit more complicated - it probably needs to support shutting down, busy polling, etc. - but it's a starting point to run tests.

Speaking of tests, I've got a test in a branch that sends and receives packets in a loop using this new allocator. Over localhost, I was able to do about 450 Gbps with GSO/GRO. The descriptor free code accounts for about 1% CPU in this example, so it's possible it could be slightly improved, but I think this is an OK start.
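As a conceptual illustration of the worker/`Router` flow described above (the real trait, descriptor types, and zero-copy segment handling in `socket::recv::pool` differ; this sketch just copies out of a plain buffer):

```rust
use std::net::{SocketAddr, UdpSocket};

/// Simplified stand-in for the dispatch trait: route one segment to whichever
/// task/channel owns its flow.
trait Router {
    fn route(&mut self, remote: SocketAddr, segment: &[u8]);
}

/// Blocking recv loop that dispatches every datagram it receives.
fn worker<R: Router>(socket: UdpSocket, router: &mut R) -> std::io::Result<()> {
    let mut buf = vec![0u8; 65536];
    loop {
        // The real pool fills a descriptor from the shared packet region and
        // splits it into GRO segments without copying; here we just hand the
        // received bytes straight to the router.
        let (len, remote) = socket.recv_from(&mut buf)?;
        router.route(remote, &buf[..len]);
    }
}
```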
Testing:
I've included the `model_test`, which was run under `miri` and `bolero` with ASAN to try and catch any issues around safety violations. I did initially run into a few issues with leaks and stacked borrow violations, but have addressed all of those issues with this change. The model test is quite comprehensive on all of the operations that can happen with the pool, though it does not currently check for atomic issues, since we don't have loom set up in the `s2n-quic-dc` crate. That being said, everything that deals with atomics links to the `std::sync::Arc` code, to justify why it's there.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.