
prov/shm: proposal for new shm architecture #10693

Closed
wants to merge 8 commits

Conversation

aingerson
Contributor

This is an early look at what I am planning for shm. I am out on vacation for the next two weeks and will not be updating this PR, but I will check back in for any conversations or questions that come up. I'm dropping this here to get eyes on it while I'm gone and to gather early feedback. Thanks in advance!

For context:
We (Intel, AWS, and ORNL) have been working on rearchitecting the provider for a while because of various limitations and difficulties with the existing protocol. sm2 was an attempt at implementing new methods of handling rendezvous-type protocols, but full performance was never achieved and development was abandoned in favor of smaller optimizations in the existing provider. However, this did not solve some of the limitations regarding receiver-side resources and the polling method for the response queue (for example #9853).
This draft PR is a proposal for redoing shm that solves these issues while preserving/improving the performance.

This is a very rough draft and is NOT intended to be reviewed line by line (please don't, it will be a waste of your time). It is not polished. The commits also do not build separately but I separated them into smaller chunks to make it easier to follow. The big one is "new shm" which implements the new queues.

Here's the general gist though:
- The command queue is kept as is since it has proven performant and allows us to implement an easy inline protocol. Each queue entry now has a ptr to the command being used plus a built-in command. The built-in command is only used for the inline protocol. For all other protocols, the ptr is set to a sender-side command (located in a stack in the shm region) translated into peer space.
- Inline messages do not have to be returned - all data for the command is saved within the inline command.
- Inject, iov, and sar messages (mmap is removed) all use a sender-side command which needs to be returned by the receiver. To return the command, the receiver translates the command pointer back into a sender command and inserts it into a second atomic queue dedicated to returned entries. There are two atomic queues to poll, but this doesn't add overhead and it removes assumptions regarding insertion into the return queue (since it runs in parallel with the command stack, we are always guaranteed space in the queue and do not have to handle retries).
- Sar messages no longer need a sar list to poll (like before) since the return and resubmission of the command acts as the trigger for more data.
- Sender-side commands can be held onto by the receiver and returned out of submission order, which means subsequent messages (specifically iovs) do not get blocked because of previous messages not getting matched.
- This also opens the door to adding a CMA-IPC fallback protocol (not in this patch set, will come later).

Like I mentioned, this is a draft and not final. I'm opening this up to get feedback and more eyes on the new implementation, and to surface any expected issues with moving to a model like this.

The PR in its current state passes all fabtests and ubertest (all.test and verify.test), and works with OMPI using lnx (shm+tcp;ofi_rxm and shm+verbs;ofi_rxm), so I feel confident it's in a solid enough state for more folks to run tests on it and gather more performance and functional data. One caveat (which shouldn't affect anyone here): DSA is currently not working, though the implementation logic is all there.

@shijin-aws @a-szegel @amirshehataornl please take a look if you can and give me a good sanity check. Any performance data (or simply a thumbs up or down) would be much appreciated!

My least favorite things about the implementation are the DSA status triggering and the SAR overflow list. I'm trying to figure out a smooth way for the SAR progress to get triggered through the command queue without the potential for the insert to fail. My vision would require modification to the atomic queue code so I'm putting it off for now to just get something working.

Performance-wise, inline, CMA, and SAR are looking stable and competitive, with significant improvement for inject (the inject spin lock was removed).

Let me know what you think and if this is a direction you want to move towards!

@zachdworkin @alex-mckinley

@aingerson
Contributor Author

new_shm.pptx
Attaching slides I presented today

Member

@shefty shefty left a comment


some random comments based on tracing through the code. I'll post architecture questions to the PR

#define SMR_FLAG_HMEM_ENABLED (1 << 3)

//shm region defines
#define SMR_CMD_SIZE 440 /* align with 64-byte cache line */
Member


440 / 64 = 6.875. Some additional clarification on the size is needed given the current comment.

struct smr_pend_entry *sar_entry;
struct smr_cmd *cmd;
struct smr_cmd cmd_cpy;
char msg[SMR_MSG_DATA_LEN];
Member


I like that smr occasionally aligns structure fields, but not always. Helps to keep one on their toes.

struct ofi_mr_entry *ipc_entry;
ofi_hmem_async_event_t async_event;
} rx;
};
Member


Don't need a union with only 1 member.

While I'm here, the name 'pend' doesn't really convey anything about what this structure is storing or how to use it.

return (ep->region->cma_cap_peer == SMR_VMA_CAP_ON ||
peer_smr->xpmem_cap_self == SMR_VMA_CAP_ON);
return ep->region == peer_smr ? ep->region->self_vma_caps :
ep->region->peer_vma_caps;
Member


These caps look like flags, but the entire set is being returned as a boolean. That doesn't look right.

@shefty
Member

shefty commented Feb 4, 2025

For unexpected messages (both new and current arch), do those hold a command slot until the message is posted by the receiver? That is, do unexpected messages result in a form of starvation?

@aingerson
Contributor Author

@shefty In the old architecture, unexpected messages didn't hold a slot in the command queue, but they did hold a slot in the response queue (for large/delivery-complete messages), causing starvation. In the new architecture, they still don't hold a spot in the command queue, and since the response/return queue holds only completed transfers, it is no longer blocked. For CMA messages, we don't offload the unexpected message, so the send-side command is still consumed for the lifetime of the transfer, but it's not blocking progress. The resource is, however, limited by the tx_size, so you wouldn't be able to have more than tx_size outstanding unexpected large messages. Do you think that would be a significant issue?

@shefty
Member

shefty commented Feb 4, 2025

I don't think it would be an issue unless the tx_size is small. It looks like the default size is 1024 for all peers, which I would hope is large enough. I don't remember what the MPICH tests require. Is the size user adjustable?

@aingerson
Contributor Author

I believe the MPICH tests require 64, so I haven't encountered any issues so far. The size is adjustable, so users can always increase it. @shijin-aws said they have an application that is hitting this issue with OMPI. Their current fix is to increase the tx_size, which resolves it but causes a performance issue. I've asked him to get more information on whether this is coming from the application or OMPI and what the use case really requires.

@aingerson aingerson force-pushed the shm_new_draft branch 3 times, most recently from 5ef42d3 to 25ff301 Compare February 12, 2025 23:42
@aingerson aingerson marked this pull request as ready for review February 12, 2025 23:43
ZE IPC protocol was updated to remove the dependency
on the Unix socket code - can be removed

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Remove dates from Intel copyright (no longer recommended)
Remove unneeded headers from .c and .h files
Fix ifdef name for headers
Put headers in "" or <> depending on location
Organize headers in the following order:
- corresponding .h file
- other shm headers
- ofi headers
- system headers
- Within each group, organize alphabetically

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Add helper functions to freestack implementation:
- smr_freestack_avail: return the number of available elements
- smr_freestack_get_index: return the index number of the given element

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Allow use of mr copy function using direction

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Add function to return minimum of 3 values

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
xpmem capability can only have 2 settings - on or off. Turn into bool
for simplicity

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
create function needs to align the allocation with the cache line size

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Replacement of shm protocols with new architecture.

Significant changes:

- Turn the response queue into a return queue for local commands. Inline
commands are still receive side. All commands have an inline option
but a common ptr to the command being used for remote commands.
These commands have to be returned to the sender, but the receive side
can hold onto them as long as needed for the lifetime of the message.

- shm has self and peer caps for each p2p interface (right now just
CMA and xpmem). The support for each of these interfaces was saved
in separate fields, which wasted memory and was confusing. This merges
them into two fields (one for self and one for peer) which hold the
information for all p2p interfaces and are accessed by the P2P type
enums. CMA also needs a flag to track whether CMA support has been
queried yet.

- Move some shm fields around for alignment

- Simplify access to the map to remove the need for a container

- There is a 1:1 relationship with the av and map so just reuse
the util av lock for access to the map as well. This requires some
reorganizing of the locking semantics

- There is nothing in smr_fabric. Remove and just use the util_fabric directly

- Just like on the send side, make the progress functions be an array
of function pointers accessible by the command proto. This cleans up the
parameters of the progress calls and streamlines the calls

- Merge tx and pend entries for simple management of pending operations

- Redefinition of cmd and header for simplicity and easier reading. Also
removes and adds fields for the new architecture

- Refactor async ipc list and turn it into a generic async list to track
asynchronous copies which can be used for any accelerator (GPU or DSA) that
copies locally asynchronously.

- Cleanup naming and organization for readability. Shorten some names to
help with line length and organization

- Fix weird header dependency smr_util.c->smr.h->smr_util.h so that smr_util.c
is only dependent on smr_util.h and is isolated to solely shm region and
protocol definitions. This separates the shm utilities from the provider,
leaving the door open for their reuse if needed.

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
@aingerson aingerson force-pushed the shm_new_draft branch 2 times, most recently from e9dfff4 to b4ad6d4 Compare February 18, 2025 23:04
@aingerson aingerson closed this Feb 21, 2025