prov/tcp: drop stale messages for new endpoint #10783

jxiong · 2025-02-11T23:43:05Z

When an communication instance is restarted, the new instance may still receive 'stale' tagged replies that belong to the previous instance. In the current implementation, it will disable the polling against the new endpoint and then leave lots of normal messages unprocessed.

This PR is trying to fix the issue by dropping those messages with unmatched tag and if the endpoint is entirely new.

DAOS ticket: https://daosio.atlassian.net/browse/DAOS-17103

jxiong · 2025-02-11T23:44:56Z

@shefty I just want to bring this PR into your attention. It can fix the problem on our end but I'm not sure if this is a proper fix. I'm open to suggestions.

@frostedcmos @ooststep @soumagne FYI

shefty · 2025-02-12T00:18:48Z

Tagged messages can invoke a 3-way handshake:

    xnet_op_tag_rts ->
        <- xnet_op_cts
    xnet_op_data ->

Is the problem you're reporting that the sender of the cts restarts, such that when 'data' arrives, it doesn't match a receive anymore?

In the eager message protocol, tagged messages are just sent using xnet_op_tag. There is no reply (except for delivery complete, which generates an ACK).

The patch modifies xnet_handle_tag(), which is called when receiving either xnet_op_tag_rts or xnet_op_tag. In both cases, these show up as new messages and would take the unexpected message path.

In the case an xnet_op_cts or xnet_op_data message arrives unexpectedly, the code should trap that and report an error internally. That should tear the connection down, which is probably correct, since somehow the 2 peers aren't talking who they think they're talking with.

jxiong · 2025-02-12T00:38:11Z

Tagged messages can invoke a 3-way handshake:
    xnet_op_tag_rts ->
        <- xnet_op_cts
    xnet_op_data ->
Is the problem you're reporting that the sender of the cts restarts, such that when 'data' arrives, it doesn't match a receive anymore?

(I don't know what cts and rts stand for but probably) yes. And in one of my debug patch, I saw the previous instance called xnet_rdm_trecv() to set up a receiving buffer, and the same tag appeared in the new instance where the other end is trying to deliver a RPC reply.

In the eager message protocol, tagged messages are just sent using xnet_op_tag. There is no reply (except for delivery complete, which generates an ACK).

The reply refers to the RPC reply in terms of upper layer RPC processing.

The patch modifies xnet_handle_tag(), which is called when receiving either xnet_op_tag_rts or xnet_op_tag. In both cases, these show up as new messages and would take the unexpected message path.

In the case an xnet_op_cts or xnet_op_data message arrives unexpectedly, the code should trap that and report an error internally. That should tear the connection down, which is probably correct, since somehow the 2 peers aren't talking who they think they're talking with.

It is xnet_op_tag. The call chain is xnet_handle_tag() -> xnet_save_and_cont that returns false because fi_addr has not set yet; and then it goes here disabling polling and then get stuck there.

@soumagne feel free to chime in here

jxiong · 2025-02-12T00:45:42Z

When this happens, it will leave a lot of data in the TCP recvq and sendq, as the follows:

---------------
daos-server-3
---------------
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0 235692 10.128.0.17:60988       10.128.0.18:31316       ESTABLISHED
...
---------------
daos-server-4
---------------
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp   196530      0 10.128.0.18:31316       10.128.0.17:60988       ESTABLISHED
...

In this case, the server got restarted was daos-server-4. It will block the message processing and leave a lot of data in TCP recvq

shefty · 2025-02-12T01:10:43Z

RTS = ready to send. CTS = clear to send. RTS is a request to send. CTS is an ack that it's now okay to send.

I don't understand the issue. Was the tcp socket not torn down? The server restarted, but kept the socket between instances?

xnet_handle_tag() is processing new messages.

jxiong · 2025-02-12T01:23:40Z

RTS = ready to send. CTS = clear to send. RTS is a request to send. CTS is an ack that it's now okay to send.
...
xnet_handle_tag() is processing new messages.

@shefty thanks for looking into this.

This is the log I collected non_rescure.log.gz with this patch applied: logs.diff.gz

Please look for the message xnet_progress_rx() fd: 2257 read error: -11 in the log where 2257 is the file descriptor of the accepted socket. And you can see the entire connecting process above as:

48712 02/09-15:36:41.36 daos-server-4 DAOS[165295/0/87] external WARN # [145273.312185] mercury->libfabric [warning] tcp:ep_ctrl:191 xnet_open_conn() conn ep: 0x14e2e4364730, addr: ffffffffffffffff, peer: 0x14e2e4095820
248713 02/09-15:36:41.36 daos-server-4 DAOS[165295/0/87] external DBUG # [145273.312196] mercury->libfabric [debug] tcp:ep_ctrl:278 xnet_ep_accept() accepting endpoint connection
248714 02/09-15:36:41.36 daos-server-4 DAOS[165295/0/87] external WARN # [145273.312207] mercury->libfabric [warning] tcp:ep_ctrl:306 xnet_ep_accept() ep: 0x14e2e4364730, fd 2257 remote addr: ffffffffffffffff
248715 02/09-15:36:41.36 daos-server-4 DAOS[165295/0/87] external WARN # [145273.312214] mercury->libfabric [warning] tcp:ep_ctrl:1777 xnet_monitor_sock() Add fd: 2257 ep from monitoring: 1, remote addr: 10.128.0.17:54818
248716 02/09-15:36:41.36 daos-server-4 DAOS[165295/0/87] external DBUG # [145273.312220] mercury->libfabric [debug] tcp:ep_ctrl:554 xnet_handle_event_list() event FI_CONNECTED
248717 02/09-15:36:41.36 daos-server-4 DAOS[165295/0/87] external WARN # [145273.312293] mercury->libfabric [warning] tcp:ep_ctrl:1541 xnet_handle_events() epoll 1 events available
248718 02/09-15:36:41.36 daos-server-4 DAOS[165295/0/87] external WARN # [145273.312301] mercury->libfabric [warning] tcp:ep_ctrl:1500 xnet_run_ep() run_ep fd: 2257 ep event avail, state: 4, event: 1/0/0/1
248719 02/09-15:36:41.36 daos-server-4 DAOS[165295/0/87] external WARN # [145273.312306] mercury->libfabric [warning] tcp:ep_ctrl:1167 xnet_progress_rx() fd: 2257 peer addr: ffffffffffffffff
248720 02/09-15:36:41.36 daos-server-4 DAOS[165295/0/87] external WARN # [145273.312315] mercury->libfabric [warning] tcp:ep_data:1083 xnet_progress_hdr() Received hrd op: 1
248721 02/09-15:36:41.36 daos-server-4 DAOS[165295/0/87] external WARN # [145273.312321] mercury->libfabric [warning] tcp:ep_data:94 xnet_save_and_cont() xxxx ep: 0x14e2e4364730, ep->cur_rx.data_left = 80/-1, ffffffffffffffff
248722 02/09-15:36:41.36 daos-server-4 DAOS[165295/0/87] external WARN # [145273.312326] mercury->libfabric [warning] tcp:ep_data:98 xnet_save_and_cont() ep: 0x14e2e4364730, ep->cur_rx.data_left = 80/-1, ffffffffffffffff
248723 02/09-15:36:41.36 daos-server-4 DAOS[165295/0/87] external WARN # [145273.312334] mercury->libfabric [warning] tcp:ep_data:863 xnet_handle_tag() Disable fd 2257 pollin

After the read error message, there is no more activities for the fd 2257

The log that proved that the tag was used in previous instance is here (I didn't save this one though):

02/10-20:58:40.73 daos-server-4 DAOS[321568/0/3] external WARN # [250992.689111] mercury->libfabric [warning] tcp:ep_ctrl:211 xnet_rdm_trecv() entered with tag = 18
...
02/10-20:59:42.42 daos-server-4 DAOS[321568/0/399] server WARN src/engine/init.c:668 dss_crt_event_cb() raising SIGKILL: exclusion of this engine (rank 1) detected
...
02/10-21:00:05.13 daos-server-4 DAOS[321896/0/87] external WARN # [251077.088342] mercury->libfabric [warning] tcp:ep_data:850 xnet_handle_tag() tag value: 18

From mercury, the tag value is started from 1 and monotonously increased. As you can see, the first message after starting has the tag value 18, which matched the tag value of last posted reciving buffer.

soumagne · 2025-02-12T02:06:03Z

Adding some context to what @jxiong is seeing. We use RDM endpoints, he has in this example 2 peers regularly pinging each other by sending RPCs (request + response). At some point one of the peers goes down while the other peer is in the process of sending an RPC response back to that peer on a specific tag value (the tag value is currently unique per RPC). When the peer goes down, it does tear down the connection as expected, but the other peer is still trying to send its response. When the peer is brought back up and the endpoint is recreated, the other peer that is still trying to send its response reconnects and the response is now seen as an unexpected message, since in this case no recv has been preposted yet as it belonged to the previous instance and there will never be any matching recv in the future either. What @jxiong was trying to do is that instead of disabling the polling, he was trying to drop the message for which no tag could be matched against so that the recv buffer would not potentially become full.
The other approach that we were trying to consider was to avoid using tag messages altogether and drop the message ourselves but that is a bit more involved at this point. Regardless I would not expect that current patch to be integrated as is, ideally maybe there could be an option to have libfabric drop unexpected recvs ? (without using FI_PEEK / FI_DISCARD which require posted recvs)

jxiong · 2025-02-12T03:48:35Z

@soumagne is right and I'd like to work with the community, especially @shefty, on a proper fix.

shefty · 2025-02-12T19:21:32Z

An option to discard any unexpected tagged message is possible, but I don't know what apps could actually use it. Unexpected messages are a way of life with tagged messages.

The problem here is that the application states are out of sync. I don't know that this can be fixed by a lower-layer network library.

Now, there may be an issue when polling on a socket is disabled. Polling is disabled when it can't make progress until the application takes some action, usually posting a receive. Leaving it enabled results in never ending poll events. If the socket is disconnected while polling is disabled, that may never be discovered. Periodically enabling polling might resolve this. (This is all hypothetical, I'm not looking at the code currently.)

ooststep · 2025-02-12T19:50:52Z

Now, there may be an issue when polling on a socket is disabled. Polling is disabled when it can't make progress until the application takes some action, usually posting a receive. Leaving it enabled results in never ending poll events. If the socket is disconnected while polling is disabled, that may never be discovered. Periodically enabling polling might resolve this. (This is all hypothetical, I'm not looking at the code currently.)

This is probably true. I believe @jxiong has also confirmed POLLIN is never reset. Maybe in xnet_handle_tag when we check if the unexp_entry is empty, we could re-enable polling in the else case (where a unexp_entry is already tracked and we're previously disabled polling)

jxiong · 2025-02-12T19:56:00Z

An option to discard any unexpected tagged message is possible, but I don't know what apps could actually use it. Unexpected messages are a way of life with tagged messages.

This will definitely help us. It would be a good workaround for us while we're looking for a long term solution. And it will be pretty safe for this use case because daos will pre-post receiving buffer while sending a RPC as far as I can see, therefore messages with unmatched tag can be discarded.

The problem here is that the application states are out of sync. I don't know that this can be fixed by a lower-layer network library.

If the messages can be properly delivered to upper layer, the state can be recovered.

The actual problem here is that for a listening socket, the operations of accepting endpoint and reading message is sort of atomic, therefore the upper layer doesn't have a chance to properly 'configure' the new endpoint (setting fi_addr), which causes the issue that libfabric lacks of necessary metadata to handle the messages. If the lower-layer libfabric could split accepting and polling, like first accept the connecting request and create a new endpoint, and let the mercury to configure the endpoint and then start polling, that should solve the problem.

Of course, mercury should encode the tag with incarnation value so that it can identify stale messages. Right now, it's possible that a stale reply could be wrongly used as a legit reply.

shefty · 2025-02-12T20:50:04Z

@jxiong - I'm okay with adding some EP configuration that can always discard unexpected tagged messages. Maybe this could be a new FI_TAG_RPC_CLIENT setting.

@ooststep - We would want some mechanism that can work in the absence of the application needing to do anything. I don't have a good answer here.

jxiong · 2025-02-12T22:34:17Z

@shefty sorry I don't get it. Can you please give me a little more context?

or can I just add a configurable item here https://github.com/ofiwg/libfabric/blob/main/prov/tcp/src/xnet_init.c#L76

shefty · 2025-02-12T22:53:13Z

fi_ep_attr::mem_tag_format allows the app to negotiate with the provider regarding how tagged messages are used. The v1 mem_tag_format ended up being rather useless. The v2 API allows for the mem_tag_format to be FI_TAG_MPI or FI_TAG_CCL. The provider can optimize for the differences. For example, FI_TAG_CCL does not allow for wildcard source address (FI_ANY_ADDR) matching.

We can introduce a new FI_TAG_RPC or FI_TAG_RPC_CLIENT with similar restrictions. I.e. wildcard source matching can't be used. Receive buffers will always be pre-posted, so unexpected messages aren't needed. Etc.

DAOS has a different use case for tag messages than MPI or CCL. Let's capture it properly. Providers than have an option to optimize for that use case, or continue supporting it with default behavior.

jxiong · 2025-02-13T22:43:57Z

mercury pr: mercury-hpc/mercury#784

@shefty can you please take a look if the current pr is okay for you?

shefty · 2025-02-14T00:34:17Z

include/rdma/fabric.h

@@ -346,6 +346,7 @@ enum {
 	FI_TAG_BITS,
 	FI_TAG_MPI,
 	FI_TAG_CCL,
+	FI_TAG_RPC,


Please move this change to a separate PR that also updates to the man page which describes what exactly this tag format implies. A separate PR is preferred, as this is an update to the API which needs broader review than just the tcp provider.

#10792 <- I moved the tag definition in this PR

This new mem_tag_format tells libfabric tha tags are being used to match RPC requests and replies. Therefore, messages with unmatched tags are stale and can be discarded by libfabric. Signed-off-by: Jinshan Xiong <jinshanx@google.com>

When an communication instance is restarted, the new instance may still receive 'stale' tagged replies that belong to the previous instance. In the current implementation, it will disable the polling against the new endpoint and then leave lots of normal messages unprocessed. This PR is trying to fix the issue by dropping those messages with unmatched tag and if the endpoint is entirely new. DAOS ticket: https://daosio.atlassian.net/browse/DAOS-17103 Signed-off-by: Jinshan Xiong <jinshanx@google.com>

jxiong mentioned this pull request Feb 13, 2025

NA tcp: Add FI_TAG_RPC support mercury-hpc/mercury#784

Open

shefty reviewed Feb 14, 2025

View reviewed changes

jxiong added 4 commits February 14, 2025 17:38

core: Introduce new tag FI_TAG_RPC

bce443c

This new mem_tag_format tells libfabric tha tags are being used to match RPC requests and replies. Therefore, messages with unmatched tags are stale and can be discarded by libfabric. Signed-off-by: Jinshan Xiong <jinshanx@google.com>

Add FI_TAG_RPC to discard unexpected tagged messages

bfb7043

negotiate the mem_tag_format

811223d

jxiong force-pushed the jxiong/tcp_tag branch from 6b91a7a to 811223d Compare February 15, 2025 01:43

jxiong mentioned this pull request Feb 19, 2025

core: Introduce new tag FI_TAG_RPC #10792

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

prov/tcp: drop stale messages for new endpoint #10783

prov/tcp: drop stale messages for new endpoint #10783

jxiong commented Feb 11, 2025

jxiong commented Feb 11, 2025 •

edited

Loading

shefty commented Feb 12, 2025

jxiong commented Feb 12, 2025 •

edited

Loading

jxiong commented Feb 12, 2025

shefty commented Feb 12, 2025

jxiong commented Feb 12, 2025 •

edited

Loading

soumagne commented Feb 12, 2025 •

edited

Loading

jxiong commented Feb 12, 2025

shefty commented Feb 12, 2025

ooststep commented Feb 12, 2025

jxiong commented Feb 12, 2025 •

edited

Loading

shefty commented Feb 12, 2025

jxiong commented Feb 12, 2025

shefty commented Feb 12, 2025

jxiong commented Feb 13, 2025

shefty Feb 14, 2025

jxiong Feb 15, 2025

prov/tcp: drop stale messages for new endpoint #10783

Are you sure you want to change the base?

prov/tcp: drop stale messages for new endpoint #10783

Conversation

jxiong commented Feb 11, 2025

jxiong commented Feb 11, 2025 • edited Loading

shefty commented Feb 12, 2025

jxiong commented Feb 12, 2025 • edited Loading

jxiong commented Feb 12, 2025

shefty commented Feb 12, 2025

jxiong commented Feb 12, 2025 • edited Loading

soumagne commented Feb 12, 2025 • edited Loading

jxiong commented Feb 12, 2025

shefty commented Feb 12, 2025

ooststep commented Feb 12, 2025

jxiong commented Feb 12, 2025 • edited Loading

shefty commented Feb 12, 2025

jxiong commented Feb 12, 2025

shefty commented Feb 12, 2025

jxiong commented Feb 13, 2025

shefty Feb 14, 2025

Choose a reason for hiding this comment

jxiong Feb 15, 2025

Choose a reason for hiding this comment

jxiong commented Feb 11, 2025 •

edited

Loading

jxiong commented Feb 12, 2025 •

edited

Loading

jxiong commented Feb 12, 2025 •

edited

Loading

soumagne commented Feb 12, 2025 •

edited

Loading

jxiong commented Feb 12, 2025 •

edited

Loading