-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/tcp: drop stale messages for new endpoint #10783
base: main
Are you sure you want to change the base?
Conversation
@shefty I just want to bring this PR into your attention. It can fix the problem on our end but I'm not sure if this is a proper fix. I'm open to suggestions. |
Tagged messages can invoke a 3-way handshake:
Is the problem you're reporting that the sender of the cts restarts, such that when 'data' arrives, it doesn't match a receive anymore? In the eager message protocol, tagged messages are just sent using xnet_op_tag. There is no reply (except for delivery complete, which generates an ACK). The patch modifies xnet_handle_tag(), which is called when receiving either xnet_op_tag_rts or xnet_op_tag. In both cases, these show up as new messages and would take the unexpected message path. In the case an xnet_op_cts or xnet_op_data message arrives unexpectedly, the code should trap that and report an error internally. That should tear the connection down, which is probably correct, since somehow the 2 peers aren't talking who they think they're talking with. |
(I don't know what
The reply refers to the RPC reply in terms of upper layer RPC processing.
It is @soumagne feel free to chime in here |
When this happens, it will leave a lot of data in the TCP recvq and sendq, as the follows:
In this case, the server got restarted was daos-server-4. It will block the message processing and leave a lot of data in TCP recvq |
RTS = ready to send. CTS = clear to send. RTS is a request to send. CTS is an ack that it's now okay to send. I don't understand the issue. Was the tcp socket not torn down? The server restarted, but kept the socket between instances? xnet_handle_tag() is processing new messages. |
@shefty thanks for looking into this. This is the log I collected non_rescure.log.gz with this patch applied: logs.diff.gz Please look for the message
After the read error message, there is no more activities for the fd 2257 The log that proved that the tag was used in previous instance is here (I didn't save this one though):
From mercury, the tag value is started from 1 and monotonously increased. As you can see, the first message after starting has the tag value 18, which matched the tag value of last posted reciving buffer. |
Adding some context to what @jxiong is seeing. We use RDM endpoints, he has in this example 2 peers regularly pinging each other by sending RPCs (request + response). At some point one of the peers goes down while the other peer is in the process of sending an RPC response back to that peer on a specific tag value (the tag value is currently unique per RPC). When the peer goes down, it does tear down the connection as expected, but the other peer is still trying to send its response. When the peer is brought back up and the endpoint is recreated, the other peer that is still trying to send its response reconnects and the response is now seen as an unexpected message, since in this case no recv has been preposted yet as it belonged to the previous instance and there will never be any matching recv in the future either. What @jxiong was trying to do is that instead of disabling the polling, he was trying to drop the message for which no tag could be matched against so that the recv buffer would not potentially become full. |
An option to discard any unexpected tagged message is possible, but I don't know what apps could actually use it. Unexpected messages are a way of life with tagged messages. The problem here is that the application states are out of sync. I don't know that this can be fixed by a lower-layer network library. Now, there may be an issue when polling on a socket is disabled. Polling is disabled when it can't make progress until the application takes some action, usually posting a receive. Leaving it enabled results in never ending poll events. If the socket is disconnected while polling is disabled, that may never be discovered. Periodically enabling polling might resolve this. (This is all hypothetical, I'm not looking at the code currently.) |
This is probably true. I believe @jxiong has also confirmed POLLIN is never reset. Maybe in xnet_handle_tag when we check if the unexp_entry is empty, we could re-enable polling in the else case (where a unexp_entry is already tracked and we're previously disabled polling) |
This will definitely help us. It would be a good workaround for us while we're looking for a long term solution. And it will be pretty safe for this use case because daos will pre-post receiving buffer while sending a RPC as far as I can see, therefore messages with unmatched tag can be discarded.
If the messages can be properly delivered to upper layer, the state can be recovered. The actual problem here is that for a listening socket, the operations of accepting endpoint and reading message is sort of atomic, therefore the upper layer doesn't have a chance to properly 'configure' the new endpoint (setting Of course, mercury should encode the tag with incarnation value so that it can identify stale messages. Right now, it's possible that a stale reply could be wrongly used as a legit reply. |
@jxiong - I'm okay with adding some EP configuration that can always discard unexpected tagged messages. Maybe this could be a new FI_TAG_RPC_CLIENT setting. @ooststep - We would want some mechanism that can work in the absence of the application needing to do anything. I don't have a good answer here. |
@shefty sorry I don't get it. Can you please give me a little more context? or can I just add a configurable item here https://github.com/ofiwg/libfabric/blob/main/prov/tcp/src/xnet_init.c#L76 |
We can introduce a new FI_TAG_RPC or FI_TAG_RPC_CLIENT with similar restrictions. I.e. wildcard source matching can't be used. Receive buffers will always be pre-posted, so unexpected messages aren't needed. Etc. DAOS has a different use case for tag messages than MPI or CCL. Let's capture it properly. Providers than have an option to optimize for that use case, or continue supporting it with default behavior. |
mercury pr: mercury-hpc/mercury#784 @shefty can you please take a look if the current pr is okay for you? |
@@ -346,6 +346,7 @@ enum { | |||
FI_TAG_BITS, | |||
FI_TAG_MPI, | |||
FI_TAG_CCL, | |||
FI_TAG_RPC, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move this change to a separate PR that also updates to the man page which describes what exactly this tag format implies. A separate PR is preferred, as this is an update to the API which needs broader review than just the tcp provider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#10792 <- I moved the tag definition in this PR
This new mem_tag_format tells libfabric tha tags are being used to match RPC requests and replies. Therefore, messages with unmatched tags are stale and can be discarded by libfabric. Signed-off-by: Jinshan Xiong <jinshanx@google.com>
When an communication instance is restarted, the new instance may still receive 'stale' tagged replies that belong to the previous instance. In the current implementation, it will disable the polling against the new endpoint and then leave lots of normal messages unprocessed. This PR is trying to fix the issue by dropping those messages with unmatched tag and if the endpoint is entirely new. DAOS ticket: https://daosio.atlassian.net/browse/DAOS-17103 Signed-off-by: Jinshan Xiong <jinshanx@google.com>
6b91a7a
to
811223d
Compare
When an communication instance is restarted, the new instance may still receive 'stale' tagged replies that belong to the previous instance. In the current implementation, it will disable the polling against the new endpoint and then leave lots of normal messages unprocessed.
This PR is trying to fix the issue by dropping those messages with unmatched tag and if the endpoint is entirely new.
DAOS ticket: https://daosio.atlassian.net/browse/DAOS-17103