
Commit

man/fi_domain: Define resource mgmt unreachable EP
Resource management unreachable EP addresses the issue of issuing RDMA
operations to a connectionless EP which cannot be reached. Examples
include no-route-to-host or target NIC down. Defining this behavior is
important for storage use-cases where NICs may unexpectedly disappear.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
iziemba committed Mar 4, 2025
1 parent 999bba5 commit c254ace
Showing 1 changed file with 28 additions and 18 deletions.
46 changes: 28 additions & 18 deletions man/fi_domain.3.md
@@ -413,17 +413,18 @@ the endpoint is reliable or unreliable, as well as provider and protocol
specific implementation details, as shown in the following table. The
table assumes that all peers enable or disable RM the same.

-| Resource | DGRAM EP-no RM | DGRAM EP-with RM | RDM/MSG EP-no RM | RDM/MSG EP-with RM |
-|:--------:|:-------------------:|:-------------------:|:------------------:|:-----------------:|
-| Tx Ctx | undefined error | EAGAIN | undefined error | EAGAIN |
-| Rx Ctx | undefined error | EAGAIN | undefined error | EAGAIN |
-| Tx CQ | undefined error | EAGAIN | undefined error | EAGAIN |
-| Rx CQ | undefined error | EAGAIN | undefined error | EAGAIN |
-| Target EP | dropped | dropped | transmit error | retried |
-| No Rx Buffer | dropped | dropped | transmit error | retried |
-| Rx Buf Overrun | truncate or drop | truncate or drop | truncate or error | truncate or error |
-| Unmatched RMA | not applicable | not applicable | transmit error | transmit error |
-| RMA Overrun | not applicable | not applicable | transmit error | transmit error |
+| Resource | DGRAM EP-no RM | DGRAM EP-with RM | MSG EP-no RM | MSG EP-with RM | RDM EP-no RM | RDM EP-with RM |
+|:--------:|:-------------------:|:-------------------:|:------------------:|:-----------------:| :------------------:|:-----------------:|
+| Tx Ctx | undefined error | EAGAIN | undefined error | EAGAIN | undefined error | EAGAIN |
+| Rx Ctx | undefined error | EAGAIN | undefined error | EAGAIN | undefined error | EAGAIN |
+| Tx CQ | undefined error | EAGAIN | undefined error | EAGAIN | undefined error | EAGAIN |
+| Rx CQ | undefined error | EAGAIN | undefined error | EAGAIN | undefined error | EAGAIN |
+| Target EP | dropped | dropped | transmit error | retried | transmit error | retried |
+| No Rx Buffer | dropped | dropped | transmit error | retried | transmit error | retried |
+| Rx Buf Overrun | truncate or drop | truncate or drop | truncate or error | truncate or error | truncate or error | truncate or error |
+| Unmatched RMA | not applicable | not applicable | transmit error | transmit error | transmit error | transmit error |
+| RMA Overrun | not applicable | not applicable | transmit error | transmit error | transmit error | transmit error |
+| Unreachable EP | dropped | dropped | not applicable | not applicable | transmit error | transmit error |

The resource column indicates the resource being accessed by a data
transfer operation.
@@ -482,19 +483,28 @@ transfer operation.
operations, or attempt to access outside of the target memory region
will fail, resulting in a transmit error.

+*Unreachable EP*
+: Unreachable endpoint is a connectionless specific scenario where transmit
+operations are issued to unreachable target endpoints. Such scenarios include
+no-route-to-host or down target NIC. For FI_EP_DGRAM endpoints, transmit
+operations targeting an unreachable endpoint will have operation dropped. For
+FI_EP_RDM, target operations targeting an unreachable endpoint will result in
+a transmit error. A provider may choose to set the completion error code to
+FI_EHOSTUNREACH signaling to user the target endpoint is unreachable.
+
When a resource management error occurs on an connected endpoint, the endpoint
will transition into a disabled state and the connection torn down. While
transitioning to disabled, any queued and inflight operations will be dropped.
Connection must be re-established for endpoint to be usable.

When a resource management error occurs on an connectionless endpoint, Target
-EP, No Rx Buffer, Rx Buf Overrun, Unmatched RMA, and RMA Overrun errors will not
-result in endpoint transitioning into a disabled state. Tx Ctx, Rx Ctx, Tx CQ,
-and Rx CQ errors will transition the endpoint into a disabled state. While
-transitioning to disabled, any queued and inflight, local transmit operations
-will be dropped. Endpoints targeting a disabled EP must adhere to the Target EP
-behavior. If the endpoint becomes disabled, the endpoint must be re-enabled
-before it will accept new data transfer operations.
+EP, No Rx Buffer, Rx Buf Overrun, Unmatched RMA, RMA Overrun, and Unreachable EP
+errors will not result in endpoint transitioning into a disabled state. Tx Ctx,
+Rx Ctx, Tx CQ, and Rx CQ errors will transition the endpoint into a disabled
+state. While transitioning to disabled, any queued and inflight, local transmit
+operations will be dropped. Endpoints targeting a disabled EP must adhere to the
+Target EP behavior. If the endpoint becomes disabled, the endpoint must be
+re-enabled before it will accept new data transfer operations.

There is one notable restriction on the protections offered by resource
management. This occurs when resource management is enabled on an
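For readers who want the post-change table in a form that can be queried, here is a minimal sketch in C. It encodes the seven-column table from this commit as a lookup, plus the rule that only Tx Ctx, Rx Ctx, Tx CQ, and Rx CQ errors disable a connectionless endpoint. Every identifier (`ep_kind`, `rm_resource`, `rm_outcome`, `rm_applies`, `rm_error_disables_connectionless_ep`) is hypothetical and is not part of libfabric. The outcome strings are copied from the table, not returned by any provider.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* All identifiers below are hypothetical; none are part of libfabric. */
enum ep_kind { EP_DGRAM, EP_MSG, EP_RDM, EP_KIND_COUNT };

enum rm_resource {
    RES_TX_CTX, RES_RX_CTX, RES_TX_CQ, RES_RX_CQ,
    RES_TARGET_EP, RES_NO_RX_BUFFER, RES_RX_BUF_OVERRUN,
    RES_UNMATCHED_RMA, RES_RMA_OVERRUN, RES_UNREACHABLE_EP,
    RES_COUNT
};

/* Shared rows, indexed [ep_kind][rm_enabled ? 1 : 0]. */
#define ROW_CAPACITY { {"undefined error", "EAGAIN"}, \
                       {"undefined error", "EAGAIN"}, \
                       {"undefined error", "EAGAIN"} }
#define ROW_TARGET   { {"dropped", "dropped"}, \
                       {"transmit error", "retried"}, \
                       {"transmit error", "retried"} }
#define ROW_RMA      { {"not applicable", "not applicable"}, \
                       {"transmit error", "transmit error"}, \
                       {"transmit error", "transmit error"} }

/* Outcome text copied from the table added by this commit. */
static const char *const rm_table[RES_COUNT][EP_KIND_COUNT][2] = {
    [RES_TX_CTX]         = ROW_CAPACITY,
    [RES_RX_CTX]         = ROW_CAPACITY,
    [RES_TX_CQ]          = ROW_CAPACITY,
    [RES_RX_CQ]          = ROW_CAPACITY,
    [RES_TARGET_EP]      = ROW_TARGET,
    [RES_NO_RX_BUFFER]   = ROW_TARGET,
    [RES_RX_BUF_OVERRUN] = { {"truncate or drop", "truncate or drop"},
                             {"truncate or error", "truncate or error"},
                             {"truncate or error", "truncate or error"} },
    [RES_UNMATCHED_RMA]  = ROW_RMA,
    [RES_RMA_OVERRUN]    = ROW_RMA,
    [RES_UNREACHABLE_EP] = { {"dropped", "dropped"},
                             {"not applicable", "not applicable"},
                             {"transmit error", "transmit error"} },
};

/* Look up the documented outcome for one table cell. */
const char *rm_outcome(enum rm_resource res, enum ep_kind ep, bool rm_enabled)
{
    assert(res < RES_COUNT && ep < EP_KIND_COUNT);
    return rm_table[res][ep][rm_enabled ? 1 : 0];
}

/* True unless the table marks the cell "not applicable". */
bool rm_applies(enum rm_resource res, enum ep_kind ep)
{
    return strcmp(rm_outcome(res, ep, false), "not applicable") != 0;
}

/* Connectionless endpoints: only the four local queue resources
 * (Tx Ctx, Rx Ctx, Tx CQ, Rx CQ) move the endpoint to disabled. */
bool rm_error_disables_connectionless_ep(enum rm_resource res)
{
    return res == RES_TX_CTX || res == RES_RX_CTX ||
           res == RES_TX_CQ  || res == RES_RX_CQ;
}
```

For example, `rm_outcome(RES_UNREACHABLE_EP, EP_RDM, true)` yields `"transmit error"`, while `rm_error_disables_connectionless_ep(RES_UNREACHABLE_EP)` is false. That matches the commit's statement that an Unreachable EP error does not disable a connectionless endpoint.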
