Cornelis upstream 2025 2 28 #10833

Merged
merged 41 commits into from
Feb 28, 2025

charlesshereda
Contributor

Changes to opx provider code include:

  • Reformat using a somewhat-modified version of the upstream clang-format rules
  • Removal of sending immediate data in send_rzv when not using host mem (performance)
  • More code changes to enable support for CN5000, including entropy support and 3B LID changes
  • Rate control defaults
  • RZV payload processing improvements
  • fi_writedata() implementation and resolution of associated reliability errors
  • Fixes to hint checking and capabilities setting
  • CUDA managed/unified memory handling
  • pkey default change and other pkey fixes
  • HFI1 direct verbs support
  • Enhanced support for our simulation environment
  • Coverity scan defect resolution
  • Debug print fixes
  • Link bounce support for CN5000 (JKR) and associated fixes
  • FI_MR_VIRT_ADDR implementation
  • Set route control based on packet type for CN5000
  • OPX Tracer fixes
  • Enable TID by default
  • Write a CQ entry for a successful data transfer operation by default
  • Default RTS/CTS to in-order route control for CN5000
  • Replacement of an intranode hashmap with an array
  • Changes to fi_opx_addr in preparation for context sharing changes
  • Switch to using cycle timer if CPUs are on same socket
  • Cornelis Networks github actions changes
  • Removal of reliability handshake
  • Modifications to processing of unexpected packets
  • Disable out-of-order route control if TID is enabled for CN5000
  • Add HMEM handle for GDRCopy for GET and PUT
  • Move CUDA sync attribute setting to MR registration
  • Other cleanup and minor bug fixes

belynam and others added 30 commits February 28, 2025 11:55
…ot host memory

Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Changes to support 3B Lids in 16B headers for JKR.

Signed-off-by: Archana Venkatesha <archana.venkatesha@cornelisnetworks.com>
Out of order headers are always queued for processing in PSN order.

However, now RZV payloads will be processed immediately, even if
they are out of order.

Add a debug counter for this.

Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Signed-off-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com>
- OPX now checks the application's requested capabilities and fails
  if it cannot support what was requested. OPX logs this.
- OPX now explicitly sets its default caps if the application does not
  set any cap hints. OPX logs when this occurs.
- OPX now respects the application's hints. Whatever the application
  requests for general caps is exactly what OPX will set. Previously,
  if OPX could support all of the requested/hinted caps, it did not set
  the capabilities equal to what the app requested. Instead, OPX
  overrode them with additional capabilities, which should only be allowed
  if the application does not provide hints->caps.
- OPX now properly sets tx_attr->caps and rx_attr->caps equal
  to the OPX defaults (when hints are null) or to what the
  application requested (when hints are provided).
- OPX now passes the correct capabilities to the choose_domain function.
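
The capability rules above can be sketched roughly as follows. This is a minimal illustration, not the provider's code: the `CAP_*` bits, `PROV_SUPPORTED_CAPS`, `PROV_DEFAULT_CAPS`, and `resolve_caps()` are hypothetical names invented for the example, and the real `FI_*` capability bits are defined by libfabric.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical capability bits, for illustration only.
 * The real FI_* values come from the libfabric headers. */
#define CAP_MSG    (1ULL << 0)
#define CAP_TAGGED (1ULL << 1)
#define CAP_RMA    (1ULL << 2)
#define CAP_ATOMIC (1ULL << 3)

/* Assumed provider limits for this sketch. */
#define PROV_SUPPORTED_CAPS (CAP_MSG | CAP_TAGGED | CAP_RMA)
#define PROV_DEFAULT_CAPS   (CAP_MSG | CAP_TAGGED)

/*
 * Resolve the caps to report back to the application.
 * Returns 0 and fills *out_caps on success, or -1 when the
 * application asked for something the provider cannot support.
 */
static int resolve_caps(const uint64_t *hint_caps, uint64_t *out_caps)
{
	/* No hints (or empty caps): fall back to provider defaults and log it. */
	if (hint_caps == NULL || *hint_caps == 0) {
		fprintf(stderr, "no cap hints given, using provider defaults\n");
		*out_caps = PROV_DEFAULT_CAPS;
		return 0;
	}

	/* Any requested bit outside the supported set is a failure. */
	uint64_t unsupported = *hint_caps & ~PROV_SUPPORTED_CAPS;
	if (unsupported != 0) {
		fprintf(stderr, "unsupported caps requested: 0x%llx\n",
			(unsigned long long) unsupported);
		return -1;
	}

	/* Supported hints are returned exactly as requested, with no extras. */
	*out_caps = *hint_caps;
	return 0;
}
```

In this sketch the same resolved value would then be copied into both the tx and rx attribute caps, matching the behavior the commit message describes.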

Signed-off-by: Thomas Huber <thomas.huber@cornelisnetworks.com>
Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Signed-off-by: Archana Venkatesha <archana.venkatesha@cornelisnetworks.com>
Signed-off-by: Elias Kozah <elias.elkozah@cornelisnetworks.com>
Signed-off-by: Jack Morrison <jack.morrison@cornelisnetworks.com>
Signed-off-by: Jack Morrison <jack.morrison@cornelisnetworks.com>
Signed-off-by: Thomas Huber <thomas.huber@cornelisnetworks.com>
Add build and runtime support for the HFI1 direct verbs interface.

Configure support is based on finding infiniband/hfi1dv.h.
Configure with OPX_RDMA_CORE=0 to disable it.

Runtime support is based on dlopen() of the hfi1verbs library.
The fallback is the old character device ioctl() interface.
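
A rough sketch of the runtime selection described above, assuming the decision is simply "use direct verbs if the library loads, otherwise use the character device". The `hfi_backend` enum and `select_backend()` are hypothetical names for illustration; the library file name is left as a parameter because it is not given in this commit message.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical backend identifiers for this sketch. */
enum hfi_backend {
	HFI_BACKEND_DIRECT_VERBS, /* library loaded via dlopen() */
	HFI_BACKEND_CHAR_DEV      /* fallback: character device ioctl() path */
};

/*
 * Try to load the direct verbs library at runtime.
 * lib_name must be a non-NULL file name chosen by the caller.
 * On success *handle_out holds the dlopen() handle; on failure it is
 * set to NULL and the character device backend is selected.
 */
static enum hfi_backend select_backend(const char *lib_name, void **handle_out)
{
	void *handle = dlopen(lib_name, RTLD_NOW | RTLD_LOCAL);

	if (handle == NULL) {
		/* dlerror() would describe the failure; omitted for brevity. */
		*handle_out = NULL;
		return HFI_BACKEND_CHAR_DEV;
	}

	*handle_out = handle;
	return HFI_BACKEND_DIRECT_VERBS;
}
```

Loading with `dlopen()` rather than linking directly means a binary built with direct verbs support can still run on a system where the library is absent. On older glibc versions this needs `-ldl` at link time.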

Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
- Mixed network internal test only support
- Additional header debug (FI_WARN)
- Remove unnecessary FI_WARNs
- New link bounce warnings
- Add internal-use-only, undocumented FI_OPX_VERBOSE_SELECTION

Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Signed-off-by: Elias Kozah <elias.elkozah@cornelisnetworks.com>
Co-authored-by: Elias Kozah <elias.elkozah@cornelisnetworks.com>
Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Signed-off-by: Archana Venkatesha <archana.venkatesha@cornelisnetworks.com>
Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Added mr_mode bit check. OPX will set mr_mode to include
FI_MR_VIRT_ADDR if the application requests it. We check whether we want
to use FI_MR_VIRT_ADDR features by examining the domain's mr_mode bit that
OPX sets.
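
As a sketch of what checking that bit changes: with FI_MR_VIRT_ADDR, peers address a registered region by virtual address; without it, they use an offset from the start of the region. The helper below is hypothetical, and `SKETCH_MR_VIRT_ADDR` is a stand-in constant rather than the libfabric definition.

```c
#include <stdint.h>

/* Stand-in for the FI_MR_VIRT_ADDR mode bit; the real value is
 * defined by libfabric and is not relied on here. */
#define SKETCH_MR_VIRT_ADDR (1u << 4)

/*
 * Compute the address a peer would put in an RMA request for `target`,
 * which lies inside a registered region starting at `region_base`.
 * The choice depends on the mr_mode bits set on the domain.
 */
static uint64_t rma_target_addr(uint32_t domain_mr_mode,
				uintptr_t region_base, uintptr_t target)
{
	if (domain_mr_mode & SKETCH_MR_VIRT_ADDR) {
		/* Virtual addressing: use the target's address directly. */
		return (uint64_t) target;
	}

	/* Offset addressing: zero-based offset into the region. */
	return (uint64_t) (target - region_base);
}
```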

Signed-off-by: Thomas Huber <thomas.huber@cornelisnetworks.com>
Co-authored-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com>
Define hfi packet "types" for OPX protocols to enable
finer grained control of these protocols.

Set route control for those protocol packets.

Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Signed-off-by: Thomas Huber <thomas.huber@cornelisnetworks.com>
The environment variable FI_OPX_EXPECTED_RECEIVE_ENABLE is
now deprecated, and expected receive (TID) is now enabled by
default.

A new environment variable FI_OPX_TID_DISABLE can be used to
override the default and disable TID by specifying
FI_OPX_TID_DISABLE=1.
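
A minimal sketch of the new default, assuming the variable is read straight from the environment with `getenv()`. The provider's actual parameter handling is not shown in this thread, and `tid_enabled()` is a hypothetical helper. Only the documented value `1` is treated as "disable" here.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * TID (expected receive) is on by default.
 * Setting FI_OPX_TID_DISABLE=1 turns it off; any other value,
 * or an unset variable, leaves the default in place in this sketch.
 */
static bool tid_enabled(void)
{
	const char *value = getenv("FI_OPX_TID_DISABLE");

	if (value != NULL && strcmp(value, "1") == 0) {
		return false;
	}
	return true;
}
```

From a shell, disabling TID for a single run would then look like `FI_OPX_TID_DISABLE=1 ./app`, where `./app` stands for any application using the OPX provider.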

Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
"Partial" updates are normal and shouldn't generate a warning.

Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Correct context flags for a completed fi_writedata,
 set origin_rs for RMA RTS/CTS exchange,
 fix pio_state update,
 fix payload bytes setting on RMA RTS send,
 and fix RMA tx/rx cq and op_flags usage

Signed-off-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com>
…fault

Update OPX so that by default, data transfer operations write CQ
completion entries into the associated completion queue after they have successfully completed.

Signed-off-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com>
-Update the man page

Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Change fi_opx_addr to include subctxt and a 9-bit RX, and change reliability handling to use Rx instead of Tx.

Signed-off-by: Archana Venkatesha <archana.venkatesha@cornelisnetworks.com>
belynam and others added 11 commits February 28, 2025 12:02
Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
…flow triggers

Signed-off-by: Jack Morrison <jack.morrison@cornelisnetworks.com>
Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Signed-off-by: Charles Shereda <charles.shereda@cornelisnetworks.com>
… (TID) packets

Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Removes unnecessary code to check whether packets in the unexpected queue were intranode.
Signed-off-by: Cody Mann <cody.mann@cornelisnetworks.com>
Temporary workaround for a known bug.

Signed-off-by: Bob Cernohous <bob.cernohous@cornelisnetworks.com>
Signed-off-by: Ben Lynam <Ben.Lynam@cornelisnetworks.com>
Signed-off-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com>
Signed-off-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com>
Signed-off-by: Lindsay Reiser <lindsay.reiser@cornelisnetworks.com>
@charlesshereda
Contributor Author

@j-xiong Should I create a separate PR against the release candidate?

@j-xiong
Contributor

j-xiong commented Feb 28, 2025

@charlesshereda No, I will rebase my PR.

@j-xiong j-xiong merged commit 21898d1 into ofiwg:main Feb 28, 2025
13 checks passed
9 participants