
[Issue]: PCI BDF returned by rsmi_dev_pci_id_get() wrong on partitioned MI300A? #208

Open
bgoglin opened this issue Nov 27, 2024 · 6 comments

Comments

@bgoglin

bgoglin commented Nov 27, 2024

Problem Description

Hello
I am debugging a hwloc issue with users from the El Capitan supercomputer (MI300A). I don't have access to the hardware, hence it's a bit complicated for me to get all details.
It looks like the PCI BDF returned by rsmi_dev_pci_id_get() is wrong when called on a non-first partition of a partitioned MI300A. The root GPU BDF is something like 0001:02:00.0; rocm-smi --showbus reports BDFs like 0001:02:00.1 and 0001:02:00.2 for the 2nd and 3rd partitions. I first thought that you were hotplugging additional PCI functions when partitioning a GPU, but that doesn't seem to be the case. According to my contact, these PCI BDFs do not actually exist in the system (lspci) even after enabling partitioning.
I looked at the documentation of rsmi_dev_pci_id_get(); it says the partition ID is actually encoded in the 64-bit returned value between the bus and domain bits, not inside the PCI function bits.
Is this a documentation bug? Or an implementation bug?

Operating System

Linux

CPU

MI300A

GPU

MI300A

ROCm Version

ROCm 6.1.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@ppanchad-amd

Hi @bgoglin. Internal ticket has been created to investigate your issue. Thanks!

@jamesxu2
Contributor

jamesxu2 commented Dec 4, 2024

Hi @bgoglin, just to update you - I've been able to reproduce this issue on a local MI300A system. Thanks for the detailed description, and I'll follow up once I know more.

@jamesxu2
Contributor

Hi @bgoglin, here are my findings:

According to my contact, these PCI BDFs do not actually exist in the system (lspci) even after enabling partitioning.

While rocm-smi shows the partition ID embedded in the low function bits, these devices do not show up in lspci. The unpartitioned device is the only one with an "accurate" BDF reported by rocm-smi. This is intentional: function ID manipulation only happens inside the ROCm stack, and the partitioning is not exposed to the rest of the OS (we are not, as you say, hotplugging PCIe devices).

We have alternative methods to address individual partitions, like the deviceID API.

Do you have a specific use case where this difference between the BDFs reported by rocm-smi (and visible to the ROCm stack) and the BDFs visible to the OS causes an issue?

[SPX mode]
$ rocm-smi --showhw

================================= ROCm System Management Interface =================================
====================================== Concise Hardware Info =======================================
GPU  NODE  DID     GUID   GFX VER  GFX RAS  SDMA RAS  UMC RAS   VBIOS  BUS           PARTITION ID
0    4     0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0000:01:00.0  0
1    5     0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0001:01:00.0  0
2    6     0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0002:01:00.0  0
3    7     0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0003:01:00.0  0
====================================================================================================
======================================= End of ROCm SMI Log ========================================

[TPX mode]
$ rocm-smi --showhw


================================= ROCm System Management Interface =================================
====================================== Concise Hardware Info =======================================
GPU  NODE  DID     GUID   GFX VER  GFX RAS  SDMA RAS  UMC RAS   VBIOS  BUS           PARTITION ID
0    4     0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0000:01:00.0  0
1    5     0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0000:01:00.1  1
2    6     0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0000:01:00.2  2
3    7     0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0001:01:00.0  0
4    8     0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0001:01:00.1  1
5    9     0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0001:01:00.2  2
6    10    0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0002:01:00.0  0
7    11    0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0002:01:00.1  1
8    12    0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0002:01:00.2  2
9    13    0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0003:01:00.0  0
10   14    0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0003:01:00.1  1
11   15    0x74a0  XXXXX gfx940   ENABLED  ENABLED   DISABLED  N/A    0003:01:00.2  2
====================================================================================================
======================================= End of ROCm SMI Log ========================================

@bgoglin
Author

bgoglin commented Dec 12, 2024

The use case is @eleon from LLNL on the El Capitan supercomputer using hwloc in https://github.com/LLNL/mpibind. hwloc gets each partition from the ROCm SMI library, but it then fails to place those GPU partitions in the (PCI) topology because the reported BDF doesn't exist. I understand your explanation above, but it seems to contradict what the documentation of rsmi_dev_pci_id_get() says? I'd like a clarification in the doc to handle this case. For instance, may I assume that every time ROCm SMI reports a PCI function F > 0, it is actually partition #F of the PCI device with function = 0?

@jamesxu2
Contributor

@bgoglin I agree that should be documented better.

For instance, may I assume that every time ROCm SMI reports a PCI function F > 0, it is actually partition #F of the PCI device with function = 0?

Yes, for MI-series devices, the PCIe function bits are only used to reflect the partition ID. I'll start on a documentation change to clarify this; thanks for bringing it up.

bgoglin added a commit to bgoglin/hwloc that referenced this issue Dec 13, 2024
rsmi_dev_pci_id_get() returns the GPU "partition ID" inside the PCI BDF function,
but this virtual function isn't actually exposed to the OS.
See ROCm/rocm_smi_lib#208 for details.

When hwloc fails to find the corresponding PCI device, if the BDF function is > 0,
get the RSMI partition ID, compare it with the BDF function, and try to get the
PCI device with func = 0 instead.

Thanks to Edgar Leon for reporting the issue.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
bgoglin added a commit to bgoglin/hwloc that referenced this issue Dec 14, 2024
rsmi_dev_pci_id_get() returns the GPU "partition ID" inside the PCI BDF function,
but this virtual function isn't actually exposed to the OS.
See ROCm/rocm_smi_lib#208 for details.

When hwloc fails to find the corresponding PCI device, if the BDF function is > 0,
get the RSMI partition ID, compare it with the BDF function, and try to get the
PCI device with func = 0 instead.

Thanks to Edgar Leon for reporting the issue.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
bgoglin added a commit to bgoglin/hwloc that referenced this issue Jan 10, 2025
rsmi_dev_pci_id_get() returns the GPU "partition ID" inside the PCI BDF function,
but this virtual function isn't actually exposed to the OS.
See ROCm/rocm_smi_lib#208 for details.

When hwloc fails to find the corresponding PCI device (usually gets the above bridge instead),
if the BDF function is > 0, get the RSMI partition ID, compare it with the BDF function,
and try to get the PCI device with func = 0 instead.

rsmi_dev_partition_id_get() was only added in ROCm 6.2, so configure-check it.

Thanks to Edgar Leon for reporting and debugging the issue.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
bgoglin added a commit to open-mpi/hwloc that referenced this issue Jan 10, 2025
rsmi_dev_pci_id_get() returns the GPU "partition ID" inside the PCI BDF function,
but this virtual function isn't actually exposed to the OS.
See ROCm/rocm_smi_lib#208 for details.

When hwloc fails to find the corresponding PCI device (usually gets the above bridge instead),
if the BDF function is > 0, get the RSMI partition ID, compare it with the BDF function,
and try to get the PCI device with func = 0 instead.

rsmi_dev_partition_id_get() was only added in ROCm 6.2, so configure-check it.

Thanks to Edgar Leon for reporting and debugging the issue.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
(cherry picked from commit aef721f)
@bgoglin
Author

bgoglin commented Jan 27, 2025

I assume the documentation change is 67a0de4

rahulc1984 pushed a commit that referenced this issue Feb 18, 2025
- To address #208, where the use of fake BDFs for partitions can cause
confusion. This note is already in the comments of the function
definition, but was not updated in the function declaration.
- Fix broken formatting for the location table for PCIE coordinate fields
- Tracked in SWDEV-501108

Change-Id: Ic85439866cb836bb43acc52314a7f1d026c3215d