build-and-test workflow is failing due to CUDA runtime version #7668
@knzo25: I tried to reproduce the error using Docker, following the CI/CD commands wherever possible. It seems to be an issue on the Docker side.
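For context, a rough sketch of reproducing the CI steps inside a container; the image tag and the package name below are assumptions, not taken from the actual workflow definition:

```bash
# Hypothetical container-based reproduction of the build-and-test steps.
docker run --rm -it ghcr.io/autowarefoundation/autoware:universe-devel bash

# Inside the container, roughly mirroring the CI job:
colcon build --packages-up-to lidar_centerpoint
colcon test --packages-select lidar_centerpoint --event-handlers console_cohesion+
colcon test-result --verbose
```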
@xmfcx: Thanks @knzo25, this explains why it specifically failed for the self-hosted machines. I will try installing the driver on the host machines and see.
@xmfcx: I've installed CUDA 12.3 on both machines. Running again:
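A minimal sketch of a network-repo install of CUDA 12.3 on Ubuntu 22.04 (x86_64); the exact packages pinned by the Autoware ansible role may differ:

```bash
# Add the NVIDIA CUDA apt repository and install the 12.3 toolkit.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-3
```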
@xmfcx: They both failed with the same error: https://github.com/autowarefoundation/autoware.universe/actions/runs/9675999034/job/26698258077#step:15:22044 😕 The host machines have the necessary drivers and CUDA packages, and I've updated the rest of the machines as well. Here are the results from the host machines for both:
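For reference, a minimal sketch of the usual host-side checks for this kind of verification, assuming a standard Ubuntu install:

```bash
nvidia-smi                        # driver version and the highest CUDA runtime it supports
nvcc --version                    # installed CUDA toolkit version
cat /proc/driver/nvidia/version   # loaded NVIDIA kernel module version
```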
@knzo25: I have about the same as you 😢
I was going to recommend a reboot, but you already did it. In the past, when changing versions, unloading and reloading the kernel modules worked when nvidia-smi did not, but that is not the case here. Do you have the CUDA samples on that machine, to check whether those run?
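A minimal sketch of the CUDA-samples check suggested above, assuming the standalone cuda-samples repository (newer revisions build with CMake rather than per-sample Makefiles):

```bash
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make            # build the deviceQuery sample
./deviceQuery   # should list the GPU and finish with Result = PASS if the runtime works
```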
@xmfcx: I followed the regular steps as always while installing from here: https://github.com/autowarefoundation/autoware/tree/main/ansible/roles/cuda#manual-installation I will install nvidia-driver-550 and try again (this is what I have on my daily work PC as well).
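A sketch of that driver step, assuming nvidia-driver-550 is available from the configured apt repositories:

```bash
sudo apt-get install -y nvidia-driver-550
sudo reboot   # the new kernel modules are picked up after a reboot
```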
@xmfcx: I think I've understood everything now. The failing test is VoxelGeneratorTest.TwoFramesNoTf.

What are the changes that caused it?
In the new code, CUDA calls are being made. Before these changes, I think no serious CUDA calls were being made; most of them could probably run on the CPU too.

What are the runner specs?

GitHub hosted runners
These are CPU-only. Right now every job, except the ones handled by the self-hosted runners below, runs on them.

Self-hosted runners
We have 2 machines here:
These run:

Then how did it pass the b&t-diff in the first place?
This is the first fishy part from the lidar_centerpoint PR b&t-diff CI run: on my high-end machine it takes noticeably longer, so this is too fast for this package. And looking at its tests: almost no tests are performed, including the failing one. I didn't investigate deeper into why this didn't run.

Verdict
I think that, until this PR, no serious CUDA code was in the colcon tests. For the CUDA-only tests to run, we need CUDA-capable machines with GPUs. These tests can be run on neither the GitHub-hosted machines nor the CPU-only AWS runner that we have.

I have now installed the nvidia_container_toolkit on that machine and started the test again. But it will probably fail, because I don't think GitHub passes the GPUs through when it initiates the containers. I will look into how I can do that for that machine.
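A hedged sketch of verifying GPU passthrough into containers once nvidia_container_toolkit is installed; the CUDA image tag is illustrative:

```bash
# Register the NVIDIA runtime with Docker (part of nvidia-container-toolkit setup).
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# If the GPUs are passed through correctly, nvidia-smi works inside the container.
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
```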
@xmfcx: Found the bug in:
@xmfcx: The tests themselves are alright: once I configured the leo-copper machine, it passed the entire build-and-test successfully.
But we will have to disable the tests that fail on non-CUDA-capable machines, because we don't have the infrastructure ready to handle GPU-based testing for every PR. I will open an issue to track the disabled tests so they can be re-enabled once CUDA-capable machines are back.
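One possible way to skip the GPU-dependent test cases on a CPU-only machine, as a sketch; the package and test names come from this thread, and whether this matches the approach actually taken in the repository is an assumption:

```bash
# Google Test reads GTEST_FILTER from the environment; a leading '-' excludes matching cases.
GTEST_FILTER='-VoxelGeneratorTest.*' \
  colcon test --packages-select lidar_centerpoint --event-handlers console_cohesion+
```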
@knzo25 starting from this PR, the CI for build-and-test started failing. For some reason the CI has passed for the build-and-test-differential-cuda checks, yet it fails the build-and-test checks.

Originally posted by @xmfcx in #6989 (comment)