
build: Add Docker images for CUDA development #12413

Open · wants to merge 12 commits into main
Conversation

@bdice (Contributor) commented Feb 20, 2025

This PR is the first step towards adding a RAPIDS cuDF backend (#12412). It adds CUDA to the adapters CI images. This container will allow us to share a development environment that has CUDA compilers and libraries along with the existing Velox container infrastructure.

@facebook-github-bot added the 'CLA Signed' label Feb 20, 2025
netlify bot commented Feb 20, 2025

Deploy Preview for meta-velox canceled.
Latest commit: a2fc5b3
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/67d04eb6a27abb0008c46564

@bdice bdice changed the title Add Docker images for CUDA development build: Add Docker images for CUDA development Feb 20, 2025
@assignUser (Collaborator):

As discussed, we will likely run the build in the 'adapters' job, which is based on a CentOS image, so installing CUDA 12.8 there by default would be the way to go (it is currently installed in the workflow). If you want these Ubuntu images for local development that's fine, but is there a reason to prefer Ubuntu?

@bdice (Contributor, Author) commented Feb 22, 2025

Thanks @assignUser. I am more familiar with Ubuntu but I think we can make that change. I’ll work on that early next week.

@bdice (Contributor, Author) commented Mar 3, 2025

@assignUser It seems like the adapters images don't use CUDA. I would need to change the base image to something deriving from nvidia/cuda (probably the base flavor, and install our own CUDA Toolkit on top) for the best driver compatibility. Currently this derives from ghcr.io/facebookincubator/velox-dev:centos9. I guess I would need to change adapters.dockerfile to do the same setup from that centos container on top of an nvidia/cuda RHEL-compatible image, and then add the rest of the adapters. Is that still the desired path here?

If there is a better way to do this, please let me know; my current plan (sketched after the list) is:

  1. Edit adapters.dockerfile to start from nvidia/cuda:12.8.0-base-rockylinux9
  2. Copy-paste centos.dockerfile contents into adapters.dockerfile (I can't start from that velox-dev image, since I want a CUDA-friendly base)
  3. Add commands to install CUDA (and build/install cuDF into the image?)
  4. Run the setup-adapters.sh scripts to install everything else
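
A rough Dockerfile sketch of steps 1-4 (file names and invocation details are illustrative assumptions; the discussion below refines them):

# Step 1: CUDA-friendly base instead of ghcr.io/facebookincubator/velox-dev:centos9.
FROM nvidia/cuda:12.8.0-base-rockylinux9

# Step 2: replicate the centos.dockerfile setup on top of the CUDA base.
COPY scripts/setup-centos9.sh /setup-centos9.sh
RUN bash /setup-centos9.sh

# Step 3: install the CUDA Toolkit (a cuDF build/install could follow here).
RUN bash -c 'source /setup-centos9.sh && install_cuda 12.8'

# Step 4: install the remaining adapter dependencies.
COPY scripts/setup-adapters.sh /setup-adapters.sh
RUN bash /setup-adapters.sh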

@assignUser (Collaborator):

Hm, if you want to use a CUDA base image, that makes things a bit trickier. We install CUDA using this function, currently in the workflow; it's so fast that I haven't had a reason to move it into the Dockerfile, but we can do that. Unless it is lacking in some way compared to a proper CUDA base image?
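
For readers without the repo at hand, a hedged sketch of what such an install helper might look like (the real function lives in scripts/setup-centos9.sh; the repo URL and package set below are assumptions, partly reconstructed from the error output quoted later in this thread):

function install_cuda {
  local version_dashed=${1//./-}  # e.g. 12.8 -> 12-8
  # Add NVIDIA's package repository (URL assumed for a RHEL9-compatible distro).
  dnf config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
  # Install a minimal CUDA build environment rather than the full toolkit.
  dnf install -y \
    cuda-compat-${version_dashed} \
    cuda-driver-devel-${version_dashed} \
    cuda-minimal-build-${version_dashed} \
    cuda-nvrtc-devel-${version_dashed}
}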

@bdice (Contributor, Author) commented Mar 3, 2025

There's some additional logic in the nvidia/cuda base images that makes driver compatibility much broader than what a "bare" image would have (using cuda-compat, NVIDIA_REQUIRE_CUDA environment variables, etc.). Recreating that in a separate image is possible but I would advise using nvidia/cuda base images instead.

You can see some of that here: https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.8.0/rockylinux9/base/Dockerfile?ref_type=heads
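
Paraphrased from that Dockerfile, the relevant logic is roughly (abridged sketch, not verbatim):

# Pin the CUDA version and declare the driver constraint that the NVIDIA
# container runtime validates against the host at startup.
ENV CUDA_VERSION=12.8.0
ENV NVIDIA_REQUIRE_CUDA="cuda>=12.8"
# cuda-compat provides forward-compatibility libraries so a newer CUDA
# userspace can run on an older host driver.
RUN dnf install -y cuda-compat-12-8 && dnf clean all
ENV LD_LIBRARY_PATH=/usr/local/cuda/compat:/usr/local/cuda/lib64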

@@ -0,0 +1,37 @@
# Copyright (c) Facebook, Inc. and its affiliates.
#
@jinchengchenghh (Contributor) commented Mar 4, 2025

I hit this exception:

docker build -t cudf-velox -f scripts/ubuntu-22.04-cuda-12.8-cpp.dockerfile ./

[+] Building 2.2s (2/2) FINISHED                                                                                                                                                                docker:default
 => [internal] load build definition from ubuntu-22.04-cuda-12.8-cpp.dockerfile                                                                                                                           0.0s
 => => transferring dockerfile: 1.23kB                                                                                                                                                                    0.0s
 => ERROR [internal] load metadata for docker.io/nvidia/cuda:12.8.0-devel-ubuntu22.04                                                                                                                     2.0s
------
 > [internal] load metadata for docker.io/nvidia/cuda:12.8.0-devel-ubuntu22.04:
------
ubuntu-22.04-cuda-12.8-cpp.dockerfile:18
--------------------
  16 |     ARG tz="Europe/Madrid"
  17 |     
  18 | >>> FROM ${base}
  19 |     
  20 |     SHELL ["/bin/bash", "-o", "pipefail", "-c"]
--------------------
ERROR: failed to solve: nvidia/cuda:12.8.0-devel-ubuntu22.04: failed to resolve source metadata for docker.io/nvidia/cuda:12.8.0-devel-ubuntu22.04: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/nvidia/cuda/manifests/sha256:54f18e2a8e1b3d03f77b9a6dc905533da46ac93a5513f10e8ba8e560db9fa5ab: 429 Too Many Requests - Server message: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit

I know this is because Docker Hub throttles image pulls. Is there any way to avoid this error?
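
One common workaround (an assumption, not project guidance): authenticated pulls get a much higher Docker Hub rate limit, and pre-pulling the base image avoids repeated registry hits:

# Log in with a free Docker Hub account to get the higher per-account pull limit.
docker login
# Pull the base image once; subsequent builds reuse the local copy.
docker pull nvidia/cuda:12.8.0-devel-ubuntu22.04
docker build -t cudf-velox -f scripts/ubuntu-22.04-cuda-12.8-cpp.dockerfile ./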

@assignUser (Collaborator):

@bdice I haven't had time to come to a final conclusion, but I looked at image sizes and the Rocky CUDA image is ~500M vs ~250M for stream9. While that is a big difference, it doesn't really matter until we work on our final image sizes xD The centos9 image is ~2G and the adapters one is a whopping 7G. I'll have to investigate if we can reduce that size (#12527).

But for this issue it means that, imo, we can just use the CUDA base image for centos.dockerfile (with whatever additions you need added to the adapters image?). Rocky should be a good 1:1 replacement for stream9, afaik?

@bdice (Contributor, Author) commented Mar 4, 2025

Great! I will work on that.

Rocky should be a good 1:1 replacement for stream9 afaik?

Yes, it should be functionally equivalent.

@assignUser (Collaborator):

Cc @majetideepak @kgpai any concerns?

@bdice (Contributor, Author) commented Mar 4, 2025

I should clarify, now that I'm looking at the numbers in your comment again: cuDF needs the devel flavor of nvidia/cuda in order to build. The image nvidia/cuda:12.8.0-devel-rockylinux9 is about 4.62 GB compressed.

Given that, would my original approach of having a separate image for CUDA development (rather than increasing the size of adapters image) be worth considering?

@assignUser (Collaborator):

Ah, hm. Well, my concern regarding the size is mostly about the time it takes CI to fetch it, but using ghcr.io in GHA is well CDN'ed.

Adding a separate build for cuDF would likely be much more expensive CI-time-wise... (as you will still build most of Velox, I assume?)

A separate build would be fine if we limit it to run only when the relevant source changes (see the sketch below), though that might not catch issues on general Velox PRs that impact cuDF stuff.
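
For example, a path-filtered GitHub Actions trigger might look like this (the paths are hypothetical):

on:
  pull_request:
    paths:
      - "velox/experimental/cudf/**"  # hypothetical cuDF backend sources
      - "scripts/*cuda*"              # CUDA-related image/build scripts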

I am mostly bringing this up because we need larger runners to reliably build Velox due to the high RAM usage, and CI cost has been a concern before; maybe @kgpai or @pedroerp can chime in?

dockerfile: scripts/ubuntu-22.04-cuda-12.8-cpp.dockerfile
environment:
NUM_THREADS: 8 # default value for NUM_THREADS
VELOX_DEPENDENCY_SOURCE: BUNDLED # Build dependencies from source
A contributor commented:

If we are building an image here, it would be better to take advantage of that and have dependencies be AUTO/SYSTEM (to not add to the build times)? (Any concerns with that, @assignUser?)
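
For illustration, the suggested change to the config above would be something like (sketch):

environment:
  NUM_THREADS: 8
  # Prefer dependencies already installed in the image over building from source.
  VELOX_DEPENDENCY_SOURCE: AUTO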

@kgpai (Contributor) commented Mar 4, 2025

I am mostly bringing this up because we need larger runners to reliably build Velox due to the high RAM usage, and CI cost has been a concern before; maybe @kgpai or @pedroerp can chime in?

Let's try this out on a trial basis and I can keep an eye on the costs / stability etc. If it goes beyond our budget relative to the signal we get, then maybe we can run it against main periodically rather than on every diff (presuming that's what we are planning to use this image for). I don't have any concerns if we are going to use it for periodic jobs.

@kgpai (Contributor) commented Mar 4, 2025

@assignUser, @bdice: Just an FYI, our largest CI costs are still Mac builds, and if this isn't going to use a dedicated GPU instance, the increase in image size should mostly be OK imo (but I will keep an eye on it to be sure).

@assignUser (Collaborator):

Or that ^^

cd build && \
source /opt/rh/gcc-toolset-12/enable && \
bash /setup-adapters.sh && \
bash /setup-centos9.sh install_cuda 12.8 \
@devavret commented Mar 10, 2025

Should installing CUDA be hidden behind an arg to this dockerfile?

A collaborator replied:

The version, maybe, but the image is for our CI where we always need CUDA :)
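
A version-only build arg could look like this (sketch, reusing the install_cuda helper from setup-centos9.sh):

ARG CUDA_VERSION=12.8
RUN bash -c "source /setup-centos9.sh && install_cuda ${CUDA_VERSION}"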

@assignUser (Collaborator) left a review comment:

Thanks! The existing code seems to work well with 12.8; the adapters build is 🟢

@assignUser added the 'ready-to-merge' label Mar 10, 2025
@Yuhta (Contributor) commented Mar 10, 2025

@bdice @assignUser Is this error relevant? https://github.com/facebookincubator/velox/actions/runs/13770933756/job/38511459315?pr=12413

cd build && \
source /opt/rh/gcc-toolset-12/enable && \
bash /setup-adapters.sh && \
bash /setup-centos9.sh install_cuda 12.8 \
A collaborator commented:

It seems that the version isn't picked up by the function?

1971.9 Error: Unable to find a match: cuda-compat- cuda-driver-devel- cuda-minimal-build- cuda-nvrtc-devel-
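
(Reconstructed from the log above: with an empty version argument the package names end in a bare dash, so the install effectively becomes the following, which matches no packages.)

dnf install -y cuda-compat- cuda-driver-devel- cuda-minimal-build- cuda-nvrtc-devel-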

@bdice (Contributor, Author) replied:

I fixed this to call install_cuda like the CI workflow does:

source scripts/setup-centos9.sh
install_cuda ${CUDA_VERSION}
