
build: Add Docker images for CUDA development #12413

Open · wants to merge 12 commits into main
Conversation

@bdice (Contributor) commented Feb 20, 2025

This PR is the first step towards adding a RAPIDS cuDF backend (#12412). It adds CUDA to the adapters CI images. This container will allow us to share a development environment that has CUDA compilers and libraries along with the existing Velox container infrastructure.

@facebook-github-bot added the 'CLA Signed' label Feb 20, 2025
netlify bot commented Feb 20, 2025

Deploy Preview for meta-velox canceled.
Latest commit: a2fc5b3
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/67d04eb6a27abb0008c46564

@bdice bdice changed the title Add Docker images for CUDA development build: Add Docker images for CUDA development Feb 20, 2025
@assignUser (Collaborator):

As discussed, we will likely run the build in the 'adapters' job, which is based on a CentOS image, so installing CUDA 12.8 there by default would be the way to go (it is currently installed in the workflow). If you want these Ubuntu images for local development that's fine, but is there a reason to prefer Ubuntu?

@bdice (Contributor, Author) commented Feb 22, 2025

Thanks @assignUser. I am more familiar with Ubuntu but I think we can make that change. I’ll work on that early next week.

@bdice (Contributor, Author) commented Mar 3, 2025

@assignUser It seems like the adapters images don't use CUDA. I would need to change the base image to something deriving from nvidia/cuda (probably the base flavor, and install our own CUDA Toolkit on top) for the best driver compatibility. Currently this derives from ghcr.io/facebookincubator/velox-dev:centos9. I guess I would need to change adapters.dockerfile to do the same setup from that centos container on top of an nvidia/cuda RHEL-compatible image, and then add the rest of the adapters. Is that still the desired path here?

If there is a better way to do this, please let me know; my current plan (sketched after the list) is:

  1. Edit adapters.dockerfile to start from nvidia/cuda:12.8.0-base-rockylinux9
  2. Copy-paste centos.dockerfile contents into adapters.dockerfile (I can't start from that velox-dev image, since I want a CUDA-friendly base)
  3. Add commands to install CUDA (and build/install cuDF into the image?)
  4. Run the setup-adapters.sh scripts to install everything else
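
A rough Dockerfile sketch of steps 1-4 (file names and invocation details are illustrative assumptions; the discussion below refines them):

# Step 1: CUDA-friendly base instead of ghcr.io/facebookincubator/velox-dev:centos9.
FROM nvidia/cuda:12.8.0-base-rockylinux9

# Step 2: replicate the centos.dockerfile setup on top of the CUDA base.
COPY scripts/setup-centos9.sh /setup-centos9.sh
RUN bash /setup-centos9.sh

# Step 3: install the CUDA Toolkit (a cuDF build/install could follow here).
RUN bash -c 'source /setup-centos9.sh && install_cuda 12.8'

# Step 4: install the remaining adapter dependencies.
COPY scripts/setup-adapters.sh /setup-adapters.sh
RUN bash /setup-adapters.sh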

@assignUser (Collaborator):

Hm, if you want to use a CUDA base image, that makes things a bit trickier. We install CUDA using this function, currently in the workflow; it's so fast that I haven't had a reason to move it into the Dockerfile, but we can do that. Unless it is lacking in some way compared to a proper CUDA base image?
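
For readers without the repo at hand, a hedged sketch of what such an install helper might look like (the real function lives in scripts/setup-centos9.sh; the repo URL and package set below are assumptions, partly reconstructed from the error output quoted later in this thread):

function install_cuda {
  local version_dashed=${1//./-}  # e.g. 12.8 -> 12-8
  # Add NVIDIA's package repository (URL assumed for a RHEL9-compatible distro).
  dnf config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
  # Install a minimal CUDA build environment rather than the full toolkit.
  dnf install -y \
    cuda-compat-${version_dashed} \
    cuda-driver-devel-${version_dashed} \
    cuda-minimal-build-${version_dashed} \
    cuda-nvrtc-devel-${version_dashed}
}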

@bdice (Contributor, Author) commented Mar 3, 2025

There's some additional logic in the nvidia/cuda base images that makes driver compatibility much broader than what a "bare" image would have (using cuda-compat, NVIDIA_REQUIRE_CUDA environment variables, etc.). Recreating that in a separate image is possible but I would advise using nvidia/cuda base images instead.

You can see some of that here: https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.8.0/rockylinux9/base/Dockerfile?ref_type=heads
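
Paraphrased from that Dockerfile, the relevant logic is roughly (abridged sketch, not verbatim):

# Pin the CUDA version and declare the driver constraint that the NVIDIA
# container runtime validates against the host at startup.
ENV CUDA_VERSION=12.8.0
ENV NVIDIA_REQUIRE_CUDA="cuda>=12.8"
# cuda-compat provides forward-compatibility libraries so a newer CUDA
# userspace can run on an older host driver.
RUN dnf install -y cuda-compat-12-8 && dnf clean all
ENV LD_LIBRARY_PATH=/usr/local/cuda/compat:/usr/local/cuda/lib64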

@@ -0,0 +1,37 @@
# Copyright (c) Facebook, Inc. and its affiliates.
#
@jinchengchenghh (Contributor) commented Mar 4, 2025

I hit this exception:

docker build -t cudf-velox -f scripts/ubuntu-22.04-cuda-12.8-cpp.dockerfile ./

[+] Building 2.2s (2/2) FINISHED                                                                                                                                                                docker:default
 => [internal] load build definition from ubuntu-22.04-cuda-12.8-cpp.dockerfile                                                                                                                           0.0s
 => => transferring dockerfile: 1.23kB                                                                                                                                                                    0.0s
 => ERROR [internal] load metadata for docker.io/nvidia/cuda:12.8.0-devel-ubuntu22.04                                                                                                                     2.0s
------
 > [internal] load metadata for docker.io/nvidia/cuda:12.8.0-devel-ubuntu22.04:
------
ubuntu-22.04-cuda-12.8-cpp.dockerfile:18
--------------------
  16 |     ARG tz="Europe/Madrid"
  17 |     
  18 | >>> FROM ${base}
  19 |     
  20 |     SHELL ["/bin/bash", "-o", "pipefail", "-c"]
--------------------
ERROR: failed to solve: nvidia/cuda:12.8.0-devel-ubuntu22.04: failed to resolve source metadata for docker.io/nvidia/cuda:12.8.0-devel-ubuntu22.04: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/nvidia/cuda/manifests/sha256:54f18e2a8e1b3d03f77b9a6dc905533da46ac93a5513f10e8ba8e560db9fa5ab: 429 Too Many Requests - Server message: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit

I know this is because Docker Hub throttles image pulls. Is there any way to avoid this error?
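
One common workaround (an assumption, not project guidance): authenticated pulls get a much higher Docker Hub rate limit, and pre-pulling the base image avoids repeated registry hits:

# Log in with a free Docker Hub account to get the higher per-account pull limit.
docker login
# Pull the base image once; subsequent builds reuse the local copy.
docker pull nvidia/cuda:12.8.0-devel-ubuntu22.04
docker build -t cudf-velox -f scripts/ubuntu-22.04-cuda-12.8-cpp.dockerfile ./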

@assignUser (Collaborator):

@bdice I haven't had time to come to a final conclusion, but I looked at image sizes and the Rocky CUDA image is ~500M vs ~250M for stream9. While that is a big difference, it doesn't really matter until we work on our final image sizes xD The centos9 image is ~2G and the adapters one is a whopping 7G. I'll have to investigate if we can reduce that size (#12527).

But for this issue it means that, imo, we can just use the CUDA base image for centos.dockerfile (with whatever additions you need added to the adapters image?). Rocky should be a good 1:1 replacement for stream9, afaik?

@bdice (Contributor, Author) commented Mar 4, 2025

Great! I will work on that.

Rocky should be a good 1:1 replacement for stream9 afaik?

Yes, it should be functionally equivalent.

@assignUser (Collaborator):

Cc @majetideepak @kgpai any concerns?

@bdice (Contributor, Author) commented Mar 4, 2025

I should clarify, now that I'm looking at the numbers in your comment again: cuDF needs the devel flavor of nvidia/cuda in order to build. The image nvidia/cuda:12.8.0-devel-rockylinux9 is about 4.62 GB compressed.

Given that, would my original approach of having a separate image for CUDA development (rather than increasing the size of adapters image) be worth considering?

@assignUser (Collaborator):

Ah, hm. Well, my concern regarding the size is mostly about the time it takes CI to fetch it, but using ghcr.io in GHA is well CDN'ed.

Adding a separate build for cuDF would likely be much more expensive CI-time-wise... (as you will still build most of Velox, I assume?)

A separate build would be fine if we limit it to run only when the relevant source changes (see the sketch below), though that might not catch issues on general Velox PRs that impact cuDF stuff.
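
For example, a path-filtered GitHub Actions trigger might look like this (the paths are hypothetical):

on:
  pull_request:
    paths:
      - "velox/experimental/cudf/**"  # hypothetical cuDF backend sources
      - "scripts/*cuda*"              # CUDA-related image/build scripts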

I am mostly bringing this up because we need larger runners to reliably build Velox due to the high RAM usage, and CI cost has been a concern before; maybe @kgpai or @pedroerp can chime in?

dockerfile: scripts/ubuntu-22.04-cuda-12.8-cpp.dockerfile
environment:
NUM_THREADS: 8 # default value for NUM_THREADS
VELOX_DEPENDENCY_SOURCE: BUNDLED # Build dependencies from source
A contributor commented:

If we are building an image here, it would be better to take advantage of that and have dependencies be AUTO/SYSTEM (to not add to the build times)? (Any concerns with that, @assignUser?)
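
For illustration, the suggested change to the config above would be something like (sketch):

environment:
  NUM_THREADS: 8
  # Prefer dependencies already installed in the image over building from source.
  VELOX_DEPENDENCY_SOURCE: AUTO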

@kgpai (Contributor) commented Mar 4, 2025

I am mostly bringing this up because we need larger runners to reliably build Velox due to the high RAM usage, and CI cost has been a concern before; maybe @kgpai or @pedroerp can chime in?

Let's try this out on a trial basis and I can keep an eye on the costs / stability etc. If it goes beyond our budget relative to the signal we get, then maybe we can run it against main periodically rather than on every diff (presuming that's what we are planning to use this image for). I don't have any concerns if we are going to use it for periodic jobs.

@kgpai (Contributor) commented Mar 4, 2025

@assignUser, @bdice: Just an FYI, our largest CI costs are still Mac builds, and if this isn't going to use a dedicated GPU instance, the increase in image size should mostly be OK imo (but I will keep an eye on it to be sure).

@assignUser (Collaborator):

Or that ^^

cd build && \
source /opt/rh/gcc-toolset-12/enable && \
bash /setup-adapters.sh && \
bash /setup-centos9.sh install_cuda 12.8 \
@devavret commented Mar 10, 2025

Should installing CUDA be hidden behind an arg to this dockerfile?

A collaborator replied:

The version, maybe, but the image is for our CI where we always need CUDA :)
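
A version-only build arg could look like this (sketch, reusing the install_cuda helper from setup-centos9.sh):

ARG CUDA_VERSION=12.8
RUN bash -c "source /setup-centos9.sh && install_cuda ${CUDA_VERSION}"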

@assignUser (Collaborator) left a review comment:

Thanks! The existing code seems to work well with 12.8; the adapters build is 🟢

@assignUser added the 'ready-to-merge' label Mar 10, 2025
@Yuhta (Contributor) commented Mar 10, 2025

@bdice @assignUser Is this error relevant? https://github.com/facebookincubator/velox/actions/runs/13770933756/job/38511459315?pr=12413

cd build && \
source /opt/rh/gcc-toolset-12/enable && \
bash /setup-adapters.sh && \
bash /setup-centos9.sh install_cuda 12.8 \
A collaborator commented:

It seems that the version isn't picked up by the function?

1971.9 Error: Unable to find a match: cuda-compat- cuda-driver-devel- cuda-minimal-build- cuda-nvrtc-devel-
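
(Reconstructed from the log above: with an empty version argument the package names end in a bare dash, so the install effectively becomes the following, which matches no packages.)

dnf install -y cuda-compat- cuda-driver-devel- cuda-minimal-build- cuda-nvrtc-devel-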

@bdice (Contributor, Author) replied:

I fixed this to call install_cuda like the CI workflow does:

source scripts/setup-centos9.sh
install_cuda ${CUDA_VERSION}
