build: Add Docker images for CUDA development #12413
base: main
Conversation
As discussed, we will likely run the build in the 'adapters' job, which is based on a CentOS image, so installing CUDA 12.8 there by default would be the way to go (it is currently installed in the workflow). If you want these images for local development that's fine, or is there a reason to prefer Ubuntu?
Thanks @assignUser. I am more familiar with Ubuntu but I think we can make that change. I'll work on that early next week.
@assignUser It seems like the adapters images don't use CUDA. I would need to change the base image to something deriving from an nvidia/cuda image. If there is a better way to do this, please let me know.
Hm, if you want to use a CUDA base image, that makes things a bit trickier. We install CUDA using this function; it currently lives in the workflow because it's so fast that I haven't had a reason to move it into the Dockerfile, but we can do that. Unless it is lacking in some way compared to a proper CUDA base image?
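For readers without the script at hand, here is a rough sketch of what an `install_cuda` helper like the one in `scripts/setup-centos9.sh` might do. The package list is inferred from the dnf error quoted later in this thread; the repo URL is NVIDIA's standard rhel9 one, and the version mangling is an assumption, not the actual script:

```bash
# Sketch of an install_cuda helper for a CentOS/RHEL 9 base image.
# Package names are taken from the dnf error later in this thread;
# the repo URL and version handling are assumptions.
install_cuda() {
  local version_dashed=${1//./-}   # dnf package suffixes use "12-8" style
  dnf config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
  dnf install -y \
    cuda-compat-${version_dashed} \
    cuda-driver-devel-${version_dashed} \
    cuda-minimal-build-${version_dashed} \
    cuda-nvrtc-devel-${version_dashed}
}
```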
There's some additional logic in the official CUDA base images. You can see some of that here: https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.8.0/rockylinux9/base/Dockerfile?ref_type=heads
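For reference, part of that additional logic is environment configuration consumed by the NVIDIA container runtime. Paraphrased here as shell exports (the linked Dockerfile uses ENV lines; the exact values should be checked against it):

```bash
# Roughly what the CUDA base image sets up beyond installing cuda-cudart.
export CUDA_VERSION=12.8.0
export PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
# Flags the NVIDIA container runtime reads to decide which GPUs and
# driver capabilities to expose inside the container.
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=compute,utility
```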
```
@@ -0,0 +1,37 @@
# Copyright (c) Facebook, Inc. and its affiliates.
#
```
I'm hitting this exception:
```
$ docker build -t cudf-velox -f scripts/ubuntu-22.04-cuda-12.8-cpp.dockerfile ./
[+] Building 2.2s (2/2) FINISHED                                   docker:default
 => [internal] load build definition from ubuntu-22.04-cuda-12.8-cpp.dockerfile  0.0s
 => => transferring dockerfile: 1.23kB                                           0.0s
 => ERROR [internal] load metadata for docker.io/nvidia/cuda:12.8.0-devel-ubuntu22.04  2.0s
------
 > [internal] load metadata for docker.io/nvidia/cuda:12.8.0-devel-ubuntu22.04:
------
ubuntu-22.04-cuda-12.8-cpp.dockerfile:18
--------------------
  16 |     ARG tz="Europe/Madrid"
  17 |
  18 | >>> FROM ${base}
  19 |
  20 |     SHELL ["/bin/bash", "-o", "pipefail", "-c"]
--------------------
ERROR: failed to solve: nvidia/cuda:12.8.0-devel-ubuntu22.04: failed to resolve source metadata for docker.io/nvidia/cuda:12.8.0-devel-ubuntu22.04: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/nvidia/cuda/manifests/sha256:54f18e2a8e1b3d03f77b9a6dc905533da46ac93a5513f10e8ba8e560db9fa5ab: 429 Too Many Requests - Server message: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
```
I know this is because Docker Hub rate-limits unauthenticated image pulls. Is there any way to avoid this error?
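Two common mitigations, offered as suggestions rather than anything from this thread. The `base` build arg matches the `FROM ${base}` line in the error output above; the nvcr.io tag is an assumption that NVIDIA mirrors the same image outside Docker Hub:

```bash
# Mitigation 1: authenticate, which raises Docker Hub's pull rate limit.
docker login

# Mitigation 2: pull the base image from NVIDIA's own registry instead
# of Docker Hub (assumes the same tag exists on nvcr.io).
docker build -t cudf-velox \
  --build-arg base=nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 \
  -f scripts/ubuntu-22.04-cuda-12.8-cpp.dockerfile ./
```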
@bdice I haven't had time to come to a final conclusion, but I looked at image sizes and the Rocky CUDA image is ~500M vs ~250M for stream9. While that is a big difference, it doesn't really matter until we work on our final image sizes xD. The centos9 image is ~2G and the adapters one is a whopping 7G; I'll have to investigate if we can reduce that size (#12527). But for this issue it means that imo we can just use the CUDA base image for centos.dockerfile (with whatever additions you need added to the adapters image?). Rocky should be a good 1:1 replacement for stream9 afaik?
Great! I will work on that.
Yes, it should be functionally equivalent.
Cc @majetideepak @kgpai, any concerns?
I should clarify, now that I'm looking at the numbers in your comment again: cuDF needs the devel image, not just the base image. Given that, would my original approach of having a separate image for CUDA development (rather than increasing the size of the adapters image) be preferable?
Ah, hm. Well, my concern regarding the size is mostly about the time it takes CI to fetch the image, but ghcr.io is well CDN'd within GitHub Actions. Adding a separate build for cuDF would likely be much more expensive in CI time (as you will still build most of Velox, I assume?). A separate build would be fine if we limit it to run only when the relevant source changes, though that might not catch issues on general Velox PRs that impact cuDF stuff. I am mostly bringing this up because we need larger runners to reliably build Velox due to the high RAM usage, and CI cost has been a concern before; maybe @kgpai or @pedroerp can chime in?
docker-compose.yml (outdated):

```yaml
dockerfile: scripts/ubuntu-22.04-cuda-12.8-cpp.dockerfile
environment:
  NUM_THREADS: 8 # default value for NUM_THREADS
  VELOX_DEPENDENCY_SOURCE: BUNDLED # Build dependencies from source
```
If we are building an image here, would it be better to take advantage of that and have dependencies be AUTO/SYSTEM (to not add to the build times)? (Any concerns with that, @assignUser?)
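A minimal sketch of what that would look like inside the container, assuming the dependency resolution reads VELOX_DEPENDENCY_SOURCE from the environment (as the compose file's use of it suggests) and that the standard make targets are used:

```bash
# AUTO prefers dependencies already installed in the image and only
# builds from source as a fallback; SYSTEM requires them preinstalled.
export VELOX_DEPENDENCY_SOURCE=AUTO
make release
```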
Let's try this out on a trial basis and I can keep an eye on the costs, stability, etc. If it goes beyond our budget relative to the signal we get, then maybe we can run it against main periodically rather than on every diff (presuming that's what we are planning to use this image for). I don't have any concerns if we are going to use it for periodic jobs.
@assignUser, @bdice: Just an FYI, our largest CI costs are still the Mac builds, and if this isn't going to use a dedicated GPU instance, the increase in image size should mostly be OK imo (but I will keep an eye on it to be sure).
Or that ^^
scripts/adapters.dockerfile (outdated):

```bash
cd build && \
source /opt/rh/gcc-toolset-12/enable && \
bash /setup-adapters.sh && \
bash /setup-centos9.sh install_cuda 12.8 \
```
Should installing CUDA be hidden behind an arg to this Dockerfile?
The version, maybe, but the image is for our CI where we always need CUDA :)
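A sketch of what parameterizing just the version could look like. It assumes the Dockerfile declares an ARG named CUDA_VERSION and forwards it to install_cuda, neither of which is in the diff above:

```bash
# Hypothetical: choose the CUDA version at image build time instead of
# hard-coding 12.8 in the RUN line (requires "ARG CUDA_VERSION=12.8"
# in the Dockerfile).
docker build -f scripts/adapters.dockerfile \
  --build-arg CUDA_VERSION=12.8 \
  -t velox-adapters .
```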
Thanks! The existing code seems to work well with 12.8; the adapters build is 🟢
It seems that the version isn't picked up by the function?

```
Error: Unable to find a match: cuda-compat- cuda-driver-devel- cuda-minimal-build- cuda-nvrtc-devel-
```
I fixed this to call install_cuda like the CI workflow does:

.github/workflows/linux-build-base.yml, lines 58-59 (at c2eb47b):

```bash
source scripts/setup-centos9.sh
install_cuda ${CUDA_VERSION}
```
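For what it's worth, a plausible reading of the earlier failure, inferred from the error message rather than from the script's source: when invoked as `bash /setup-centos9.sh install_cuda 12.8`, the version argument apparently never reached the dnf package names, leaving empty suffixes like `cuda-compat-`. Sourcing the script and calling the function directly passes the version through:

```bash
# Failing form: the version argument was evidently dropped somewhere in
# the script's argument dispatch, producing "cuda-compat-" etc.
bash /setup-centos9.sh install_cuda 12.8

# Working form, matching the CI workflow: load the functions into the
# current shell, then call install_cuda with the version as an argument.
source scripts/setup-centos9.sh
install_cuda 12.8
```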
This PR is the first step towards adding a RAPIDS cuDF backend (#12412). It adds CUDA to the adapters CI images. This container will allow us to share a development environment that has CUDA compilers and libraries along with the existing Velox container infrastructure.
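For local development, usage of the resulting image might look like the following. The image name and tag are assumptions, since the PR text doesn't name the published image:

```bash
# Start a shell in the CUDA-enabled adapters image with GPUs exposed;
# requires the NVIDIA Container Toolkit on the host. The image name and
# tag here are assumed rather than taken from the PR.
docker run --rm -it --gpus all \
  ghcr.io/facebookincubator/velox-dev:adapters /bin/bash
```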