
Add CUDA processing #150

Merged
merged 36 commits into from
Jun 27, 2024
Conversation

Bennett-Petzold
Contributor

Previously, model processing was done entirely on the CPU, per OpenCV defaults. This adds the cuda feature flag, which uses a CUDA kernel port of our post-processing code. The model produces a (relatively) large output that must be reduced to a small set of successes, and each potential success is processed in complete isolation. The kernel therefore yields a significant speedup (over 2x faster post-processing), especially on systems with slow CPUs. CPU post-processing of model output is also slightly improved.
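The "complete isolation" property is what makes the kernel port worthwhile: each candidate in the model output can be tested independently. A minimal sketch of that shape (illustrative names only, not the crate's actual API):

```rust
// Each cell of the (large) model output is tested independently against a
// confidence threshold; only the few cells that pass survive. This per-cell
// independence is what maps cleanly onto one CUDA thread per cell.
#[derive(Debug, PartialEq)]
struct Detection {
    index: usize,
    score: f32,
}

fn post_process(scores: &[f32], threshold: f32) -> Vec<Detection> {
    scores
        .iter()
        .enumerate()
        .filter(|&(_, &s)| s > threshold)
        .map(|(index, &score)| Detection { index, score })
        .collect()
}
```

On the GPU, the filter step runs as one thread per element, with the small set of survivors copied back to the host.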

OpenCV uses CUDA calls not supported on the Tegra architecture, so while using an OpenCV backend did speed up code on x86 devices, it caused crashes on the Jetson. OpenCL should be explored as an alternative, since using the GPU produced significant speedups.

Benchmarking with criterion was added to prove the speedups from using CUDA.
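The real measurements come from the criterion benches in the repo; as a simplified stand-in, the comparison being made (CPU path vs. CUDA path over repeated runs) can be sketched with only the standard library. The helper name here is illustrative:

```rust
use std::time::{Duration, Instant};

// Average wall-clock time of `f` over `iters` runs. criterion does this far
// more rigorously (warm-up, outlier rejection, statistics); this only shows
// the basic idea of timing the two post-processing paths against each other.
fn time_it<F: FnMut()>(mut f: F, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}
```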

Full processing of an image through a model now takes about 600 ms on the Jetson Nano.

jimmy and others added 30 commits June 1, 2024 12:39
The removed error handling in process_net was replaced with unwraps and
explanatory comments. This results in a minor execution speedup for ONNX
models.
The CUDA F16 backend does not seem to create any meaningful difference
for our model run speeds. Using CUDA calls for min_max_loc drastically
degrades performance compared to CPU calls. Quantization in OpenCV does
not work with CUDA, so it is not particularly useful.
Includes two structs to cross FFIs, the required allocations and copies,
the thread size calculation, etc.
Includes a few extra parameters and copy fixes. Code to test for
equality is intentionally left as debug logic in this commit to preserve
for later usage in a proper test routine.
The CUDA kernel achieves an ~50% speedup on my machine compared to the
non-CUDA version.
Includes deletion of unused structs
CUDA requires a working NVCC compiler, and the CI runner doesn't have that environment. Quantize i8 is excluded for being mostly useless.
Asynchronous memory functions are part of CUDA 11, but the latest CUDA
supported on the Jetson Nano is CUDA 10. So we have to use the
synchronous versions and take a small performance hit.
128 threads per block gets almost the same performance, but creates
fewer leftover threads to be cleaned up.
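The launch-geometry choice in the last commit (128 threads per block) comes down to a ceiling division: the block count must cover every element, and whatever rounds up past the element count becomes leftover threads that exit immediately. A hypothetical helper mirroring that calculation:

```rust
// Block size chosen per the commit above: 128 threads per block performs
// about the same as larger blocks but leaves fewer idle leftover threads.
const THREADS_PER_BLOCK: usize = 128;

// Returns (block count, leftover threads) for a kernel launch over
// `n_elements` independent work items.
fn launch_dims(n_elements: usize) -> (usize, usize) {
    let blocks = (n_elements + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    let leftover = blocks * THREADS_PER_BLOCK - n_elements;
    (blocks, leftover)
}
```

Each leftover thread only runs the kernel's bounds check before returning, so minimizing them trims a small amount of wasted work.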
@Bennett-Petzold
Contributor Author

Ignore the failing build; it's because the GitHub Action has not been updated to skip the cuda flag yet. The free runners can't compile and run CUDA code.

@Bennett-Petzold Bennett-Petzold disabled auto-merge June 27, 2024 15:22
@Bennett-Petzold Bennett-Petzold merged commit 0dc3140 into main Jun 27, 2024
4 of 5 checks passed
@Bennett-Petzold Bennett-Petzold deleted the cuda_speedup branch June 27, 2024 15:23