
Add CUDA processing #150

Merged
merged 36 commits into from
Jun 27, 2024
Conversation

Bennett-Petzold
Contributor

Previously, model processing was done entirely on the CPU, per OpenCV defaults. This adds the cuda feature flag, which uses a CUDA kernel port of our post-processing code. The model produces a (relatively) large output that must be reduced to a small set of successes, and each potential success is processed in complete isolation. The kernel therefore yields a significant speedup (over 2x faster post-processing), especially on systems with slow CPUs. CPU post-processing of model output is also slightly improved.
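The "complete isolation" property is what makes the kernel port worthwhile: each candidate in the model output can be tested independently. A minimal sketch of that shape (illustrative names only, not the crate's actual API):

```rust
// Each cell of the (large) model output is tested independently against a
// confidence threshold; only the few cells that pass survive. This per-cell
// independence is what maps cleanly onto one CUDA thread per cell.
#[derive(Debug, PartialEq)]
struct Detection {
    index: usize,
    score: f32,
}

fn post_process(scores: &[f32], threshold: f32) -> Vec<Detection> {
    scores
        .iter()
        .enumerate()
        .filter(|&(_, &s)| s > threshold)
        .map(|(index, &score)| Detection { index, score })
        .collect()
}
```

On the GPU, the filter step runs as one thread per element, with the small set of survivors copied back to the host.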

OpenCV uses CUDA calls not supported on the Tegra architecture, so while using an OpenCV backend did speed up code on x86 devices, it caused crashes on the Jetson. OpenCL should be explored as an alternative, since using the GPU produced significant speedups.

Benchmarking with criterion was added to prove the speedups from using CUDA.
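The real measurements come from the criterion benches in the repo; as a simplified stand-in, the comparison being made (CPU path vs. CUDA path over repeated runs) can be sketched with only the standard library. The helper name here is illustrative:

```rust
use std::time::{Duration, Instant};

// Average wall-clock time of `f` over `iters` runs. criterion does this far
// more rigorously (warm-up, outlier rejection, statistics); this only shows
// the basic idea of timing the two post-processing paths against each other.
fn time_it<F: FnMut()>(mut f: F, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}
```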

Full processing of an image through a model now takes about 600 ms on the Jetson Nano.

jimmy and others added 30 commits June 1, 2024 12:39
The removed error handling in process_net was replaced with unwraps and
explanatory comments. This results in a minor execution speedup for ONNX
models.
The CUDA F16 backend does not seem to create any meaningful difference
for our model run speeds. Using CUDA calls for min_max_loc drastically
degrades performance compared to CPU calls. Quantization in OpenCV does
not work with CUDA, so it is not particularly useful.
Includes two structs to cross FFIs, the required allocations and copies,
the thread size calculation, etc.
Includes a few extra parameters and copy fixes. Code to test for
equality is intentionally left as debug logic in this commit to preserve
for later usage in a proper test routine.
The CUDA kernel achieves an ~50% speedup on my machine compared to the
non-CUDA version.
Includes deletion of unused structs
CUDA requires a working NVCC compiler, and the CI runner doesn't have that environment. Quantize i8 is excluded for being mostly useless.
Asynchronous memory functions are part of CUDA 11, but the latest CUDA
supported on the Jetson Nano is CUDA 10. So we have to use the
synchronous versions and take a small performance hit.
128 threads per block gets almost the same performance, but creates
fewer leftover threads to be cleaned up.
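The launch-geometry choice in the last commit (128 threads per block) comes down to a ceiling division: the block count must cover every element, and whatever rounds up past the element count becomes leftover threads that exit immediately. A hypothetical helper mirroring that calculation:

```rust
// Block size chosen per the commit above: 128 threads per block performs
// about the same as larger blocks but leaves fewer idle leftover threads.
const THREADS_PER_BLOCK: usize = 128;

// Returns (block count, leftover threads) for a kernel launch over
// `n_elements` independent work items.
fn launch_dims(n_elements: usize) -> (usize, usize) {
    let blocks = (n_elements + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    let leftover = blocks * THREADS_PER_BLOCK - n_elements;
    (blocks, leftover)
}
```

Each leftover thread only runs the kernel's bounds check before returning, so minimizing them trims a small amount of wasted work.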
@Bennett-Petzold
Contributor Author

Ignore the failing build; it's because the GitHub Action has not been updated to skip the cuda flag yet. The free runners can't compile and run CUDA code.

@Bennett-Petzold Bennett-Petzold disabled auto-merge June 27, 2024 15:22
@Bennett-Petzold Bennett-Petzold merged commit 0dc3140 into main Jun 27, 2024
4 of 5 checks passed
@Bennett-Petzold Bennett-Petzold deleted the cuda_speedup branch June 27, 2024 15:23