Add CUDA processing #150
Merged
Conversation
This reverts commit d92084f.
The removed error handling in process_net was replaced with unwraps and explanatory comments. This results in a minor execution speedup for ONNX models.
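A minimal sketch of the kind of change described above: a recoverable error path replaced by an `unwrap` plus a comment justifying it. The function and names here are illustrative stand-ins, not the PR's actual `process_net` code.

```rust
// Stand-in for an OpenCV call that returns a Result.
fn layer_names() -> Result<Vec<String>, String> {
    Ok(vec!["output".to_string()])
}

fn process_net() -> Vec<String> {
    // Before: the error was bubbled up with `?` and mapped into a
    // crate error type. After: the model is validated once at load
    // time, so this call cannot fail here; unwrap avoids the Result
    // plumbing on the hot path.
    layer_names().unwrap()
}
```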
The CUDA F16 backend does not seem to create any meaningful difference for our model run speeds. Using CUDA calls for min_max_loc drastically degrades performance compared to CPU calls. Quantization in OpenCV does not work with CUDA, so it is not particularly useful.
Includes two structs to cross the FFI boundary, the required allocations and copies, the thread size calculation, etc.
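An illustrative sketch (not the PR's actual definitions) of the two pieces this commit describes: a `#[repr(C)]` struct whose layout is stable across the Rust/CUDA FFI boundary, and the usual round-up grid-size calculation so every element of the model output gets a thread. All names here are hypothetical.

```rust
/// Layout-stable input descriptor shared with the CUDA side.
/// `#[repr(C)]` guarantees field order and padding match the C view.
#[repr(C)]
pub struct DetectionInput {
    pub data: *const f32, // device pointer to the model output buffer
    pub len: usize,       // number of candidate detections
    pub threshold: f32,   // confidence cutoff applied in the kernel
}

/// Blocks needed to cover `len` elements with `block_size` threads
/// each, rounding up so no element is left without a thread.
pub fn grid_size(len: usize, block_size: usize) -> usize {
    (len + block_size - 1) / block_size
}
```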
Includes a few extra parameters and copy fixes. Code to test for equality is intentionally left as debug logic in this commit to preserve for later usage in a proper test routine.
The CUDA kernel achieves an ~50% speedup on my machine compared to the non-CUDA version.
Includes deletion of unused structs
This reverts commit ea8b36f.
CUDA requires a working NVCC compiler, and the CI runner's environment doesn't have one. Quantize i8 is excluded for being mostly useless.
Asynchronous memory functions are part of CUDA 11, but the latest CUDA supported on the Jetson Nano is CUDA 10. So we have to use the synchronous versions and take a small performance hit.
128 threads per block gets almost the same performance, but creates fewer leftover threads to be cleaned up.
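The tradeoff above can be quantified: with a round-up launch, the last block carries idle "leftover" threads, and smaller blocks waste fewer threads in the worst case. A small helper (illustrative, not from the PR) makes the comparison concrete.

```rust
/// Idle threads in the final block of a round-up launch covering
/// `n` elements with `block_size` threads per block.
fn leftover_threads(n: usize, block_size: usize) -> usize {
    let blocks = (n + block_size - 1) / block_size;
    blocks * block_size - n
}
```

For example, covering 257 elements leaves 127 idle threads with 128-thread blocks but 255 with 256-thread blocks; the worst case is always `block_size - 1`.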
No significant performance difference on my laptop.
Ignore the failing build; it's because the GitHub action has not been updated to skip the CUDA flag yet. The free runners can't compile and run CUDA code.
Previously, model processing was done entirely on the CPU, as per OpenCV defaults. This adds the feature flag `cuda`, which uses a CUDA kernel port of our post-processing code. The model produces a (relatively) large output to process into a small set of successes, and each potential success processes in complete isolation. The kernel therefore demonstrates a significant speedup (over 2x post-processing speed), especially on systems with slow CPUs. CPU post-processing of model output is also slightly improved.

OpenCV uses CUDA calls not supported on the Tegra architecture, so while using an OpenCV backend did speed up code on x86 devices, it causes crashes on the Jetson. OpenCL should be explored as an alternative, since using the GPU produced significant speedups.
Benchmarking with criterion was added to prove the speedups from using CUDA.
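The PR uses criterion for its benchmarks; as a minimal std-only analogue of the comparison it makes, the sketch below times a hypothetical CPU post-processing pass over a model-sized output buffer. The function and numbers are illustrative stand-ins, not the PR's actual benchmark.

```rust
use std::time::Instant;

// Stand-in for post-processing work: filter a large score buffer
// down to a small set of successes. The real benchmark compares
// this CPU path against the CUDA kernel path.
fn postprocess_cpu(scores: &[f32], threshold: f32) -> usize {
    scores.iter().filter(|&&s| s > threshold).count()
}

fn main() {
    // Synthetic "model output": one million scores cycling 0.00..0.99.
    let scores: Vec<f32> = (0..1_000_000).map(|i| (i % 100) as f32 / 100.0).collect();
    let start = Instant::now();
    let hits = postprocess_cpu(&scores, 0.9);
    println!("{} hits in {:?}", hits, start.elapsed());
}
```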
Full processing of an image through a model now takes about 600 ms on the Jetson Nano.