-
Notifications
You must be signed in to change notification settings - Fork 703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(autoware_cuda_pointcloud_preprocessor): a cuda-accelerated pointcloud preprocessor #9454
feat(autoware_cuda_pointcloud_preprocessor): a cuda-accelerated pointcloud preprocessor #9454
Conversation
…sonal repository Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Thank you for contributing to the Autoware project! 🚧 If your pull request is in progress, switch it to draft mode. Please ensure:
|
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
…pointcloud changes after the first iteration Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
…ntcloud_preprocessing
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the amazing PR, these performance improvements are desperately needed.
I haven't checked the PR for functionality yet, but I'll leave my first round of comments here.
The main points I'd like to address are
- memory safety and idiomatic C++ (there is currently a lot of raw-pointer code which should be avoided whenever possible)
- modulatiry: currently the pipeline is hard-coded and all in one place. This makes the module hard to adapt to different projects, and hard to maintain individual modules in the pipeline.
Thank you for your time!
sensing/autoware_cuda_pointcloud_preprocessor/config/cuda_pointcloud_preprocessor.param.yaml
Outdated
Show resolved
Hide resolved
sensing/autoware_cuda_pointcloud_preprocessor/docs/cuda-pointcloud-preprocessor.md
Outdated
Show resolved
Hide resolved
sensing/autoware_cuda_pointcloud_preprocessor/docs/cuda-pointcloud-preprocessor.md
Outdated
Show resolved
Hide resolved
...reprocessor/src/cuda_organized_pointcloud_adapter/cuda_organized_pointcloud_adapter_node.cpp
Outdated
Show resolved
Hide resolved
...reprocessor/src/cuda_organized_pointcloud_adapter/cuda_organized_pointcloud_adapter_node.cpp
Outdated
Show resolved
Hide resolved
...reprocessor/src/cuda_organized_pointcloud_adapter/cuda_organized_pointcloud_adapter_node.cpp
Outdated
Show resolved
Hide resolved
...reprocessor/src/cuda_organized_pointcloud_adapter/cuda_organized_pointcloud_adapter_node.cpp
Outdated
Show resolved
Hide resolved
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@knzo25
Thank you very much for proposing a fantastic PR, and I'm sorry for taking a long time for the review. From a viewpoint of CUDA usage, I left some comments. I'd appreciate it if you could consider them.
} | ||
|
||
__global__ void transformPointsKernel( | ||
const InputPointType * input_points, InputPointType * output_points, int num_points, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const InputPointType * input_points, InputPointType * output_points, int num_points, | |
const InputPointType * __restrict__ input_points, InputPointType * output_points, int num_points, |
To enable "read-only data cache", I would suggest using __restrict__
for read-only input array. This suggestion can be applied to the other kernel arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@knzo25 could you please double check all input arrays across kernels? I think for some of them __restrict__
keyword might be also applicable. I don't know if you missed it or skipped on purpose.
I unresolved this conversation, please resolve it again after reading this comment 🙏🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amadeuszsz
Ahh now I can reply here. Don't know why but before I could not
Answer: the extract kernel could indeed use restrict. The kernel alone, strictly speaking can not, but due to how the indexes are computed there is no problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 68d1e42
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Outdated
Show resolved
Hide resolved
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Outdated
Show resolved
Hide resolved
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Outdated
Show resolved
Hide resolved
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Outdated
Show resolved
Hide resolved
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Outdated
Show resolved
Hide resolved
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Outdated
Show resolved
Hide resolved
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Outdated
Show resolved
Hide resolved
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Outdated
Show resolved
Hide resolved
...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu
Show resolved
Hide resolved
…loud-preprocessor.md Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com>
…oud_preprocessor/cuda_pointcloud_preprocessor.cu Co-authored-by: Manato Hirabayashi <3022416+manato@users.noreply.github.com>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Could not answer this one, but the extract kernel could indeed use restrict. The kernel alone, strictly speaking can not, but due to how the indexes are computed there is no problem |
@amadeuszsz |
Sorry, I only experienced it once in several experiments, and the pointcloud itself at the time was valid (checked with ros2 topic echo) |
sensing/autoware_cuda_pointcloud_preprocessor/config/cuda_pointcloud_preprocessor.param.yaml
Outdated
Show resolved
Hide resolved
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #9454 +/- ##
==========================================
+ Coverage 26.05% 26.23% +0.18%
==========================================
Files 1374 1387 +13
Lines 106351 107189 +838
Branches 40877 41227 +350
==========================================
+ Hits 27709 28124 +415
- Misses 75940 75996 +56
- Partials 2702 3069 +367
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
…cloud preprocessor (autowarefoundation#9454) * feat: moved the cuda pointcloud preprocessor and organized from a personal repository Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: fixed incorrect links Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: fixed dead links pt2 Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: fixed spelling errors Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: json schema fixes Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: removed comments and filled the fields Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * fix: fixed the adapter for the case when the number of points in the pointcloud changes after the first iteration Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * feat: used the cuda host allocators for aster host to device copies Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * Update sensing/autoware_cuda_pointcloud_preprocessor/docs/cuda-pointcloud-preprocessor.md Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com> * Update sensing/autoware_cuda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu Co-authored-by: Manato Hirabayashi <3022416+manato@users.noreply.github.com> * Update sensing/autoware_cuda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu Co-authored-by: Manato Hirabayashi <3022416+manato@users.noreply.github.com> * style(pre-commit): autofix * Update sensing/autoware_cuda_pointcloud_preprocessor/docs/cuda-pointcloud-preprocessor.md Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com> * Update sensing/autoware_cuda_pointcloud_preprocessor/README.md Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com> * Update sensing/autoware_cuda_pointcloud_preprocessor/README.md Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com> * Update sensing/autoware_cuda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com> * style(pre-commit): autofix * Update sensing/autoware_cuda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu Co-authored-by: Manato Hirabayashi <3022416+manato@users.noreply.github.com> * style(pre-commit): autofix * Update sensing/autoware_cuda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu Co-authored-by: Manato Hirabayashi <3022416+manato@users.noreply.github.com> * style(pre-commit): autofix * chore: fixed code compilation to reflect Hirabayashi-san's memory pool proposal Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * feat: generalized the number of crop boxes. For two at least, the new approach is actually faster Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: updated config, schema, and handled the null case in a specialized way Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * feat: moving the pointcloud organization into gpu Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * feat: reimplemented the organized pointcloud adapter in cuda. the only bottleneck is the H->D copy Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: removed redundant ternay operator Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: added a temporary memory check. the check will be unified in a later PR Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: refactored the structure to avoid large files Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: updated the copyright year Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * fix: fixed a bug in the undistortion kernel setup. validated it comparing it with the baseline Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: removed unused packages Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: removed mentions of the removed adapter Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: fixed missing autoware prefix Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * fix: missing assignment in else branch Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: added cuda/nvcc debug flags on debug builds Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: refactored parameters for the undistortion settings Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: removed unused headers Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: changed default crop box to no filtering at all Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * feat: added missing restrict keyword Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: spells Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: removed default destructor Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: ocd activated (spelling) Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: fixed the schema Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: improved readibility Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: added dummy crop box Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: added new repositories to ansible Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: CI/CD Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: more CI/CD Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: mode CI/CD. some linters are conflicting Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * style(pre-commit): autofix * chore: ignoring uncrustify Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: ignoring more uncrustify Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: missed one more uncrustify exception Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> * chore: added meta dep Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> --------- Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp> Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com> Co-authored-by: Manato Hirabayashi <3022416+manato@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Amadeusz Szymko <amadeusz.szymko.2@tier4.jp>
Description
This PR is part of a series of PRs that aim to accelerate the Sensing/Perception pipeline through an appropriate use of CUDA.
List of PRs:
To use these branches, the following additions to the
autoware.repos
are necessary:Depending on your machine and how many nodes are in a container, the following branch may also be required:
https://github.com/knzo25/launch_ros/tree/fix/load_composable_node
There seems to be a but in ROS where if you send too many services at once some will be lost and
ros_launch
can not handle that.Related links
Parent Issue:
How was this PR tested?
The sensing/perception pipeline was tested until centerpoint for TIER IV's taxi using the logging simulator.
The following tests were executed in a laptop equipped with a RTX 4060 (laptop) GPU and a Intel(R) Core(TM) Ultra 7 165H (22 cores)
(2.03 are H->D copies)
10.26 speedup!
Notes for reviewers
The main branch that I used for development is
feat/cuda_acceleration_and_transport_layer
.However, the changes were too big so I split the PRs. That being said, development, if any will still be on that branch (and then cherrypicked to the respective PRs), and the review changes will be cherrypicked into the development branch.
Interface changes
An additional topic is added to perform type negotiation:
Example:
input/pointcloud
->input/pointcloud
andinput/pointcloud/cuda
Effects on system behavior
Enabling this preprocessing in the launchers should provide a much reduced latency and cpu usage (at the cost of a higher GPU usage)