Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(autoware_cuda_pointcloud_preprocessor): a cuda-accelerated pointcloud preprocessor #9454

Conversation

knzo25
Copy link
Contributor

@knzo25 knzo25 commented Nov 25, 2024

Description

This PR is part of a series of PRs that aim to accelerate the Sensing/Perception pipeline through an appropriate use of CUDA.

List of PRs:

To use these branches, the following additions to the autoware.repos are necessary:

  vendor/cuda_blackboard:
    type: git
    url: git@github.com:knzo25/cuda_blackboard.git
    version: main
  vendor/negotiated:
    type: git
    url: https://github.com/osrf/negotiated.git
    version: master

Depending on your machine and how many nodes are in a container, the following branch may also be required:
https://github.com/knzo25/launch_ros/tree/fix/load_composable_node
There seems to be a but in ROS where if you send too many services at once some will be lost and ros_launch can not handle that.

Related links

Parent Issue:

  • Link

How was this PR tested?

The sensing/perception pipeline was tested until centerpoint for TIER IV's taxi using the logging simulator.
The following tests were executed in a laptop equipped with a RTX 4060 (laptop) GPU and a Intel(R) Core(TM) Ultra 7 165H (22 cores)

Node / processing time [ms] Current PR
/sensing/lidar/top/crop_box_filter_self/debug/processing_time_ms 5.81 N/A
/sensing/lidar/top/crop_box_filter_mirror/debug/processing_time_ms 4.59 N/A
/sensing/lidar/top/distortion_corrector/debug/processing_time_ms 10.96 N/A
/sensing/lidar/top/ring_outlier_filter/debug/processing_time_ms 10.69 N/A
/sensing/lidar/top/cuda_pointcloud_preprocessor/debug/processing_time_ms N/A 3.08
(2.03 are H->D copies)
/sensing/lidar/concatenate_data_synchronizer/debug/processing_time_ms 7.83 0.70
Total 38.8 3.78

10.26 speedup!

Notes for reviewers

The main branch that I used for development is feat/cuda_acceleration_and_transport_layer.
However, the changes were too big so I split the PRs. That being said, development, if any will still be on that branch (and then cherrypicked to the respective PRs), and the review changes will be cherrypicked into the development branch.

Interface changes

An additional topic is added to perform type negotiation:
Example: input/pointcloud -> input/pointcloud and input/pointcloud/cuda

Effects on system behavior

Enabling this preprocessing in the launchers should provide a much reduced latency and cpu usage (at the cost of a higher GPU usage)

…sonal repository

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
@github-actions github-actions bot added type:documentation Creating or refining documentation. (auto-assigned) component:sensing Data acquisition from sensors, drivers, preprocessing. (auto-assigned) tag:require-cuda-build-and-test labels Nov 25, 2024
Copy link

github-actions bot commented Nov 25, 2024

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

@knzo25 knzo25 self-assigned this Nov 25, 2024
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
…pointcloud changes after the first iteration

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Copy link
Contributor

@mojomex mojomex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the amazing PR, these performance improvements are desperately needed.

I haven't checked the PR for functionality yet, but I'll leave my first round of comments here.

The main points I'd like to address are

  • memory safety and idiomatic C++ (there is currently a lot of raw-pointer code which should be avoided whenever possible)
  • modulatiry: currently the pipeline is hard-coded and all in one place. This makes the module hard to adapt to different projects, and hard to maintain individual modules in the pipeline.

Thank you for your time!

Copy link
Contributor

@manato manato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@knzo25
Thank you very much for proposing a fantastic PR, and I'm sorry for taking a long time for the review. From a viewpoint of CUDA usage, I left some comments. I'd appreciate it if you could consider them.

}

__global__ void transformPointsKernel(
const InputPointType * input_points, InputPointType * output_points, int num_points,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const InputPointType * input_points, InputPointType * output_points, int num_points,
const InputPointType * __restrict__ input_points, InputPointType * output_points, int num_points,

To enable "read-only data cache", I would suggest using __restrict__ for read-only input array. This suggestion can be applied to the other kernel arguments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@knzo25 could you please double check all input arrays across kernels? I think for some of them __restrict__ keyword might be also applicable. I don't know if you missed it or skipped on purpose.
I unresolved this conversation, please resolve it again after reading this comment 🙏🏻

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amadeuszsz
Ahh now I can reply here. Don't know why but before I could not

Answer: the extract kernel could indeed use restrict. The kernel alone, strictly speaking can not, but due to how the indexes are computed there is no problem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 68d1e42

knzo25 and others added 2 commits January 10, 2025 18:29
…loud-preprocessor.md

Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com>
…oud_preprocessor/cuda_pointcloud_preprocessor.cu

Co-authored-by: Manato Hirabayashi <3022416+manato@users.noreply.github.com>
knzo25 added 5 commits March 6, 2025 16:03
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
@knzo25
Copy link
Contributor Author

knzo25 commented Mar 6, 2025

@amadeuszsz

@knzo25 could you please double check all input arrays across kernels? I think for some of them restrict keyword might be also applicable. I don't know if you missed it or skipped on purpose.
I unresolved this conversation, please resolve it again after reading this comment 🙏🏻

Could not answer this one, but the extract kernel could indeed use restrict. The kernel alone, strictly speaking can not, but due to how the indexes are computed there is no problem

@knzo25
Copy link
Contributor Author

knzo25 commented Mar 6, 2025

@amadeuszsz
Regarding CI/CD, I have not executed in ob purpose since the blackboard is not yet added to autoware.
I will investigate the loop error tomorrow

@knzo25
Copy link
Contributor Author

knzo25 commented Mar 10, 2025

terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
Aborted (core dumped)
It happens when timer is restarted (--loop) and only if output pointcloud from autoware_cuda_pointcloud_preprocessor is visualized. Could you please check if it happens to you too? If so, I would appreciate if you can investigate this issue 🙏🏻

Sorry, I only experienced it once in several experiments, and the pointcloud itself at the time was valid (checked with ros2 topic echo)

@knzo25 knzo25 requested review from mojomex and amadeuszsz March 10, 2025 07:33
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Copy link
Contributor

@amadeuszsz amadeuszsz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
@knzo25 knzo25 added the run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) label Mar 18, 2025
knzo25 and others added 7 commits March 18, 2025 16:22
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Copy link

codecov bot commented Mar 18, 2025

Codecov Report

Attention: Patch coverage is 0% with 469 lines in your changes missing coverage. Please review.

Project coverage is 26.23%. Comparing base (7686e5a) to head (2a1ff8c).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...cloud_preprocessor/cuda_pointcloud_preprocessor.cu 0.00% 165 Missing ⚠️
...preprocessor/cuda_pointcloud_preprocessor_node.cpp 0.00% 124 Missing ⚠️
.../cuda_pointcloud_preprocessor/undistort_kernels.cu 0.00% 89 Missing ⚠️
...e/autoware/cuda_pointcloud_preprocessor/memory.hpp 0.00% 43 Missing ⚠️
...src/cuda_pointcloud_preprocessor/common_kernels.cu 0.00% 24 Missing ⚠️
...c/cuda_pointcloud_preprocessor/organize_kernels.cu 0.00% 17 Missing ⚠️
...rc/cuda_pointcloud_preprocessor/outlier_kernels.cu 0.00% 6 Missing ⚠️
...loud_preprocessor/cuda_pointcloud_preprocessor.hpp 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9454      +/-   ##
==========================================
+ Coverage   26.05%   26.23%   +0.18%     
==========================================
  Files        1374     1387      +13     
  Lines      106351   107189     +838     
  Branches    40877    41227     +350     
==========================================
+ Hits        27709    28124     +415     
- Misses      75940    75996      +56     
- Partials     2702     3069     +367     
Flag Coverage Δ *Carryforward flag
differential-cuda 0.00% <0.00%> (?)
total 26.38% <ø> (+0.33%) ⬆️ Carriedforward from 368009e

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@knzo25 knzo25 dismissed mojomex’s stale review March 18, 2025 10:33

Changes had already been addressed

@knzo25 knzo25 merged commit 660ae1a into autowarefoundation:main Mar 18, 2025
35 of 37 checks passed
tier4-autoware-private-bot bot pushed a commit to tier4/autoware_universe that referenced this pull request Mar 18, 2025
…cloud preprocessor (autowarefoundation#9454)

* feat: moved the cuda pointcloud preprocessor and organized from a personal repository

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: fixed incorrect links

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: fixed dead links pt2

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: fixed spelling errors

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: json schema fixes

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: removed comments and filled the fields

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* fix: fixed the adapter for the case when the number of points in the pointcloud changes after the first iteration

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* feat: used the cuda host allocators for aster host to device copies

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* Update sensing/autoware_cuda_pointcloud_preprocessor/docs/cuda-pointcloud-preprocessor.md

Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com>

* Update sensing/autoware_cuda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu

Co-authored-by: Manato Hirabayashi <3022416+manato@users.noreply.github.com>

* Update sensing/autoware_cuda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu

Co-authored-by: Manato Hirabayashi <3022416+manato@users.noreply.github.com>

* style(pre-commit): autofix

* Update sensing/autoware_cuda_pointcloud_preprocessor/docs/cuda-pointcloud-preprocessor.md

Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com>

* Update sensing/autoware_cuda_pointcloud_preprocessor/README.md

Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com>

* Update sensing/autoware_cuda_pointcloud_preprocessor/README.md

Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com>

* Update sensing/autoware_cuda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu

Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com>

* style(pre-commit): autofix

* Update sensing/autoware_cuda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu

Co-authored-by: Manato Hirabayashi <3022416+manato@users.noreply.github.com>

* style(pre-commit): autofix

* Update sensing/autoware_cuda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu

Co-authored-by: Manato Hirabayashi <3022416+manato@users.noreply.github.com>

* style(pre-commit): autofix

* chore: fixed code compilation to reflect Hirabayashi-san's  memory pool proposal

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* feat: generalized the number of crop boxes. For two at least, the new approach is actually faster

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: updated config, schema, and handled the null case in a specialized way

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* feat: moving the pointcloud organization into gpu

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* feat: reimplemented the organized pointcloud adapter in cuda. the only bottleneck is the H->D copy

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: removed redundant ternay operator

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: added a temporary memory check. the check will be unified in a later PR

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: refactored the structure to avoid large files

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: updated the copyright year

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* fix: fixed a bug in the undistortion kernel setup. validated it comparing it with the baseline

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: removed unused packages

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: removed mentions of the removed adapter

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: fixed missing autoware prefix

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* fix: missing assignment in else branch

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: added cuda/nvcc debug flags on debug builds

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: refactored parameters for the undistortion settings

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: removed unused headers

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: changed default crop box to no filtering at all

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* feat: added missing restrict keyword

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: spells

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: removed default destructor

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: ocd activated (spelling)

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: fixed the schema

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: improved readibility

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: added dummy crop box

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: added new repositories to ansible

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: CI/CD

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: more CI/CD

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: mode CI/CD. some linters are conflicting

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* style(pre-commit): autofix

* chore: ignoring uncrustify

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: ignoring more uncrustify

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: missed one more uncrustify exception

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

* chore: added meta dep

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

---------

Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>
Co-authored-by: Max Schmeller <6088931+mojomex@users.noreply.github.com>
Co-authored-by: Manato Hirabayashi <3022416+manato@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Amadeusz Szymko <amadeusz.szymko.2@tier4.jp>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:sensing Data acquisition from sensors, drivers, preprocessing. (auto-assigned) run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) tag:require-cuda-build-and-test type:documentation Creating or refining documentation. (auto-assigned)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants