Add thrust::strided_iterator as a thrust::counting_iterator with step #4014

Draft
wants to merge 2 commits into
base: main

Contributor

@bernhardmgruber bernhardmgruber commented Mar 4, 2025

Addresses a part of: #706

Fixes: #3698

@bernhardmgruber
Contributor Author

I still need to figure out why the counting_iterator suddenly has an output_iterator_tag.


template <typename Incrementable, typename Stride, Stride Value>
inline _CCCL_HOST_DEVICE auto
make_counting_iterator(Incrementable x, ::cuda::std::integral_constant<Stride, Value> stride)
Collaborator

kudos: Making it support integral constant for statically known step sizes is a nice addition!

Comment on lines 318 to 319
auto iter = thrust::make_counting_iterator(
  &arr[0].second, ::cuda::std::integral_constant<int, sizeof(std::pair<int, double>)>{});
Contributor

We should also test that it can take a runtime value
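A rough idea of what testing both stride flavours could look like, using only the standard library. This is a host-only sketch: `nth_strided` and `check_runtime_and_static_stride` are hypothetical names invented for illustration, the stride is assumed to count elements, and `std::integral_constant` stands in for `::cuda::std::integral_constant`. It does not exercise the PR's `make_counting_iterator` overloads.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical helper (not part of Thrust): returns the element reached after
// n steps of `stride` elements, starting at `first`.
// `Stride` may be a plain integer (runtime value) or a std::integral_constant
// (compile-time value); both convert to std::ptrdiff_t.
template <typename T, typename Stride>
constexpr T& nth_strided(T* first, std::ptrdiff_t n, Stride stride)
{
  return first[n * static_cast<std::ptrdiff_t>(stride)];
}

// Runs the same accesses with a runtime and a compile-time stride and checks
// that both address the same elements.
inline void check_runtime_and_static_stride()
{
  int data[] = {0, 1, 2, 3, 4, 5, 6, 7};
  const int runtime_stride = 3;
  using static_stride      = std::integral_constant<int, 3>;

  for (std::ptrdiff_t n = 0; n < 3; ++n)
  {
    assert(&nth_strided(data, n, runtime_stride) == &nth_strided(data, n, static_stride{}));
  }
  assert(nth_strided(data, 2, runtime_stride) == 6);
}
```

Running the same checks against both kinds of stride keeps the two code paths from drifting apart.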

Comment on lines 189 to 201
struct empty
{};

template <typename T>
struct value_holder
{
  T value;

  _CCCL_HOST_DEVICE auto operator()() const
  {
    return value;
  }
};
Contributor

I believe we should be able to avoid this by using the mdspan helper __de_ice available in <cuda/std/__mdspan/submdspan_helper.h>.

Contributor Author

This is exactly why I put this draft PR up. I was certain we would already have a helper somewhere. Thx!
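For readers unfamiliar with the idea, here is a minimal sketch of what an integral-constant unwrapping helper can look like. It is not the libcu++ `__de_ice` implementation: the name `de_ice`, its signatures, and the use of `std::integral_constant` are assumptions made for illustration only.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for an integral-constant unwrapping helper.
// Runtime value: pass it through unchanged.
template <typename T>
constexpr T de_ice(T value) noexcept
{
  return value;
}

// Compile-time value: read it from the type. This overload is more
// specialized, so it wins for std::integral_constant arguments.
template <typename T, T Value>
constexpr T de_ice(std::integral_constant<T, Value>) noexcept
{
  return Value;
}

static_assert(de_ice(std::integral_constant<int, 16>{}) == 16, "static stride is usable at compile time");

inline void check_de_ice()
{
  int runtime_stride = 5;
  assert(de_ice(runtime_stride) == 5);
  assert(de_ice(std::integral_constant<long, 8>{}) == 8L);
}
```

With a helper of this shape, code that consumes the stride can call one function and does not need a separate holder type with a call operator.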

@@ -53,7 +53,7 @@
THRUST_NAMESPACE_BEGIN

// forward declaration of counting_iterator
-template <typename Incrementable, typename System, typename Traversal, typename Difference>
+template <typename Incrementable, typename System, typename Traversal, typename Difference, typename Step>
Contributor

I believe we should use the empty base class optimization here and inherit from Step. We could also generalize __mdspan_ebco so that we do not store unneeded information.

Collaborator

Doesn't EBO get us in trouble with MSVC + nvcc and having different sizes in host/device code?

Contributor

Not if implemented through inheritance and not [[no_unique_address]]
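To make the inheritance-based variant concrete, here is a small host-only sketch. The types `static_stride_2`, `runtime_stride`, and `strided_state` are hypothetical and unrelated to the PR's code or to `__mdspan_ebco`; the sketch only shows that an empty stride holder used as a base class adds no storage, while a runtime holder does.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stride holders, for illustration only.
struct static_stride_2 // empty: the stride lives in the type
{
  constexpr int stride() const noexcept { return 2; }
};

struct runtime_stride // non-empty: the stride is stored as a member
{
  int value;
  constexpr int stride() const noexcept { return value; }
};

// The state inherits from the holder, so an empty holder does not need
// storage of its own (empty base optimization).
template <typename StrideHolder>
struct strided_state : StrideHolder
{
  int* ptr;
};

static_assert(std::is_empty<static_stride_2>::value, "holder must be empty for EBO to apply");
static_assert(sizeof(strided_state<static_stride_2>) == sizeof(int*), "empty holder adds no storage");
static_assert(sizeof(strided_state<runtime_stride>) > sizeof(int*), "runtime holder is stored");

inline void check_stride_holders()
{
  strided_state<runtime_stride> dynamic_state;
  dynamic_state.value = 3;
  dynamic_state.ptr   = nullptr;
  assert(dynamic_state.stride() == 3);

  strided_state<static_stride_2> static_state;
  static_state.ptr = nullptr;
  assert(static_state.stride() == 2);
}
```

The size checks here only cover a host compiler; whether host and device layouts agree under MSVC + nvcc is the question raised above and is not demonstrated by this sketch.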

@NVIDIA NVIDIA deleted a comment from copy-pr-bot bot Mar 5, 2025

copy-pr-bot bot commented Mar 5, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@bernhardmgruber bernhardmgruber changed the title Add a step to thrust::counting_iterator Add thrust::strided_iterator as a thrust::counting_iterator with step Mar 5, 2025
Comment on lines 40 to 43
template <typename Iterator, typename StrideHolder = detail::empty>
using strided_iterator =
  transform_iterator<detail::deref,
                     counting_iterator<Iterator, use_default, random_access_traversal_tag, use_default, StrideHolder>>;
Contributor Author

While this definition is easy, I wonder whether we should still define a full-blown iterator without using transform and counting iterator, just to have fewer moving parts for the optimizer to fold.

Contributor Author

Aaaaand we have to create a custom iterator, because of the proxy reference unwrapping in transform_iterator :)
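As an illustration of the "custom iterator" direction, here is a deliberately small, host-only sketch of a hand-written strided iterator over `T*` whose dereference returns a plain `T&` rather than a proxy. The class name `strided_ptr_iterator`, its interface, and the choice to count the stride in elements are all assumptions; it implements only forward-iterator operations, stores the stride as a member, and is not the iterator from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <type_traits>

// Hypothetical strided iterator over T*. The stride counts elements and may be
// a runtime integer or a std::integral_constant.
template <typename T, typename Stride = std::ptrdiff_t>
class strided_ptr_iterator
{
public:
  using iterator_category = std::forward_iterator_tag;
  using value_type        = typename std::remove_cv<T>::type;
  using difference_type   = std::ptrdiff_t;
  using pointer           = T*;
  using reference         = T&;

  strided_ptr_iterator() = default;
  strided_ptr_iterator(T* ptr, Stride stride)
      : ptr_(ptr)
      , stride_(stride)
  {}

  // Returns a real reference into the underlying storage, not a proxy.
  reference operator*() const { return *ptr_; }
  pointer operator->() const { return ptr_; }

  strided_ptr_iterator& operator++()
  {
    ptr_ += static_cast<difference_type>(stride_);
    return *this;
  }

  strided_ptr_iterator operator++(int)
  {
    strided_ptr_iterator old = *this;
    ++*this;
    return old;
  }

  friend bool operator==(const strided_ptr_iterator& a, const strided_ptr_iterator& b)
  {
    return a.ptr_ == b.ptr_;
  }

  friend bool operator!=(const strided_ptr_iterator& a, const strided_ptr_iterator& b)
  {
    return !(a == b);
  }

private:
  T* ptr_ = nullptr;
  Stride stride_{};
};

// Sums every second element of an 8-element array: 0 + 2 + 4 + 6.
inline int sum_even_positions()
{
  int data[] = {0, 1, 2, 3, 4, 5, 6, 7};
  strided_ptr_iterator<int> first(data, 2);
  strided_ptr_iterator<int> last(data + 8, 2);
  int sum = 0;
  for (auto it = first; it != last; ++it)
  {
    sum += *it;
  }
  assert(sum == 12);
  return sum;
}
```

Note that the end iterator must be reachable in whole strides from the begin iterator; here `data + 8` is exactly four steps of two elements past `data`.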

@bernhardmgruber
Contributor Author

/ok to test

Contributor

github-actions bot commented Mar 5, 2025

🟨 CI finished in 1h 53m: Pass: 90%/93 | Total: 2d 20h | Avg: 43m 58s | Max: 1h 27m | Hits: 44%/123409
  • 🟨 thrust: Pass: 91%/45 | Total: 1d 00h | Avg: 33m 17s | Max: 1h 14m | Hits: 53%/73295

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  90%/43  | Total: 23h 55m | Avg: 33m 23s | Max:  1h 14m | Hits:  54%/69718 
      🟩 arm64              Pass: 100%/2   | Total:  1h 02m | Avg: 31m 01s | Max: 32m 18s | Hits:  44%/3577  
    🔍 ctk: 12.8 🔍
      🟩 12.0               Pass: 100%/5   | Total:  3h 18m | Avg: 39m 38s | Max:  1h 07m | Hits:  44%/8936  
      🟩 12.5               Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 12m | Hits:  21%/3576  
      🔍 12.8               Pass:  89%/38  | Total: 19h 20m | Avg: 30m 32s | Max:  1h 14m | Hits:  57%/60783 
    🔍 cudacxx: nvcc12.8 🔍
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 58m 40s | Avg: 29m 20s | Max: 30m 43s | Hits:  45%/3576  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 18m | Avg: 39m 38s | Max:  1h 07m | Hits:  44%/8936  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 12m | Hits:  21%/3576  
      🔍 nvcc12.8           Pass:  88%/36  | Total: 18h 21m | Avg: 30m 36s | Max:  1h 14m | Hits:  57%/57207 
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total: 58m 40s | Avg: 29m 20s | Max: 30m 43s | Hits:  45%/3576  
      🔍 nvcc               Pass:  90%/43  | Total: 23h 59m | Avg: 33m 28s | Max:  1h 14m | Hits:  54%/69719 
    🔍 cxx_family: GCC 🔍
      🟩 Clang              Pass: 100%/17  | Total:  8h 20m | Avg: 29m 25s | Max: 35m 31s | Hits:  54%/30396 
      🔍 GCC                Pass:  80%/21  | Total:  9h 07m | Avg: 26m 03s | Max: 36m 27s | Hits:  63%/30413 
      🟩 MSVC               Pass: 100%/5   | Total:  5h 11m | Avg:  1h 02m | Max:  1h 14m | Hits:  30%/8910  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 12m | Hits:  21%/3576  
    🔍 gpu: rtx2080 🔍
      🟩 h100               Pass: 100%/2   | Total: 32m 45s | Avg: 16m 22s | Max: 20m 56s | Hits:  72%/3578  
      🔍 rtx2080            Pass:  87%/33  | Total: 20h 12m | Avg: 36m 43s | Max:  1h 12m | Hits:  45%/51844 
      🟩 rtx4090            Pass: 100%/10  | Total:  4h 12m | Avg: 25m 17s | Max:  1h 14m | Hits:  75%/17873 
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  89%/38  | Total: 23h 23m | Avg: 36m 56s | Max:  1h 14m | Hits:  45%/60781 
      🟩 TestCPU            Pass: 100%/3   | Total: 49m 35s | Avg: 16m 31s | Max: 34m 42s | Hits:  90%/5359  
      🟩 TestGPU            Pass: 100%/4   | Total: 44m 43s | Avg: 11m 10s | Max: 11m 49s | Hits:  99%/7155  
    🟨 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 09m | Avg: 32m 29s | Max: 32m 57s | Hits:  55%/7152  
      🟩 Clang15            Pass: 100%/2   | Total:  1h 06m | Avg: 33m 01s | Max: 34m 22s | Hits:  44%/3576  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 07m | Avg: 33m 31s | Max: 35m 31s | Hits:  44%/3576  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 06m | Avg: 33m 08s | Max: 34m 24s | Hits:  44%/3576  
      🟩 Clang18            Pass: 100%/7   | Total:  2h 50m | Avg: 24m 25s | Max: 33m 44s | Hits:  63%/12516 
      🟩 GCC7               Pass: 100%/2   | Total:  1h 04m | Avg: 32m 20s | Max: 32m 31s | Hits:  56%/3578  
      🟩 GCC8               Pass: 100%/1   | Total: 31m 29s | Avg: 31m 29s | Max: 31m 29s | Hits:  44%/1789  
      🟩 GCC9               Pass: 100%/2   | Total:  1h 07m | Avg: 33m 35s | Max: 33m 37s | Hits:  52%/3578  
      🟥 GCC10              Pass:   0%/2   | Total: 47m 04s | Avg: 23m 32s | Max: 23m 52s
      🟥 GCC11              Pass:   0%/2   | Total: 44m 05s | Avg: 22m 02s | Max: 22m 10s
      🟩 GCC12              Pass: 100%/2   | Total:  1h 09m | Avg: 34m 37s | Max: 36m 27s | Hits:  44%/3578  
      🟩 GCC13              Pass: 100%/10  | Total:  3h 43m | Avg: 22m 20s | Max: 33m 15s | Hits:  73%/17890 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 07m | Hits:  24%/3564  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  2h 56m | Avg: 58m 47s | Max:  1h 14m | Hits:  33%/5346  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 12m | Hits:  21%/3576  
    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 42m 08s | Avg: 21m 04s | Max: 30m 51s | Hits:  72%/3578  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 32m 45s | Avg: 16m 22s | Max: 20m 56s | Hits:  72%/3578  
      🟩 90;90a;100         Pass: 100%/1   | Total: 33m 15s | Avg: 33m 15s | Max: 33m 15s | Hits:  76%/1789  
    🟨 std
      🟨 17                 Pass:  90%/20  | Total: 13h 01m | Avg: 39m 03s | Max:  1h 12m | Hits:  43%/32173 
      🟨 20                 Pass:  91%/23  | Total: 11h 14m | Avg: 29m 19s | Max:  1h 14m | Hits:  61%/37544 
    
  • 🟨 cub: Pass: 93%/45 | Total: 1d 18h | Avg: 57m 11s | Max: 1h 27m | Hits: 30%/49960

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  93%/43  | Total:  1d 16h | Avg: 56m 49s | Max:  1h 27m | Hits:  31%/47524 
      🟩 arm64              Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 05m | Hits:  22%/2436  
    🔍 ctk: 12.8 🔍
      🟩 12.0               Pass: 100%/5   | Total:  5h 31m | Avg:  1h 06m | Max:  1h 15m | Hits:  20%/5922  
      🟩 12.5               Pass: 100%/2   | Total:  2h 26m | Avg:  1h 13m | Max:  1h 15m | Hits:  18%/2254  
      🔍 12.8               Pass:  92%/38  | Total:  1d 10h | Avg: 55m 08s | Max:  1h 27m | Hits:  32%/41784 
    🔍 cudacxx: nvcc12.8 🔍
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 03m | Hits:  21%/2104  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  5h 31m | Avg:  1h 06m | Max:  1h 15m | Hits:  20%/5922  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 26m | Avg:  1h 13m | Max:  1h 15m | Hits:  18%/2254  
      🔍 nvcc12.8           Pass:  91%/36  | Total:  1d 08h | Avg: 54m 41s | Max:  1h 27m | Hits:  33%/39680 
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 03m | Hits:  21%/2104  
      🔍 nvcc               Pass:  93%/43  | Total:  1d 16h | Avg: 56m 53s | Max:  1h 27m | Hits:  31%/47856 
    🚨 jobs: TestGPU 🚨
      🟩 Build              Pass: 100%/37  | Total:  1d 16h | Avg:  1h 04m | Max:  1h 27m | Hits:  21%/43870 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 21m 35s | Avg: 21m 35s | Max: 21m 35s | Hits:  99%/1218  
      🟩 GraphCapture       Pass: 100%/1   | Total: 16m 53s | Avg: 16m 53s | Max: 16m 53s | Hits:  99%/1218  
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 08m | Avg: 22m 50s | Max: 23m 28s | Hits:  99%/3654  
      🔥 TestGPU            Pass:   0%/3   | Total:  1h 04m | Avg: 21m 30s | Max: 23m 07s
    🔍 sm: 90 🔍
      🔍 90                 Pass:  66%/3   | Total:  1h 13m | Avg: 24m 30s | Max: 27m 15s | Hits:  60%/2436  
      🟩 90;90a;100         Pass: 100%/1   | Total:  1h 18m | Avg:  1h 18m | Max:  1h 18m | Hits:  22%/1218  
    🔍 std: 20 🔍
      🟩 17                 Pass: 100%/20  | Total: 21h 58m | Avg:  1h 05m | Max:  1h 16m | Hits:  20%/23591 
      🔍 20                 Pass:  88%/25  | Total: 20h 54m | Avg: 50m 11s | Max:  1h 27m | Hits:  39%/26369 
    🟨 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  4h 07m | Avg:  1h 01m | Max:  1h 05m | Hits:  22%/4880  
      🟩 Clang15            Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 01m | Hits:  22%/2436  
      🟩 Clang16            Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 04m | Hits:  22%/2436  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 59m | Avg: 59m 51s | Max:  1h 00m | Hits:  22%/2436  
      🟨 Clang18            Pass:  85%/7   | Total:  5h 58m | Avg: 51m 13s | Max:  1h 05m | Hits:  35%/6976  
      🟩 GCC7               Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 03m | Hits:  22%/2440  
      🟩 GCC8               Pass: 100%/1   | Total:  1h 04m | Avg:  1h 04m | Max:  1h 04m | Hits:  22%/1220  
      🟩 GCC9               Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 10m | Hits:  22%/2440  
      🟩 GCC10              Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 00m | Hits:  22%/2440  
      🟩 GCC11              Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m | Hits:  22%/2436  
      🟩 GCC12              Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 05m | Hits:  22%/2436  
      🟨 GCC13              Pass:  81%/11  | Total:  7h 11m | Avg: 39m 13s | Max:  1h 18m | Hits:  56%/10962 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 31m | Avg:  1h 15m | Max:  1h 15m | Hits:  12%/2084  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  2h 43m | Avg:  1h 21m | Max:  1h 27m | Hits:  12%/2084  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 26m | Avg:  1h 13m | Max:  1h 15m | Hits:  18%/2254  
    🟨 cxx_family
      🟨 Clang              Pass:  94%/17  | Total: 16h 12m | Avg: 57m 13s | Max:  1h 05m | Hits:  27%/19164 
      🟨 GCC                Pass:  90%/22  | Total: 18h 59m | Avg: 51m 46s | Max:  1h 18m | Hits:  37%/24374 
      🟩 MSVC               Pass: 100%/4   | Total:  5h 15m | Avg:  1h 18m | Max:  1h 27m | Hits:  12%/4168  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 26m | Avg:  1h 13m | Max:  1h 15m | Hits:  18%/2254  
    🟨 gpu
      🟨 h100               Pass:  66%/3   | Total:  1h 13m | Avg: 24m 30s | Max: 27m 15s | Hits:  60%/2436  
      🟩 rtx2080            Pass: 100%/34  | Total:  1d 13h | Avg:  1h 06m | Max:  1h 27m | Hits:  21%/40216 
      🟨 rtxa6000           Pass:  75%/8   | Total:  4h 10m | Avg: 31m 20s | Max:  1h 05m | Hits:  73%/7308  
    
  • 🟨 cccl_c_parallel: Pass: 50%/2 | Total: 11m 08s | Avg: 5m 34s | Max: 8m 39s | Hits: 95%/154

    🚨 jobs: Test 🚨
      🟩 Build              Pass: 100%/1   | Total:  2m 29s | Avg:  2m 29s | Max:  2m 29s | Hits:  95%/154   
      🔥 Test               Pass:   0%/1   | Total:  8m 39s | Avg:  8m 39s | Max:  8m 39s
    🟨 cpu
      🟨 amd64              Pass:  50%/2   | Total: 11m 08s | Avg:  5m 34s | Max:  8m 39s | Hits:  95%/154   
    🟨 ctk
      🟨 12.8               Pass:  50%/2   | Total: 11m 08s | Avg:  5m 34s | Max:  8m 39s | Hits:  95%/154   
    🟨 cudacxx
      🟨 nvcc12.8           Pass:  50%/2   | Total: 11m 08s | Avg:  5m 34s | Max:  8m 39s | Hits:  95%/154   
    🟨 cudacxx_family
      🟨 nvcc               Pass:  50%/2   | Total: 11m 08s | Avg:  5m 34s | Max:  8m 39s | Hits:  95%/154   
    🟨 cxx
      🟨 GCC13              Pass:  50%/2   | Total: 11m 08s | Avg:  5m 34s | Max:  8m 39s | Hits:  95%/154   
    🟨 cxx_family
      🟨 GCC                Pass:  50%/2   | Total: 11m 08s | Avg:  5m 34s | Max:  8m 39s | Hits:  95%/154   
    🟨 gpu
      🟨 rtx2080            Pass:  50%/2   | Total: 11m 08s | Avg:  5m 34s | Max:  8m 39s | Hits:  95%/154   
    
  • 🟥 python: Pass: 0%/1 | Total: 7m 08s | Avg: 7m 08s | Max: 7m 08s

    🟥 cpu
      🟥 amd64              Pass:   0%/1   | Total:  7m 08s | Avg:  7m 08s | Max:  7m 08s
    🟥 ctk
      🟥 12.8               Pass:   0%/1   | Total:  7m 08s | Avg:  7m 08s | Max:  7m 08s
    🟥 cudacxx
      🟥 nvcc12.8           Pass:   0%/1   | Total:  7m 08s | Avg:  7m 08s | Max:  7m 08s
    🟥 cudacxx_family
      🟥 nvcc               Pass:   0%/1   | Total:  7m 08s | Avg:  7m 08s | Max:  7m 08s
    🟥 cxx
      🟥 GCC13              Pass:   0%/1   | Total:  7m 08s | Avg:  7m 08s | Max:  7m 08s
    🟥 cxx_family
      🟥 GCC                Pass:   0%/1   | Total:  7m 08s | Avg:  7m 08s | Max:  7m 08s
    🟥 gpu
      🟥 rtx2080            Pass:   0%/1   | Total:  7m 08s | Avg:  7m 08s | Max:  7m 08s
    🟥 jobs
      🟥 Test               Pass:   0%/1   | Total:  7m 08s | Avg:  7m 08s | Max:  7m 08s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
+/- Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 93)

# Runner
66 linux-amd64-cpu16
9 windows-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-arm64-cpu16
3 linux-amd64-gpu-h100-latest-1
3 linux-amd64-gpu-rtx4090-latest-1
2 linux-amd64-gpu-rtx2080-latest-1

Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[FEA] Add iterator wrapping a T* with a custom element stride
3 participants