Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extended <cuda/bit> operations: bitfield_insert, bitfield_extract, bit_reverse, bitmask #3941

Merged
merged 40 commits into from
Mar 5, 2025
Merged
Show file tree
Hide file tree
Changes from 39 commits
Commits
Show all changes
40 commits
Select commit Hold shift + click to select a range
4e3ad46
add bitfield_insert and bitfield_extract
fbusato Feb 26, 2025
139ec3b
add bit_reverse
fbusato Feb 26, 2025
fe1e0ff
fix macros
fbusato Feb 26, 2025
3d85c5a
add bit_reverse header
fbusato Feb 26, 2025
113e3a7
add bitfield test
fbusato Feb 26, 2025
7ba0e87
fix bit_reverse implementation
fbusato Feb 26, 2025
be3a126
add bfe/bfi
fbusato Feb 26, 2025
836192c
fix asm statements
fbusato Feb 27, 2025
f3bf56f
try to remove macros for device code
fbusato Feb 27, 2025
3f6c2f5
protect asm statement from MSVC
fbusato Feb 27, 2025
8636b60
add documentation
fbusato Feb 27, 2025
4ecb37c
add bitmask function
fbusato Feb 27, 2025
fe63dd8
use bitmask and refactor bitfield_insert
fbusato Feb 27, 2025
3457737
refactor documentation
fbusato Feb 27, 2025
79d5c25
fix MSVC warning
fbusato Feb 27, 2025
31c8d93
fix mask generation in tests
fbusato Feb 27, 2025
45cfcb9
fix MSVC warning
fbusato Feb 28, 2025
8b91731
fix documentation typos
fbusato Feb 28, 2025
d612fe1
optmize shift
fbusato Feb 28, 2025
5a3544a
update docs
fbusato Feb 28, 2025
bc231d1
improve bitmask implementation
fbusato Feb 28, 2025
1238a7c
add new tests
fbusato Feb 28, 2025
2973b85
fix bfi
fbusato Feb 28, 2025
ee3e41d
add _CCCL_BUILTIN_BITREVERSE
fbusato Mar 3, 2025
f7f2532
fix _CCCL_BUILTIN_BITREVERSE
fbusato Mar 3, 2025
e870511
replace __CUDA_ARCH__ with __cccl_ptx_isa
fbusato Mar 3, 2025
20aaf70
increase _CCCL_HAS_INT128 scope
fbusato Mar 3, 2025
5b00c97
exclude CLANG from __builtin_bitreverse on device
fbusato Mar 3, 2025
d4d675c
modify _CCCL_BUILTIN_BITREVERSE32 guard
fbusato Mar 3, 2025
3c2963b
disable __builtin_bitreverse8 with nvcc
fbusato Mar 3, 2025
60f29d6
Add include
miscco Mar 4, 2025
0eaf2bf
Update libcudacxx/include/cuda/bit
fbusato Mar 4, 2025
e4186ed
Update docs/libcudacxx/extended_api/bit/bitmask.rst
fbusato Mar 4, 2025
47fc5ac
Update docs/libcudacxx/extended_api/bit/bitfield_insert.rst
fbusato Mar 4, 2025
34b0d48
Update docs/libcudacxx/extended_api/bit/bitfield_extract.rst
fbusato Mar 4, 2025
4b180ae
Update docs/libcudacxx/extended_api/bit/bitfield_extract.rst
fbusato Mar 4, 2025
98e52cd
Update docs/libcudacxx/extended_api/bit/bit_reverse.rst
fbusato Mar 4, 2025
3baa1eb
Update docs/libcudacxx/extended_api/bit/bitfield_insert.rst
fbusato Mar 4, 2025
15031ed
improve documentation
fbusato Mar 4, 2025
442beb5
Merge branch 'main' into extended-bit-operations
fbusato Mar 5, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/libcudacxx/extended_api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ Extended API
.. toctree::
:maxdepth: 2

extended_api/bit
extended_api/memory_model
extended_api/thread_groups
extended_api/shapes
Expand Down
37 changes: 37 additions & 0 deletions docs/libcudacxx/extended_api/bit.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
.. _libcudacxx-extended-api-bit:

Bit
===

.. toctree::
:hidden:
:maxdepth: 1

cuda::bitmask <bit/bitmask>
cuda::bit_reverse <bit/bit_reverse>
cuda::bitfield_insert <bit/bitfield_insert>
cuda::bitfield_extract <bit/bitfield_extract>

.. list-table::
:widths: 25 45 30 30
:header-rows: 0

* - :ref:`bitmask <libcudacxx-extended-api-bit-bitmask>`
- Generate a bitmask
- CCCL 3.0.0
- CUDA 13.0

* - :ref:`bit_reverse <libcudacxx-extended-api-bit-bit_reverse>`
- Reverse the order of bits
- CCCL 3.0.0
- CUDA 13.0

* - :ref:`bitfield_insert <libcudacxx-extended-api-bit-bitfield_insert>`
- Insert a bitfield
- CCCL 3.0.0
- CUDA 13.0

* - :ref:`bitfield_extract <libcudacxx-extended-api-bit-bitfield_extract>`
- Extract a bitfield
- CCCL 3.0.0
- CUDA 13.0
66 changes: 66 additions & 0 deletions docs/libcudacxx/extended_api/bit/bit_reverse.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
.. _libcudacxx-extended-api-bit-bit_reverse:

``bit_reverse``
===============

.. code:: cpp

template <typename T>
[[nodiscard]] constexpr T
bit_reverse(T value) noexcept;

The function reverses the order of bits in a value.

**Parameters**

- ``value``: Input value

**Return value**

- Value with reversed bits

**Mandates**

- ``T`` is an unsigned integer type.

**Performance considerations**

The function performs the following operations:

- Device:

- ``uint8_t`` ``uint16_t``: ``PRMT``, ``BREV``
- ``uint32_t``: ``BREV``
- ``uint64_t``: ``BREV`` x2, ``MOV`` x2
- ``uint128_t``: ``BREV`` x4, ``MOV`` x4

- Host: ``__builtin_bitreverse<N>`` with clang

.. note::

When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal.

.. note::

GCC <= 8 uses a slow path with more instructions even in CUDA

Example
-------

.. code:: cpp

#include <cuda/bit>
#include <cuda/std/cassert>

__global__ void bit_reverse_kernel() {
assert(bitfield_reverse(0u) == ~0u);
assert(bitfield_reverse(uint8_t{0b00001011}) == uint8_t{0b11010000});
}

int main() {
bit_reverse_kernel<<<1, 1>>>();
cudaDeviceSynchronize();
return 0;
}

`See it on Godbolt 🔗 <https://godbolt.org/z/K36dvoh58>`_
69 changes: 69 additions & 0 deletions docs/libcudacxx/extended_api/bit/bitfield_extract.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
.. _libcudacxx-extended-api-bit-bitfield_extract:

``bitfield_extract``
====================

.. code:: cpp

template <typename T>
[[nodiscard]] constexpr T
bitfield_extract(T value, int start, int width) noexcept;

The function extracts a bitfield from a value and returns it in the lower bits.
``bitfield_extract()`` computes ``(value >> start) & bitfield``, where ``bitfield`` is a sequence of bits of width ``width``.

**Parameters**

- ``value``: The value to apply the bitfield.
- ``start``: Initial position of the bitfield.
- ``width``: Width of the bitfield.

**Return value**

- ``(value >> start) & bitfield``.

**Mandates**

- ``T`` is an unsigned integer type.

**Preconditions**

- ``start >= 0 && start < num_bits(T)``
- ``width > 0 && width <= num_bits(T)``
- ``start + width <= num_bits(T)``

**Performance considerations**

The function performs the following operations in CUDA for ``uint8_t``, ``uint16_t``, ``uint32_t``:

- ``SM < 70``: ``BFE``
- ``SM >= 70``: ``BMSK``, bitwise operation x2

.. note::

When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal.

.. note::

GCC <= 8 uses a slow path with more instructions even in CUDA

Example
-------

.. code:: cpp

#include <cuda/bit>
#include <cuda/std/cassert>

__global__ void bitfield_insert_kernel() {
assert(cuda::bitfield_extract(~0u, 0, 4) == 0b1111);
assert(cuda::bitfield_extract(0b1011000, 3, 4) == 0b1011);
}

int main() {
bitfield_insert_kernel<<<1, 1>>>();
cudaDeviceSynchronize();
return 0;
}

`See it on Godbolt 🔗 <https://godbolt.org/z/WvqfG9nbP>`_
70 changes: 70 additions & 0 deletions docs/libcudacxx/extended_api/bit/bitfield_insert.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
.. _libcudacxx-extended-api-bit-bitfield_insert:

``bitfield_insert``
===================

.. code:: cpp

template <typename T>
[[nodiscard]] constexpr T
bitfield_insert(T dest, T source, int start, int width) noexcept;

The function extracts the lower bitfield of size ``width`` from ``source`` and inserts it into ``dest`` at position ``start``.

**Parameters**

- ``dest``: The value to insert the bitfield.
- ``source``: The value from which extract the bitfield.
- ``start``: Initial position of the bitfield.
- ``width``: Width of the bitfield.

**Return value**

- ``((value << start) & mask) | (source & ~mask)``, where ``mask`` is a bitmask of width ``width``.

**Mandates**

- ``T`` is an unsigned integer type.

**Preconditions**

- ``start >= 0 && start < num_bits(T)``
- ``width > 0 && width <= num_bits(T)``
- ``start + width <= num_bits(T)``

**Performance considerations**

The function performs the following operations in CUDA for ``uint8_t``, ``uint16_t``, ``uint32_t``:

- ``SM < 70``: ``BFI``
- ``SM >= 70``: ``BMSK``, bitwise operation x5

.. note::

When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal.

.. note::

GCC <= 8 uses a slow path with more instructions even in CUDA

Example
-------

.. code:: cpp

#include <cuda/bit>
#include <cuda/std/cassert>

__global__ void bitfield_insert_kernel() {
assert(cuda::bitfield_insert(0u, 0xFFFFu, 0, 4) == 0b1111);
assert(cuda::bitfield_insert(0u, 0xFFFFu, 3, 4) == 0b1111000);
assert(cuda::bitfield_insert(1u, 0xFFFFu, 3, 4) == 0b1111001);
}

int main() {
bitfield_insert_kernel<<<1, 1>>>();
cudaDeviceSynchronize();
return 0;
}

`See it on Godbolt 🔗 <https://godbolt.org/z/Phs8czqes>`_
69 changes: 69 additions & 0 deletions docs/libcudacxx/extended_api/bit/bitmask.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
.. _libcudacxx-extended-api-bit-bitmask:

``bitmask``
===========

.. code:: cpp

template <typename T>
[[nodiscard]] constexpr T
bitmask(int start, int width) noexcept;

The function generates a bitmask of size ``width`` starting at position ``start``.

**Parameters**

- ``start``: starting position of the bitmask
- ``width``: width of the bitmask

**Return value**

- Bitmask of size ``width`` starting at ``start``

**Mandates**

- ``T`` is an unsigned integral type.

**Preconditions**

- ``start >= 0 && start < num_bits(T)``
- ``width > 0 && width <= num_bits(T)``
- ``start + width <= num_bits(T)``

**Performance considerations**

The function performs the following operations in device code:

- ``uint8_t``, ``uint16_t``, ``uint32_t``: ``BMSK``
- ``uint64_t``: ``SHL`` x4, ``UADD`` x2
- ``uint128_t``: ``SHL`` x8, ``UADD`` x4

.. note::

When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal.

.. note::

GCC <= 8 uses a slow path with more instructions even in CUDA

Example
-------

.. code:: cpp

#include <cuda/bit>
#include <cuda/std/cassert>
#include <cuda/std/cstdint>

__global__ void bitmask_kernel() {
assert(cuda::bitmask<uint32_t>(2, 4) == 0b111100u);
assert(cuda::bitmask<uint64_t>(1, 3) == uint64_t{0b1110});
}

int main() {
bitmask_kernel<<<1, 1>>>();
cudaDeviceSynchronize();
return 0;
}

`See it on Godbolt 🔗 <https://godbolt.org/z/PPqP8rTPd>`_
Loading
Loading