Skip to content

Commit

Permalink
Extended <cuda/bit> operations: bitfield_insert, `bitfield_extrac…
Browse files Browse the repository at this point in the history
…t`, `bit_reverse`, `bitmask` (#3941)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
  • Loading branch information
3 people authored Mar 5, 2025
1 parent 4836c1a commit b9e7d38
Show file tree
Hide file tree
Showing 15 changed files with 962 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/libcudacxx/extended_api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ Extended API
.. toctree::
:maxdepth: 2

extended_api/bit
extended_api/execution_model
extended_api/memory_model
extended_api/thread_groups
Expand Down
37 changes: 37 additions & 0 deletions docs/libcudacxx/extended_api/bit.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
.. _libcudacxx-extended-api-bit:

Bit
===

.. toctree::
:hidden:
:maxdepth: 1

cuda::bitmask <bit/bitmask>
cuda::bit_reverse <bit/bit_reverse>
cuda::bitfield_insert <bit/bitfield_insert>
cuda::bitfield_extract <bit/bitfield_extract>

.. list-table::
:widths: 25 45 30 30
:header-rows: 0

* - :ref:`bitmask <libcudacxx-extended-api-bit-bitmask>`
- Generate a bitmask
- CCCL 3.0.0
- CUDA 13.0

* - :ref:`bit_reverse <libcudacxx-extended-api-bit-bit_reverse>`
- Reverse the order of bits
- CCCL 3.0.0
- CUDA 13.0

* - :ref:`bitfield_insert <libcudacxx-extended-api-bit-bitfield_insert>`
- Insert a bitfield
- CCCL 3.0.0
- CUDA 13.0

* - :ref:`bitfield_extract <libcudacxx-extended-api-bit-bitfield_extract>`
- Extract a bitfield
- CCCL 3.0.0
- CUDA 13.0
66 changes: 66 additions & 0 deletions docs/libcudacxx/extended_api/bit/bit_reverse.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
.. _libcudacxx-extended-api-bit-bit_reverse:

``bit_reverse``
===============

.. code:: cpp
template <typename T>
[[nodiscard]] constexpr T
bit_reverse(T value) noexcept;
The function reverses the order of bits in a value.

**Parameters**

- ``value``: Input value

**Return value**

- Value with reversed bits

**Mandates**

- ``T`` is an unsigned integer type.

**Performance considerations**

The function performs the following operations:

- Device:

- ``uint8_t`` ``uint16_t``: ``PRMT``, ``BREV``
- ``uint32_t``: ``BREV``
- ``uint64_t``: ``BREV`` x2, ``MOV`` x2
- ``uint128_t``: ``BREV`` x4, ``MOV`` x4

- Host: ``__builtin_bitreverse<N>`` with clang

.. note::

When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal.

.. note::

GCC <= 8 uses a slow path with more instructions even in CUDA

Example
-------

.. code:: cpp
#include <cuda/bit>
#include <cuda/std/cassert>
__global__ void bit_reverse_kernel() {
assert(bitfield_reverse(0u) == ~0u);
assert(bitfield_reverse(uint8_t{0b00001011}) == uint8_t{0b11010000});
}
int main() {
bit_reverse_kernel<<<1, 1>>>();
cudaDeviceSynchronize();
return 0;
}
`See it on Godbolt 🔗 <https://godbolt.org/z/K36dvoh58>`_
69 changes: 69 additions & 0 deletions docs/libcudacxx/extended_api/bit/bitfield_extract.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
.. _libcudacxx-extended-api-bit-bitfield_extract:

``bitfield_extract``
====================

.. code:: cpp
template <typename T>
[[nodiscard]] constexpr T
bitfield_extract(T value, int start, int width) noexcept;
The function extracts a bitfield from a value and returns it in the lower bits.
``bitfield_extract()`` computes ``(value >> start) & bitfield``, where ``bitfield`` is a sequence of bits of width ``width``.

**Parameters**

- ``value``: The value to apply the bitfield.
- ``start``: Initial position of the bitfield.
- ``width``: Width of the bitfield.

**Return value**

- ``(value >> start) & bitfield``.

**Mandates**

- ``T`` is an unsigned integer type.

**Preconditions**

- ``start >= 0 && start < num_bits(T)``
- ``width > 0 && width <= num_bits(T)``
- ``start + width <= num_bits(T)``

**Performance considerations**

The function performs the following operations in CUDA for ``uint8_t``, ``uint16_t``, ``uint32_t``:

- ``SM < 70``: ``BFE``
- ``SM >= 70``: ``BMSK``, bitwise operation x2

.. note::

When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal.

.. note::

GCC <= 8 uses a slow path with more instructions even in CUDA

Example
-------

.. code:: cpp
#include <cuda/bit>
#include <cuda/std/cassert>
__global__ void bitfield_insert_kernel() {
assert(cuda::bitfield_extract(~0u, 0, 4) == 0b1111);
assert(cuda::bitfield_extract(0b1011000, 3, 4) == 0b1011);
}
int main() {
bitfield_insert_kernel<<<1, 1>>>();
cudaDeviceSynchronize();
return 0;
}
`See it on Godbolt 🔗 <https://godbolt.org/z/WvqfG9nbP>`_
70 changes: 70 additions & 0 deletions docs/libcudacxx/extended_api/bit/bitfield_insert.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
.. _libcudacxx-extended-api-bit-bitfield_insert:

``bitfield_insert``
===================

.. code:: cpp
template <typename T>
[[nodiscard]] constexpr T
bitfield_insert(T dest, T source, int start, int width) noexcept;
The function extracts the lower bitfield of size ``width`` from ``source`` and inserts it into ``dest`` at position ``start``.

**Parameters**

- ``dest``: The value to insert the bitfield.
- ``source``: The value from which extract the bitfield.
- ``start``: Initial position of the bitfield.
- ``width``: Width of the bitfield.

**Return value**

- ``((value << start) & mask) | (source & ~mask)``, where ``mask`` is a bitmask of width ``width``.

**Mandates**

- ``T`` is an unsigned integer type.

**Preconditions**

- ``start >= 0 && start < num_bits(T)``
- ``width > 0 && width <= num_bits(T)``
- ``start + width <= num_bits(T)``

**Performance considerations**

The function performs the following operations in CUDA for ``uint8_t``, ``uint16_t``, ``uint32_t``:

- ``SM < 70``: ``BFI``
- ``SM >= 70``: ``BMSK``, bitwise operation x5

.. note::

When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal.

.. note::

GCC <= 8 uses a slow path with more instructions even in CUDA

Example
-------

.. code:: cpp
#include <cuda/bit>
#include <cuda/std/cassert>
__global__ void bitfield_insert_kernel() {
assert(cuda::bitfield_insert(0u, 0xFFFFu, 0, 4) == 0b1111);
assert(cuda::bitfield_insert(0u, 0xFFFFu, 3, 4) == 0b1111000);
assert(cuda::bitfield_insert(1u, 0xFFFFu, 3, 4) == 0b1111001);
}
int main() {
bitfield_insert_kernel<<<1, 1>>>();
cudaDeviceSynchronize();
return 0;
}
`See it on Godbolt 🔗 <https://godbolt.org/z/Phs8czqes>`_
69 changes: 69 additions & 0 deletions docs/libcudacxx/extended_api/bit/bitmask.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
.. _libcudacxx-extended-api-bit-bitmask:

``bitmask``
===========

.. code:: cpp
template <typename T>
[[nodiscard]] constexpr T
bitmask(int start, int width) noexcept;
The function generates a bitmask of size ``width`` starting at position ``start``.

**Parameters**

- ``start``: starting position of the bitmask
- ``width``: width of the bitmask

**Return value**

- Bitmask of size ``width`` starting at ``start``

**Mandates**

- ``T`` is an unsigned integral type.

**Preconditions**

- ``start >= 0 && start < num_bits(T)``
- ``width > 0 && width <= num_bits(T)``
- ``start + width <= num_bits(T)``

**Performance considerations**

The function performs the following operations in device code:

- ``uint8_t``, ``uint16_t``, ``uint32_t``: ``BMSK``
- ``uint64_t``: ``SHL`` x4, ``UADD`` x2
- ``uint128_t``: ``SHL`` x8, ``UADD`` x4

.. note::

When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal.

.. note::

GCC <= 8 uses a slow path with more instructions even in CUDA

Example
-------

.. code:: cpp
#include <cuda/bit>
#include <cuda/std/cassert>
#include <cuda/std/cstdint>
__global__ void bitmask_kernel() {
assert(cuda::bitmask<uint32_t>(2, 4) == 0b111100u);
assert(cuda::bitmask<uint64_t>(1, 3) == uint64_t{0b1110});
}
int main() {
bitmask_kernel<<<1, 1>>>();
cudaDeviceSynchronize();
return 0;
}
`See it on Godbolt 🔗 <https://godbolt.org/z/PPqP8rTPd>`_
Loading

0 comments on commit b9e7d38

Please sign in to comment.