NVIDIA · fbusato · Mar 5, 2025 · Feb 26, 2025 · Feb 26, 2025 · Feb 26, 2025
@@ -6,6 +6,7 @@ Extended API
 .. toctree::
    :maxdepth: 2
 
+   extended_api/bit
    extended_api/memory_model
    extended_api/thread_groups
    extended_api/shapes

@@ -0,0 +1,37 @@
+.. _libcudacxx-extended-api-bit:
+
+Bit
+===
+
+.. toctree::
+   :hidden:
+   :maxdepth: 1
+
+   cuda::bitmask <bit/bitmask>
+   cuda::bit_reverse <bit/bit_reverse>
+   cuda::bitfield_insert <bit/bitfield_insert>
+   cuda::bitfield_extract <bit/bitfield_extract>
+
+.. list-table::
+   :widths: 25 45 30 30
+   :header-rows: 0
+
+   * - :ref:`bitmask <libcudacxx-extended-api-bit-bitmask>`
+     - Generate a bitmask
+     - CCCL 3.0.0
+     - CUDA 13.0
+
+   * - :ref:`bit_reverse <libcudacxx-extended-api-bit-bit_reverse>`
+     - Reverse the order of bits
+     - CCCL 3.0.0
+     - CUDA 13.0
+
+   * - :ref:`bitfield_insert <libcudacxx-extended-api-bit-bitfield_insert>`
+     - Insert a bitfield
+     - CCCL 3.0.0
+     - CUDA 13.0
+
+   * - :ref:`bitfield_extract <libcudacxx-extended-api-bit-bitfield_extract>`
+     - Extract a bitfield
+     - CCCL 3.0.0
+     - CUDA 13.0
@@ -0,0 +1,66 @@
+.. _libcudacxx-extended-api-bit-bit_reverse:
+
+``bit_reverse``
+===============
+
+.. code:: cpp
+
+   template <typename T>
+   [[nodiscard]] constexpr T
+   bit_reverse(T value) noexcept;
+
+The function reverses the order of bits in a value.
+
+**Parameters**
+
+- ``value``: Input value
+
+**Return value**
+
+- Value with reversed bits
+
+**Mandates**
+
+- ``T`` is an unsigned integer type.
+
+**Performance considerations**
+
+The function performs the following operations:
+
+- Device:
+
+    - ``uint8_t`` ``uint16_t``: ``PRMT``, ``BREV``
+    - ``uint32_t``: ``BREV``
+    - ``uint64_t``: ``BREV`` x2, ``MOV`` x2
+    - ``uint128_t``: ``BREV`` x4, ``MOV`` x4
+
+- Host: ``__builtin_bitreverse<N>`` with clang
+
+.. note::
+
+    When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal.
+
+.. note::
+
+    GCC <= 8 uses a slow path with more instructions even in CUDA
+
+Example
+-------
+
+.. code:: cpp
+
+    #include <cuda/bit>
+    #include <cuda/std/cassert>
+
+    __global__ void bit_reverse_kernel() {
+        assert(bitfield_reverse(0u) == ~0u);
+        assert(bitfield_reverse(uint8_t{0b00001011}) == uint8_t{0b11010000});
+    }
+
+    int main() {
+        bit_reverse_kernel<<<1, 1>>>();
+        cudaDeviceSynchronize();
+        return 0;
+    }
+
+`See it on Godbolt 🔗 <https://godbolt.org/z/K36dvoh58>`_
@@ -0,0 +1,69 @@
+.. _libcudacxx-extended-api-bit-bitfield_extract:
+
+``bitfield_extract``
+====================
+
+.. code:: cpp
+
+   template <typename T>
+   [[nodiscard]] constexpr T
+   bitfield_extract(T value, int start, int width) noexcept;
+
+The function extracts a bitfield from a value and returns it in the lower bits.
+``bitfield_extract()`` computes ``(value >> start) & bitfield``, where ``bitfield`` is a sequence of bits of width ``width``.
+
+**Parameters**
+
+- ``value``: The value to apply the bitfield.
+- ``start``:  Initial position of the bitfield.
+- ``width``:  Width of the bitfield.
+
+**Return value**
+
+- ``(value >> start) & bitfield``.
+
+**Mandates**
+
+- ``T`` is an unsigned integer type.
+
+**Preconditions**
+
+    - ``start >= 0 && start < num_bits(T)``
+    - ``width >  0 && width <= num_bits(T)``
+    - ``start + width <= num_bits(T)``
+
+**Performance considerations**
+
+The function performs the following operations in CUDA for ``uint8_t``, ``uint16_t``, ``uint32_t``:
+
+- ``SM < 70``: ``BFE``
+- ``SM >= 70``: ``BMSK``, bitwise operation x2
+
+.. note::
+
+    When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal.
+
+.. note::
+
+    GCC <= 8 uses a slow path with more instructions even in CUDA
+
+Example
+-------
+
+.. code:: cpp
+
+    #include <cuda/bit>
+    #include <cuda/std/cassert>
+
+    __global__ void bitfield_insert_kernel() {
+        assert(cuda::bitfield_extract(~0u, 0, 4) == 0b1111);
+        assert(cuda::bitfield_extract(0b1011000, 3, 4) == 0b1011);
+    }
+
+    int main() {
+        bitfield_insert_kernel<<<1, 1>>>();
+        cudaDeviceSynchronize();
+        return 0;
+    }
+
+`See it on Godbolt 🔗 <https://godbolt.org/z/WvqfG9nbP>`_
@@ -0,0 +1,70 @@
+.. _libcudacxx-extended-api-bit-bitfield_insert:
+
+``bitfield_insert``
+===================
+
+.. code:: cpp
+
+   template <typename T>
+   [[nodiscard]] constexpr T
+   bitfield_insert(T dest, T source, int start, int width) noexcept;
+
+The function extracts the lower bitfield of size ``width`` from ``source`` and inserts it into ``dest`` at position ``start``.
+
+**Parameters**
+
+- ``dest``:   The value to insert the bitfield.
+- ``source``: The value from which extract the bitfield.
+- ``start``:  Initial position of the bitfield.
+- ``width``:  Width of the bitfield.
+
+**Return value**
+
+- ``((value << start) & mask) | (source & ~mask)``, where ``mask`` is a bitmask of width ``width``.
+
+**Mandates**
+
+- ``T`` is an unsigned integer type.
+
+**Preconditions**
+
+    - ``start >= 0 && start < num_bits(T)``
+    - ``width >  0 && width <= num_bits(T)``
+    - ``start + width <= num_bits(T)``
+
+**Performance considerations**
+
+The function performs the following operations in CUDA for ``uint8_t``, ``uint16_t``, ``uint32_t``:
+
+- ``SM < 70``: ``BFI``
+- ``SM >= 70``: ``BMSK``, bitwise operation x5
+
+.. note::
+
+    When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal.
+
+.. note::
+
+    GCC <= 8 uses a slow path with more instructions even in CUDA
+
+Example
+-------
+
+.. code:: cpp
+
+    #include <cuda/bit>
+    #include <cuda/std/cassert>
+
+    __global__ void bitfield_insert_kernel() {
+        assert(cuda::bitfield_insert(0u, 0xFFFFu, 0, 4) == 0b1111);
+        assert(cuda::bitfield_insert(0u, 0xFFFFu, 3, 4) == 0b1111000);
+        assert(cuda::bitfield_insert(1u, 0xFFFFu, 3, 4) == 0b1111001);
+    }
+
+    int main() {
+        bitfield_insert_kernel<<<1, 1>>>();
+        cudaDeviceSynchronize();
+        return 0;
+    }
+
+`See it on Godbolt 🔗 <https://godbolt.org/z/Phs8czqes>`_
@@ -0,0 +1,69 @@
+.. _libcudacxx-extended-api-bit-bitmask:
+
+``bitmask``
+===========
+
+.. code:: cpp
+
+   template <typename T>
+   [[nodiscard]] constexpr T
+   bitmask(int start, int width) noexcept;
+
+The function generates a bitmask of size ``width`` starting at position ``start``.
+
+**Parameters**
+
+- ``start``: starting position of the bitmask
+- ``width``: width of the bitmask
+
+**Return value**
+
+- Bitmask of size ``width`` starting at ``start``
+
+**Mandates**
+
+- ``T`` is an unsigned integral type.
+
+**Preconditions**
+
+    - ``start >= 0 && start < num_bits(T)``
+    - ``width >  0 && width <= num_bits(T)``
+    - ``start + width <= num_bits(T)``
+
+**Performance considerations**
+
+The function performs the following operations in device code:
+
+- ``uint8_t``, ``uint16_t``, ``uint32_t``: ``BMSK``
+- ``uint64_t``: ``SHL`` x4, ``UADD`` x2
+- ``uint128_t``: ``SHL`` x8, ``UADD`` x4
+
+.. note::
+
+    When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal.
+
+.. note::
+
+    GCC <= 8 uses a slow path with more instructions even in CUDA
+
+Example
+-------
+
+.. code:: cpp
+
+    #include <cuda/bit>
+    #include <cuda/std/cassert>
+    #include <cuda/std/cstdint>
+
+    __global__ void bitmask_kernel() {
+        assert(cuda::bitmask<uint32_t>(2, 4) == 0b111100u);
+        assert(cuda::bitmask<uint64_t>(1, 3) == uint64_t{0b1110});
+    }
+
+    int main() {
+        bitmask_kernel<<<1, 1>>>();
+        cudaDeviceSynchronize();
+        return 0;
+    }
+
+`See it on Godbolt 🔗 <https://godbolt.org/z/PPqP8rTPd>`_