-
Notifications
You must be signed in to change notification settings - Fork 200
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Extended
<cuda/bit>
operations: bitfield_insert
, `bitfield_extrac…
…t`, `bit_reverse`, `bitmask` (#3941) Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com> Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
- Loading branch information
1 parent
4836c1a
commit b9e7d38
Showing
15 changed files
with
962 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
.. _libcudacxx-extended-api-bit: | ||
|
||
Bit | ||
=== | ||
|
||
.. toctree:: | ||
:hidden: | ||
:maxdepth: 1 | ||
|
||
cuda::bitmask <bit/bitmask> | ||
cuda::bit_reverse <bit/bit_reverse> | ||
cuda::bitfield_insert <bit/bitfield_insert> | ||
cuda::bitfield_extract <bit/bitfield_extract> | ||
|
||
.. list-table:: | ||
:widths: 25 45 30 30 | ||
:header-rows: 0 | ||
|
||
* - :ref:`bitmask <libcudacxx-extended-api-bit-bitmask>` | ||
- Generate a bitmask | ||
- CCCL 3.0.0 | ||
- CUDA 13.0 | ||
|
||
* - :ref:`bit_reverse <libcudacxx-extended-api-bit-bit_reverse>` | ||
- Reverse the order of bits | ||
- CCCL 3.0.0 | ||
- CUDA 13.0 | ||
|
||
* - :ref:`bitfield_insert <libcudacxx-extended-api-bit-bitfield_insert>` | ||
- Insert a bitfield | ||
- CCCL 3.0.0 | ||
- CUDA 13.0 | ||
|
||
* - :ref:`bitfield_extract <libcudacxx-extended-api-bit-bitfield_extract>` | ||
- Extract a bitfield | ||
- CCCL 3.0.0 | ||
- CUDA 13.0 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
.. _libcudacxx-extended-api-bit-bit_reverse: | ||
|
||
``bit_reverse`` | ||
=============== | ||
|
||
.. code:: cpp | ||
template <typename T> | ||
[[nodiscard]] constexpr T | ||
bit_reverse(T value) noexcept; | ||
The function reverses the order of bits in a value. | ||
|
||
**Parameters** | ||
|
||
- ``value``: Input value | ||
|
||
**Return value** | ||
|
||
- Value with reversed bits | ||
|
||
**Mandates** | ||
|
||
- ``T`` is an unsigned integer type. | ||
|
||
**Performance considerations** | ||
|
||
The function performs the following operations: | ||
|
||
- Device: | ||
|
||
- ``uint8_t`` ``uint16_t``: ``PRMT``, ``BREV`` | ||
- ``uint32_t``: ``BREV`` | ||
- ``uint64_t``: ``BREV`` x2, ``MOV`` x2 | ||
- ``uint128_t``: ``BREV`` x4, ``MOV`` x4 | ||
|
||
- Host: ``__builtin_bitreverse<N>`` with clang | ||
|
||
.. note:: | ||
|
||
When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal. | ||
|
||
.. note:: | ||
|
||
GCC <= 8 uses a slow path with more instructions even in CUDA | ||
|
||
Example | ||
------- | ||
|
||
.. code:: cpp | ||
#include <cuda/bit> | ||
#include <cuda/std/cassert> | ||
__global__ void bit_reverse_kernel() { | ||
assert(bitfield_reverse(0u) == ~0u); | ||
assert(bitfield_reverse(uint8_t{0b00001011}) == uint8_t{0b11010000}); | ||
} | ||
int main() { | ||
bit_reverse_kernel<<<1, 1>>>(); | ||
cudaDeviceSynchronize(); | ||
return 0; | ||
} | ||
`See it on Godbolt 🔗 <https://godbolt.org/z/K36dvoh58>`_ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
.. _libcudacxx-extended-api-bit-bitfield_extract: | ||
|
||
``bitfield_extract`` | ||
==================== | ||
|
||
.. code:: cpp | ||
template <typename T> | ||
[[nodiscard]] constexpr T | ||
bitfield_extract(T value, int start, int width) noexcept; | ||
The function extracts a bitfield from a value and returns it in the lower bits. | ||
``bitfield_extract()`` computes ``(value >> start) & bitfield``, where ``bitfield`` is a sequence of bits of width ``width``. | ||
|
||
**Parameters** | ||
|
||
- ``value``: The value to apply the bitfield. | ||
- ``start``: Initial position of the bitfield. | ||
- ``width``: Width of the bitfield. | ||
|
||
**Return value** | ||
|
||
- ``(value >> start) & bitfield``. | ||
|
||
**Mandates** | ||
|
||
- ``T`` is an unsigned integer type. | ||
|
||
**Preconditions** | ||
|
||
- ``start >= 0 && start < num_bits(T)`` | ||
- ``width > 0 && width <= num_bits(T)`` | ||
- ``start + width <= num_bits(T)`` | ||
|
||
**Performance considerations** | ||
|
||
The function performs the following operations in CUDA for ``uint8_t``, ``uint16_t``, ``uint32_t``: | ||
|
||
- ``SM < 70``: ``BFE`` | ||
- ``SM >= 70``: ``BMSK``, bitwise operation x2 | ||
|
||
.. note:: | ||
|
||
When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal. | ||
|
||
.. note:: | ||
|
||
GCC <= 8 uses a slow path with more instructions even in CUDA | ||
|
||
Example | ||
------- | ||
|
||
.. code:: cpp | ||
#include <cuda/bit> | ||
#include <cuda/std/cassert> | ||
__global__ void bitfield_insert_kernel() { | ||
assert(cuda::bitfield_extract(~0u, 0, 4) == 0b1111); | ||
assert(cuda::bitfield_extract(0b1011000, 3, 4) == 0b1011); | ||
} | ||
int main() { | ||
bitfield_insert_kernel<<<1, 1>>>(); | ||
cudaDeviceSynchronize(); | ||
return 0; | ||
} | ||
`See it on Godbolt 🔗 <https://godbolt.org/z/WvqfG9nbP>`_ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
.. _libcudacxx-extended-api-bit-bitfield_insert: | ||
|
||
``bitfield_insert`` | ||
=================== | ||
|
||
.. code:: cpp | ||
template <typename T> | ||
[[nodiscard]] constexpr T | ||
bitfield_insert(T dest, T source, int start, int width) noexcept; | ||
The function extracts the lower bitfield of size ``width`` from ``source`` and inserts it into ``dest`` at position ``start``. | ||
|
||
**Parameters** | ||
|
||
- ``dest``: The value to insert the bitfield. | ||
- ``source``: The value from which extract the bitfield. | ||
- ``start``: Initial position of the bitfield. | ||
- ``width``: Width of the bitfield. | ||
|
||
**Return value** | ||
|
||
- ``((value << start) & mask) | (source & ~mask)``, where ``mask`` is a bitmask of width ``width``. | ||
|
||
**Mandates** | ||
|
||
- ``T`` is an unsigned integer type. | ||
|
||
**Preconditions** | ||
|
||
- ``start >= 0 && start < num_bits(T)`` | ||
- ``width > 0 && width <= num_bits(T)`` | ||
- ``start + width <= num_bits(T)`` | ||
|
||
**Performance considerations** | ||
|
||
The function performs the following operations in CUDA for ``uint8_t``, ``uint16_t``, ``uint32_t``: | ||
|
||
- ``SM < 70``: ``BFI`` | ||
- ``SM >= 70``: ``BMSK``, bitwise operation x5 | ||
|
||
.. note:: | ||
|
||
When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal. | ||
|
||
.. note:: | ||
|
||
GCC <= 8 uses a slow path with more instructions even in CUDA | ||
|
||
Example | ||
------- | ||
|
||
.. code:: cpp | ||
#include <cuda/bit> | ||
#include <cuda/std/cassert> | ||
__global__ void bitfield_insert_kernel() { | ||
assert(cuda::bitfield_insert(0u, 0xFFFFu, 0, 4) == 0b1111); | ||
assert(cuda::bitfield_insert(0u, 0xFFFFu, 3, 4) == 0b1111000); | ||
assert(cuda::bitfield_insert(1u, 0xFFFFu, 3, 4) == 0b1111001); | ||
} | ||
int main() { | ||
bitfield_insert_kernel<<<1, 1>>>(); | ||
cudaDeviceSynchronize(); | ||
return 0; | ||
} | ||
`See it on Godbolt 🔗 <https://godbolt.org/z/Phs8czqes>`_ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
.. _libcudacxx-extended-api-bit-bitmask: | ||
|
||
``bitmask`` | ||
=========== | ||
|
||
.. code:: cpp | ||
template <typename T> | ||
[[nodiscard]] constexpr T | ||
bitmask(int start, int width) noexcept; | ||
The function generates a bitmask of size ``width`` starting at position ``start``. | ||
|
||
**Parameters** | ||
|
||
- ``start``: starting position of the bitmask | ||
- ``width``: width of the bitmask | ||
|
||
**Return value** | ||
|
||
- Bitmask of size ``width`` starting at ``start`` | ||
|
||
**Mandates** | ||
|
||
- ``T`` is an unsigned integral type. | ||
|
||
**Preconditions** | ||
|
||
- ``start >= 0 && start < num_bits(T)`` | ||
- ``width > 0 && width <= num_bits(T)`` | ||
- ``start + width <= num_bits(T)`` | ||
|
||
**Performance considerations** | ||
|
||
The function performs the following operations in device code: | ||
|
||
- ``uint8_t``, ``uint16_t``, ``uint32_t``: ``BMSK`` | ||
- ``uint64_t``: ``SHL`` x4, ``UADD`` x2 | ||
- ``uint128_t``: ``SHL`` x8, ``UADD`` x4 | ||
|
||
.. note:: | ||
|
||
When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the function could not be optimal. | ||
|
||
.. note:: | ||
|
||
GCC <= 8 uses a slow path with more instructions even in CUDA | ||
|
||
Example | ||
------- | ||
|
||
.. code:: cpp | ||
#include <cuda/bit> | ||
#include <cuda/std/cassert> | ||
#include <cuda/std/cstdint> | ||
__global__ void bitmask_kernel() { | ||
assert(cuda::bitmask<uint32_t>(2, 4) == 0b111100u); | ||
assert(cuda::bitmask<uint64_t>(1, 3) == uint64_t{0b1110}); | ||
} | ||
int main() { | ||
bitmask_kernel<<<1, 1>>>(); | ||
cudaDeviceSynchronize(); | ||
return 0; | ||
} | ||
`See it on Godbolt 🔗 <https://godbolt.org/z/PPqP8rTPd>`_ |
Oops, something went wrong.