From e8f5ce321c94c5f2e5f40d1f037b302ffe92cf11 Mon Sep 17 00:00:00 2001 From: David Galiffi Date: Wed, 15 May 2024 18:32:59 -0400 Subject: [PATCH] Starting to fix linting errors in markdown files. --- .markdownlint.yaml | 10 + Applications/README.md | 13 +- Dockerfiles/README.md | 15 +- Docs/CONTRIBUTING.md | 54 +++- HIP-Basic/README.md | 13 +- HIP-Basic/saxpy/README.md | 21 +- HIP-Basic/shared_memory/README.md | 16 +- HIP-Basic/static_device_library/README.md | 15 ++ HIP-Basic/static_host_library/README.md | 20 ++ HIP-Basic/streams/README.md | 5 + HIP-Basic/texture_management/README.md | 9 +- HIP-Basic/vulkan_interop/README.md | 112 ++++---- HIP-Basic/warp_shuffle/README.md | 5 + LICENSE.md | 2 +- ...-Sanitizer-with-a-Short-HIP-Application.md | 19 +- Libraries/hipBLAS/README.md | 14 +- .../hipBLAS/gemm_strided_batched/README.md | 69 ++--- Libraries/hipBLAS/her/README.md | 29 ++- Libraries/hipBLAS/scal/README.md | 19 +- README.md | 246 ++++++++++-------- 20 files changed, 451 insertions(+), 255 deletions(-) create mode 100644 .markdownlint.yaml diff --git a/.markdownlint.yaml b/.markdownlint.yaml new file mode 100644 index 000000000..253e6e474 --- /dev/null +++ b/.markdownlint.yaml @@ -0,0 +1,10 @@ +MD013: false +MD024: + siblings_only: true +MD026: + punctuation: ".,;:!" +MD029: + style: ordered +MD033: false +MD034: false +MD041: false diff --git a/Applications/README.md b/Applications/README.md index 626d275df..ac757541d 100644 --- a/Applications/README.md +++ b/Applications/README.md @@ -1,26 +1,33 @@ # Applications Examples ## Summary + The examples in this subdirectory showcase several GPU-implementations of finance, computer science, physics, etc. models or algorithms that additionally offer a command line application. The examples are built on Linux for the ROCm (AMD GPU) backend. Some examples additionally support the CUDA (NVIDIA GPU) backend. 
## Prerequisites + ### Linux + - [CMake](https://cmake.org/download/) (at least version 3.21) - OR GNU Make - available via the distribution's package manager - [ROCm](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.1.3/page/Overview_of_ROCm_Installation_Methods.html) (at least version 5.x.x) ### Windows + - [Visual Studio](https://visualstudio.microsoft.com/) 2019 or 2022 with the "Desktop Development with C++" workload - ROCm toolchain for Windows (No public release yet) - - The Visual Studio ROCm extension needs to be installed to build with the solution files. + - The Visual Studio ROCm extension needs to be installed to build with the solution files. - [CMake](https://cmake.org/download/) (optional, to build with CMake. Requires at least version 3.21) - [Ninja](https://ninja-build.org/) (optional, to build with CMake) ## Building + ### Linux + Make sure that the dependencies are installed, or use one of the [provided Dockerfiles](../../Dockerfiles/) to build and run the examples in a containerized environment. #### Using CMake + All examples in the `Applications` subdirectory can either be built by a single CMake project or be built independently. - `$ cd Libraries/Applications` @@ -28,16 +35,20 @@ All examples in the `Applications` subdirectory can either be built by a single - `$ cmake --build build` #### Using Make + All examples can be built by a single invocation to Make or be built independently. - `$ cd Libraries/Applications` - `$ make` (on ROCm) or `$ make GPU_RUNTIME=CUDA` (on CUDA, when supported) ### Windows + #### Visual Studio + Visual Studio solution files are available for the individual examples. To build all supported HIP runtime examples open the top level solution file [ROCm-Examples-VS2019.sln](../../ROCm-Examples-VS2019.sln) and filter for Applications. For more detailed build instructions refer to the top level [README.md](../../README.md#visual-studio). 
#### CMake + All examples in the `Applications` subdirectory can either be built by a single CMake project or be built independently. For build instructions refer to the top-level [README.md](../../README.md#cmake-2). diff --git a/Dockerfiles/README.md b/Dockerfiles/README.md index cfe347da0..b61ee2049 100644 --- a/Dockerfiles/README.md +++ b/Dockerfiles/README.md @@ -4,22 +4,27 @@ This folder hosts Dockerfiles with ready-to-use environments for the various sam Each sample describes which environment it can be used with. ## Building + From this folder execute -``` + +``` bash docker build . -f <dockerfile name> -t <image name> ``` ## List of Dockerfiles + ### HIP libraries on the ROCm platform based on Ubuntu + Dockerfile: [hip-libraries-rocm-ubuntu.Dockerfile](hip-libraries-rocm-ubuntu.Dockerfile) -This is environment is based on Ubuntu targeting the ROCm platform. It has the HIP runtime and -the ROCm libraries installed. CMake is also installed in the image. +This environment is based on Ubuntu targeting the ROCm platform. It has the +HIP runtime and the ROCm libraries installed. CMake is also installed in the image. It can be used with most of the samples when running on a ROCm target. ### HIP libraries on the CUDA platform based on Ubuntu + Dockerfile: [hip-libraries-cuda-ubuntu.Dockerfile](hip-libraries-cuda-ubuntu.Dockerfile) -This is environment is based on Ubuntu targeting the CUDA platform. It has the HIP runtime and -the ROCm libraries installed. CMake is also installed in the image. +This environment is based on Ubuntu targeting the CUDA platform. It has the +HIP runtime and the ROCm libraries installed. CMake is also installed in the image. It can be used with the samples that support the CUDA target. 
diff --git a/Docs/CONTRIBUTING.md b/Docs/CONTRIBUTING.md index 24c5d2a8d..89adf47a4 100644 --- a/Docs/CONTRIBUTING.md +++ b/Docs/CONTRIBUTING.md @@ -1,33 +1,61 @@ # Guidelines -To keep the style of the examples consistent, please follow the following guidelines when implementing your example. + +To keep the style of the examples consistent, please follow these +guidelines when implementing your example. ## Make/CMake -Each example has to at least support `CMake` as build system. The simpler examples should also support `Make`.
-Every example has to be able to be built separately from the others, but also has to be added to the top-level build scripts. + +Each example has to at least support `CMake` as a build system. +The simpler examples should also support `Make`.
+Every example must be buildable separately from the others, +but also has to be added to the top-level build scripts. ## Code Format -The formatting rules of the examples are enforced by `clang-format` using the `.clang-format` file in the top-level directory. + +The formatting rules of the examples are enforced by `clang-format` using the +`.clang-format` file in the top-level directory. ## Variable Naming Conventions -- Use `lower_snake_case` style to name variables and functions (e.g. block_size, multiply_kernel and multiply_host). + +- Use `lower_snake_case` style to name variables and functions (e.g. block_size, +multiply_kernel and multiply_host). - Use `PascalCase` for `class`, `struct`, `enum` and template argument definitions. ## File and Directory Naming Conventions + - Top-level directories use `PascalCase`. -- The directories in Libraries/ should use the exact name of the library they represent, including casing. If any directory does not represent a library, it should named in `camelCase`. +- The directories in Libraries/ should use the exact name of the library they +represent, including casing. If any directory does not represent a library, it +should be named in `camelCase`. - Directories for individual examples use `snake_case`. -- Files generally use `snake_case`, with the exception of files for which an existing convention already applies (`README.md`, `LICENSE.md`, `CMakeLists.txt`, etc). +- Files generally use `snake_case`, with the exception of files for which an +existing convention already applies (`README.md`, `LICENSE.md`, `CMakeLists.txt`, + etc). -- Example binaries should be prefixed with the library name of the binary, so that there are no conflicts between libraries (e.g. `hipcub_device_sum` and `rocprim_device_sum`). +- Example binaries should be prefixed with the library name of the binary, so +that there are no conflicts between libraries (e.g. `hipcub_device_sum` and +`rocprim_device_sum`). 
## Utilities -Utility-functions (printing vectors, etc) and common error-handling code, that is used by all examples, should be moved to the common utility-header [example_utils.hpp](../Common/example_utils.hpp). + +Utility functions (printing vectors, etc.) and common error-handling code that +is used by all examples should be moved to the common utility header +[example_utils.hpp](../Common/example_utils.hpp). ## Error Handling -Error checking and handling should be applied where appropriate, e.g. when handling user input. `HIP_CHECK` should be used whenever possible. Exceptions should only be used if the complexity of the program requires it.
-In most cases printing an explanation to stderr and terminating the program with an error code, as specified in the common header, is sufficient. + +Error checking and handling should be applied where appropriate, e.g. when +handling user input. `HIP_CHECK` should be used whenever possible. Exceptions +should only be used if the complexity of the program requires it.
+In most cases printing an explanation to stderr and terminating the program with +an error code, as specified in the common header, is sufficient. ## Printing Intermediate Results -Results should be printed when they are helpful for the understanding and showcasing the example. However the output shouldn't be overwhelming, printing a vector with hundreds of entries is usually not useful. + +Results should be printed when they are helpful for understanding and +showcasing the example. However, the output shouldn't be overwhelming; printing +a vector with hundreds of entries is usually not useful. ## .gitignore + +A .gitignore file is required in every example subdirectory to exclude the +binary generated when using Make. diff --git a/HIP-Basic/README.md b/HIP-Basic/README.md index 3f79faf0d..26881fadf 100644 --- a/HIP-Basic/README.md +++ b/HIP-Basic/README.md @@ -1,26 +1,33 @@ # HIP-Basic Examples ## Summary + The examples in this subdirectory showcase the functionality of the HIP runtime. The examples build on Linux for the ROCm (AMD GPU) backend. Some examples additionally support Windows, and some additionally support the CUDA (NVIDIA GPU) backend. ## Prerequisites + ### Linux + - [CMake](https://cmake.org/download/) (at least version 3.21) - OR GNU Make - available via the distribution's package manager - [ROCm](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.1.3/page/Overview_of_ROCm_Installation_Methods.html) (at least version 5.x.x) ### Windows + - [Visual Studio](https://visualstudio.microsoft.com/) 2019 or 2022 with the "Desktop Development with C++" workload - ROCm toolchain for Windows (No public release yet) - - The Visual Studio ROCm extension needs to be installed to build with the solution files. + - The Visual Studio ROCm extension needs to be installed to build with the solution files. 
- [CMake](https://cmake.org/download/) (optional, to build with CMake. Requires at least version 3.21) - [Ninja](https://ninja-build.org/) (optional, to build with CMake) ## Building + ### Linux + Make sure that the dependencies are installed, or use one of the [provided Dockerfiles](../../Dockerfiles/) to build and run the examples in a containerized environment. #### Using CMake + All examples in the `HIP-Basic` subdirectory can either be built by a single CMake project or be built independently. - `$ cd Libraries/HIP-Basic` @@ -28,18 +35,22 @@ All examples in the `HIP-Basic` subdirectory can either be built by a single CMa - `$ cmake --build build` #### Using Make + All examples can be built by a single invocation to Make or be built independently. - `$ cd Libraries/HIP-Basic` - `$ make` (on ROCm) or `$ make GPU_RUNTIME=CUDA` (on CUDA, when supported) ### Windows + Not all HIP runtime examples support building on Windows. See the README file in the directory of the example for more details. #### Visual Studio + Visual Studio solution files are available for the individual examples. To build all supported HIP runtime examples open the top level solution file [ROCm-Examples-VS2019.sln](../../ROCm-Examples-VS2019.sln) and filter for HIP-Basic. For more detailed build instructions refer to the top level [README.md](../../README.md#visual-studio). #### CMake + All examples in the `HIP-Basic` subdirectory can either be built by a single CMake project or be built independently. For build instructions refer to the top-level [README.md](../../README.md#cmake-2). diff --git a/HIP-Basic/saxpy/README.md b/HIP-Basic/saxpy/README.md index 7cf124ee5..f46e151eb 100644 --- a/HIP-Basic/saxpy/README.md +++ b/HIP-Basic/saxpy/README.md @@ -1,9 +1,11 @@ # HIP-Basic "SAXPY" Example ## Description + This program demonstrates a simple implementation of the "SAXPY" kernel. The "S" stands for single-precision (i.e. `float`) and "AXPY" stands for the operation performed: $Y_i=aX_i+Y_i$. 
The simple nature of this example makes it an ideal starting point for developers who are just getting introduced to HIP. -### Application flow +### Application flow + 1. A number of constants are defined to control the problem details and the kernel launch parameters. 2. The two input vectors, $X$ and $Y$, are instantiated in host memory. $X$ is filled with an incrementing sequence starting from 1, whereas $Y$ is filled with ones. 3. The necessary amount of device (GPU) memory is allocated and the elements of the input vectors are copied to the device memory. @@ -14,24 +16,29 @@ This program demonstrates a simple implementation of the "SAXPY" kernel. The "S" 8. The first few elements of the result vector are printed to the standard output. ## Key APIs and Concepts + - `hipMalloc` is used to allocate memory in the global memory of the device (GPU). This is usually necessary, since the kernels running on the device cannot access host (CPU) memory (unless it is device-accessible pinned host memory, see `hipHostMalloc`). Beware that the memory returned is uninitialized. - `hipFree` de-allocates device memory allocated by `hipMalloc`. It is necessary to free no longer used memory with this function to avoid resource leakage. - `hipMemcpy` is used to transfer bytes between the host and the device memory in both directions. A call to it synchronizes the device with the host, meaning that all kernels queued before `hipMemcpy` will finish before the copying starts. The function returns once the copying has finished. - `myKernelName<<<params>>>(kernelArguments)` queues the execution of the provided kernel on the device. It is asynchronous; the call may return before the execution of the kernel is finished. Its arguments are as follows: - - The kernel (`__global__`) function to launch. - - The number of blocks in the kernel grid, i.e. the grid size. It can be up to 3 dimensions. - - The number of threads in each block, i.e. the block size. It can be up to 3 dimensions. 
- - The amount of dynamic shared memory provided for the kernel, in bytes. Not used in this example. - - The device stream, on which the kernel is queued. In this example, the default stream is used. - - All further arguments are passed to the kernel function. Notice, that built-in and simple (POD) types may be passed to the kernel, but complex ones (e.g. `std::vector`) usually cannot be. + - The kernel (`__global__`) function to launch. + - The number of blocks in the kernel grid, i.e. the grid size. It can be up to 3 dimensions. + - The number of threads in each block, i.e. the block size. It can be up to 3 dimensions. + - The amount of dynamic shared memory provided for the kernel, in bytes. Not used in this example. + - The device stream, on which the kernel is queued. In this example, the default stream is used. + - All further arguments are passed to the kernel function. Notice that built-in and simple (POD) types may be passed to the kernel, but complex ones (e.g. `std::vector`) usually cannot be. - `hipGetLastError` returns the error code resulting from the previous operation. ## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `threadIdx`, `blockIdx`, `blockDim` #### Host symbols + - `hipMalloc` - `hipFree` - `hipMemcpy` diff --git a/HIP-Basic/shared_memory/README.md b/HIP-Basic/shared_memory/README.md index 7e389b96b..291c78c06 100644 --- a/HIP-Basic/shared_memory/README.md +++ b/HIP-Basic/shared_memory/README.md @@ -1,24 +1,27 @@ # HIP-Basic Shared Memory Example ## Description
-When using the appropriate access pattern, this memory can provide much less latency than local or global memory (nearly as much as registers), making it a much better option in certain cases. If the size of the shared memory to be used is known at compile time, it can be explicitly specified and it is then known as static shared memory. +The shared memory is an on-chip type of memory that is visible to all the threads within the same block, allowing them to communicate by writing and reading data from the same memory space. However, some synchronization among the threads of the block is needed to ensure that all of them have written before trying to access the data. + +When using the appropriate access pattern, this memory can provide much lower latency than local or global memory (nearly as low as registers), making it a much better option in certain cases. If the size of the shared memory to be used is known at compile time, it can be explicitly specified and it is then known as static shared memory. This example implements a simple matrix transpose kernel to showcase how to use static shared memory. -### Application flow +### Application flow + 1. A number of constants are defined for the kernel launch parameters. 2. The input and output matrices are allocated and initialized in host memory. 3. The necessary amount of device memory for the input and output matrices is allocated and the input data is copied to the device. 4. A trace message is printed to the standard output. -5. The GPU kernel is then launched with the previously defined arguments. +5. The GPU kernel is then launched with the previously defined arguments. 6. The transposed matrix is copied back to host memory. 7. All device memory is freed. 8. The expected transposed matrix is calculated with a CPU version of the transpose kernel and the transposed matrix obtained from the kernel execution is then compared with it. The result of the comparison is printed to the standard output. 
## Key APIs and Concepts -- `__shared__` is a variable declaration specifier necessary to allocate shared memory from the device. + +- `__shared__` is a variable declaration specifier necessary to allocate shared memory from the device. - `__syncthreads` synchronizes all the threads within the same block. This synchronization barrier is used to ensure that every thread in a block has finished writing to shared memory before other threads in the block try to access that data. - `hipMalloc` allocates device memory in global memory, and with `hipMemcpy` data bytes can be transferred from host to device (using `hipMemcpyHostToDevice`) or from device to host (using `hipMemcpyDeviceToHost`), among others. - `myKernelName<<<...>>>` queues the execution of a kernel on a device (GPU). @@ -28,16 +31,19 @@ This example implements a simple matrix transpose kernel to showcase how to use ## Demonstrated API Calls ### HIP runtime + - `__global__` - `__shared__` #### Device symbols + - `blockDim` - `blockIdx` - `threadIdx` - `__syncthreads` #### Host symbols + - `hipFree` - `hipGetLastError` - `hipMalloc`
An error message is printed if the results were not as expected and the function returns with an error code. ## Build Process + Compiling a HIP static library that exports device functions must be done in two steps: + 1. First, the source files that make up the library must be compiled to object files. This is done similarly to how an object file is created for a regular source file (using the `-c` flag), except that the additional option `-fgpu-rdc` must be passed: + ```shell hipcc -c -fgpu-rdc -Ilibrary library/library.hip -o library.o ``` + 2. After compiling all library sources into object files, they must be manually bundled into an archive that can act as static library. `hipcc` cannot currently create this archive automatically, hence it must be created manually using `ar`: + ```shell ar rcsD liblibrary.a library.o ``` + After the static device library has been compiled, it can be linked with another HIP program or library. Linking with a static device library is done by placing it on the command line directly, and additionally requires `-fgpu-rdc`. The static library should be placed on the command line _before_ any source files. Source files that use the static library can also be compiled to object files first, in this case they also need to be compiled with `-fgpu-rdc`: + ```shell hipcc -fgpu-rdc liblibrary.a main.hip -o hip_static_device_library ``` + **Note**: static device libraries _must_ be linked with `hipcc`. There is no support yet for linking such libraries with (ROCm-bundled) clang, using CMake, or using Visual Studio. 
## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `blockDim` - `blockIdx` - `threadIdx` - `__device__` - `__global__` + #### Host symbols + - `hipMalloc` - `hipMemcpy` - `hipGetLastError` diff --git a/HIP-Basic/static_host_library/README.md b/HIP-Basic/static_host_library/README.md index 9d90df517..d364020e5 100644 --- a/HIP-Basic/static_host_library/README.md +++ b/HIP-Basic/static_host_library/README.md @@ -1,9 +1,11 @@ # HIP-Basic Host Static Library Example ## Description + This example shows how to create a static library that exports host functions. The library may contain both `__global__` and `__device__` code as well, but in this example only `__host__` functions are exported. The resulting library may be linked with other libraries or programs, which do not necessarily need to be HIP libraries or programs. A static host library appears as a regular library, and is compatible with either hipcc or the native system's linker. When using the system linker, the libraries or applications using the static host library do need to be linked with `libamdhip64`. ### Application flow + 1. The `main` function in `main.cpp` calls the library's sole exported function, `run_test`. This symbol is made visible by including the static library's header file. 2. In `run_test` in `library/library.hip`, a number of constants for the example problem are initialized. 3. A vector with input data is initialized in host memory. It is filled with an incrementing sequence starting from 0. @@ -15,49 +17,67 @@ This example shows how to create a static library that exports host functions. 9. Control flow returns to `main` in `main.cpp`, which exits the program with the value that was returned from `run_test`. ## Build Process + A HIP static host library is built the same as a regular application, except that the additional flag `--emit-static-lib` must be passed to `hipcc`. 
Additionally, the library should be compiled with position independent code enabled: + ```shell hipcc library/library.hip -o liblibrary.a --emit-static-lib -fPIC ``` + Linking the static library with another library or object is done in the same way as a regular library: + ```shell hipcc -llibrary -Ilibrary main.cpp -o hip_static_host_library ``` + Note that when linking the library using the host compiler or linker, such as `g++` or `clang++`, the `amdhip64` library should additionally be linked: + ```shell g++ -L/opt/rocm/lib -llibrary -lamdhip64 -Ilibrary main.cpp -o hip_static_host_library ``` ### CMake + Building a HIP static host library can be done using the CMake `add_library` command: + ```cmake add_library(library_name STATIC library/library.hip) target_include_directories(library_name PUBLIC library) ``` + Note that while the required compilation flags to create a library are passed to the compiler automatically by CMake, position independent code must be turned on manually: + ```cmake set_target_properties(library_name PROPERTIES POSITION_INDEPENDENT_CODE ON) ``` + Linking with the static library is done in the same way as regular libraries. If used via `target_link_libraries`, this automatically adds the `amdhip64` dependency: + ```cmake add_executable(executable_name main.cpp) target_link_libraries(executable_name library_name) ``` ### Visual Studio 2019 + When using Visual Studio 2019 to build a HIP static host library, a separate project can be used to build the static library. This can be set up from scratch by creating a new AMD HIP C++ project, and then converting it to a library by setting `[right click project] -> Properties -> Configuration Properties -> General -> Configuration Type` to `Library`. Linking with a HIP static host library can then be done simply by adding a reference to the corresponding project. 
This can be done under `[right click project] -> Add -> Reference` by checking the checkbox of the library project, and works both for AMD HIP C++ Visual Studio projects (demonstrated in [static_host_library_vs2019.vcxproj](./static_host_library_vs2019.vcxproj)) as well as regular Windows application Visual Studio projects (demonstrated in [static_host_library_msvc_vs2019.vcxproj](./static_host_library_msvc/static_host_library_msvc_vs2019.vcxproj)). ## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `blockDim` - `blockIdx` - `threadIdx` - `__device__` - `__global__` + #### Host symbols + - `hipMalloc` - `hipMemcpy` - `hipGetLastError` diff --git a/HIP-Basic/streams/README.md b/HIP-Basic/streams/README.md index 03a500109..950675c33 100644 --- a/HIP-Basic/streams/README.md +++ b/HIP-Basic/streams/README.md @@ -1,9 +1,11 @@ # HIP-Basic Streams Example ## Description + A stream encapsulates a queue of tasks that are launched on the GPU device. This example showcases usage of multiple streams, each with their own tasks. These tasks include asynchronous memory copies using `hipMemcpyAsync` and asynchronous kernel launches using `myKernelName<<<...>>>`. ### Application flow + 1. Host side input and output memory is allocated using `hipHostMalloc` as pinned memory. It will ensure that the memory copies will be performed asynchronously when using `hipMemcpyAsync`. 2. Host input is instantiated. 3. Device side storage is allocated using `hipMalloc`. @@ -17,10 +19,13 @@ A stream encapsulates a queue of tasks that are launched on the GPU device. This 11. Free host side pinned memory using `hipHostFree`. ## Key APIs and Concepts + A HIP stream allows device tasks to be grouped and launched asynchronously and independently from other tasks, which can be used to hide latencies and increase task completion throughput. When results of a task queued on a particular stream are needed, it can be explicitly synchronized without blocking work queued on other streams. 
Each HIP stream is tied to a particular device, which enables HIP streams to be used to schedule work across multiple devices simultaneously. ## Demonstrated API Calls + ### HIP runtime + - `__shared__` - `__syncthreads` - `hipStream_t` diff --git a/HIP-Basic/texture_management/README.md b/HIP-Basic/texture_management/README.md index 8df3e5de1..105179278 100644 --- a/HIP-Basic/texture_management/README.md +++ b/HIP-Basic/texture_management/README.md @@ -1,9 +1,11 @@ # HIP-Basic Texture Management Example ## Description + This example demonstrates how a kernel may use texture memory through the texture object API. Using texture memory may be beneficial as the texture cache is optimized for 2D spatial locality and exposes features such as hardware filtering. In the example, a texture is created using a device array and is sampled in a kernel to create a histogram of its values. -### Application flow +### Application flow + 1. Check whether texture functions are supported on the device. 2. Initialize the texture data on host side. 3. Specify the channel description of the texture and allocate a device array based on the texture dimensions and channel descriptor. @@ -15,16 +17,20 @@ This example demonstrates how a kernel may use texture memory through the textur 9. Destroy the texture object and release resources. ## Key APIs and Concepts + - The memory for the texture may be a device array `hipArray_t`, which is allocated with `hipMallocArray`. The allocation call requires a channel descriptor `hipChannelFormatDesc` and the dimensions of the texture. The channel descriptor can be created using `hipCreateChannelDesc`. Host data can be transferred to the device array using `hipMemcpy2DToArray`. - The texture object `hipTextureObject_t` is created with `hipCreateTextureObject`, which requires a resource descriptor `hipResourceDesc` and a texture descriptor `hipTextureDesc`. 
The resource descriptor describes the resource used to create the texture, in this example a device array `hipResourceTypeArray`. The texture descriptor describes the properties of the texture, such as its addressing mode and whether it uses normalized coordinates. - The created texture object can be sampled in a kernel using `tex2D`. - The texture object is cleaned up by calling `hipDestroyTextureObject` and the device array is cleaned up by calling `hipFreeArray`. ## Demonstrated API Calls + ### HIP runtime + - `__global__` #### Device symbols + - `atomicAdd` - `blockDim` - `blockIdx` @@ -32,6 +38,7 @@ This example demonstrates how a kernel may use texture memory through the textur - `threadIdx` #### Host symbols + - `hipArray_t` - `hipAddressModeWrap` - `hipChannelFormatDesc` diff --git a/HIP-Basic/vulkan_interop/README.md b/HIP-Basic/vulkan_interop/README.md index d177dfa67..094d74fdf 100644 --- a/HIP-Basic/vulkan_interop/README.md +++ b/HIP-Basic/vulkan_interop/README.md @@ -1,10 +1,13 @@ # HIP-Basic Vulkan Interop Example ## Description + External device resources and other handles can be shared with HIP in order to provide interoperability between different GPU APIs. This example showcases a HIP program that interacts with the Vulkan API: A HIP kernel is used to simulate a sine wave over a grid of points, in a buffer that is shared with Vulkan. The resulting data is then rendered to a window using the Vulkan API. A set of shared semaphores is used to guarantee synchronous access to the device memory shared between HIP and Vulkan. ### Application flow + #### Initialization + 1. A window is opened using the GLFW library. 2. The Vulkan API is initialized: Function pointers are loaded, the Vulkan instance is created. 3. A physical device is picked to execute the example kernel on and to render the result to the window. This physical device must be the same for HIP and for Vulkan in order to be able to share the required resources. 
This is done by comparing the device's UUID, which can be obtained from a HIP device with `hipDeviceGetUuid` and from a Vulkan physical device by passing `VkPhysicalDeviceIDProperties` to `vkGetPhysicalDeviceProperties2`. If the UUIDs from a particular HIP device and Vulkan device are the same, they represent the same physical or virtual device. @@ -20,7 +23,9 @@ 13. The Vulkan semaphores are converted to HIP external semaphores. This is done by first exporting a Vulkan semaphore handle to a native semaphore handle, either by `vkGetSemaphoreFdKHR` or `vkGetSemaphoreWin32HandleKHR` depending on the target platform. The resulting handle is passed to `hipImportExternalSemaphore` to obtain the HIP semaphore handle. #### Rendering + A frame is rendered as follows: + 1. The frame resources for the current frame in the frame pipeline are fetched from memory. 2. The next image index is acquired from the swapchain. 3. The command pool associated with the current frame is reset and the associated command buffer is initialized. @@ -32,6 +37,7 @@ A frame is rendered as follows: 9. The swapchain is asked to present the current frame to the screen. ## Key APIs and Concepts + To share memory allocated by Vulkan with HIP, the `VkDeviceMemory` must be created by passing the `VkExportMemoryAllocateInfoKHR` structure to `vkAllocateMemory`. This structure needs the appropriate `handleTypes` set to a type that can be shared with HIP for the current platform; `VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT_KHR` for Linux and `VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT_KHR` or `VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_KMT_BIT_KHR` for Windows. Any Vulkan buffer that is to be associated with this device memory must similarly be created by passing `VkExternalMemoryBufferCreateInfoKHR` to `vkCreateBuffer`, of which the `handleTypes` member must be initialized to the same value.
The `VkDeviceMemory` handle can then be exported to a native file descriptor or `HANDLE` using `vkGetMemoryFdKHR` or `vkGetMemoryWin32HandleKHR` on Linux and Windows, respectively. A `hipExternalMemory_t` can then be imported from a native handle through `hipImportExternalMemory`. This function must be passed an instance of `hipExternalMemoryHandleDesc`, of which `type` is initialized with a handle type compatible with the Vulkan `handleTypes`. This mapping is as follows: | Vulkan memory handle type | HIP memory handle type | | --------------------------------------------------------- | ------------------------------------------- | @@ -53,68 +59,86 @@ To wait on a shared semaphore in HIP, `hipWaitExternalSemaphoresAsync` should be To signal a shared semaphore in HIP, the `hipSignalExternalSemaphoresAsync` function can be used. This must be passed a number of `hipExternalSemaphoreSignalParams` structures, each corresponding to a semaphore with the same index. When using timeline semaphores, each structure's `fence.value` member should be set to the value to which the corresponding semaphore should be set. ## Dependencies + This example has additional library dependencies besides HIP: + - [GLFW](https://glfw.org). There are three options for getting this dependency satisfied: - 1. Install it through a package manager. Available for Linux, where GLFW can be installed from some of the usual package managers: - - APT: `apt-get install libglfw3-dev` - - Pacman: `pacman -S glfw-x11` or `pacman -S glfw-wayland` - - DNF: `dnf install glfw-devel` - - It could also happen that the `Xxf68vm` and `Xi` libraries required when linking against Vulkan are not installed. They can be found as well on the previous package managers: - - APT: `apt-get install libxxf86vm-dev libxi-dev` - - Pacman: `pacman -S libxi libxxf86vm` - - DNF: `dnf install libXi-devel libXxf86vm-devel` - 2. Build from source.
GLFW supports compilation on Windows with Visual C++ (2010 and later), MinGW and MinGW-w64 and on Linux and other Unix-like systems with GCC and Clang. Please refer to the [compile guide](https://www.glfw.org/docs/latest/compile.html) for a complete guide on how to do this. Note: not only it should be built as explained in the guide, but it is additionally needed to build with the install target (`cmake --build --target install`). - 3. Get the pre-compiled binaries from its [download page](https://www.glfw.org/download). Available for Windows. - - Depending on the build tool used, some extra steps may be needed: - - If using CMake, the `glfw3Config.cmake` and `glfw3Targets.cmake` files must be in a path that CMake searches by default or must be passed using `-DCMAKE_MODULE_PATH`. The official GLFW3 binaries do not ship these files on Windows, and so GLFW must either be compiled manually or obtained from [vcpkg](https://vcpkg.io/), which does ship the required cmake files. - - If the former approach is selected, CMake will be able to find GLFW on Windows if the environment variable `GLFW3_DIR` (or the cmake option `-DCMAKE_PREFIX_PATH`) is set to (contain) the folder owning `glfw3Config.cmake` and `glfw3Targets.cmake`. For instance, if GLFW was installed in `C:\Program Files(x86)\GLFW\`, this will most surely be something like `C:\Program Files (x86)\GLFW\lib\cmake\glfw3\`. - - If the latter, the vcpkg toolchain path should be passed to CMake using `-DCMAKE_TOOLCHAIN_FILE="/path/to/vcpkg/scripts/buildsystems/vcpkg.cmake"`. - - If using Visual Studio, the easiest way to obtain GLFW is by installing `glfw3` from vcpkg. Alternatively, the appropriate path to the GLFW3 library and header directories can be set in `Properties->Linker->General->Additional Library Directories` and `Properties->C/C++->General->Additional Include Directories`. 
When using this method, the appropriate name for the GLFW library should also be updated under `Properties->C/C++->Linker->Input->Additional Dependencies`. For instance, if the path to the root folder of the Windows binaries installation was `C:\glfw-3.3.8.bin.WIN64\` and we set `GLFW_DIR` with this path, the project configuration file (`.vcxproj`) should end up containing something similar to the following: - ``` - - - ... - $(GLFW_DIR)\include\;;%(AdditionalIncludeDirectories) - ... - - - ... - glfw3dll.lib;;%(AdditionalDependencies) - $(GLFW_DIR)\lib; - ... - - - ``` + + 1. Install it through a package manager. Available for Linux, where GLFW can be installed from some of the usual package managers: + - APT: `apt-get install libglfw3-dev` + - Pacman: `pacman -S glfw-x11` or `pacman -S glfw-wayland` + - DNF: `dnf install glfw-devel` + + It could also happen that the `Xxf86vm` and `Xi` libraries required when linking against Vulkan are not installed. They can also be found in the previously mentioned package managers: + - APT: `apt-get install libxxf86vm-dev libxi-dev` + - Pacman: `pacman -S libxi libxxf86vm` + - DNF: `dnf install libXi-devel libXxf86vm-devel` + + 2. Build from source. GLFW supports compilation on Windows with Visual C++ (2010 and later), MinGW and MinGW-w64 and on Linux and other Unix-like systems with GCC and Clang. Please refer to the [compile guide](https://www.glfw.org/docs/latest/compile.html) for a complete guide on how to do this. Note: it should not only be built as explained in the guide, but additionally installed with the install target (`cmake --build --target install`). + + 3. Get the pre-compiled binaries from its [download page](https://www.glfw.org/download). Available for Windows. + + Depending on the build tool used, some extra steps may be needed: + + - If using CMake, the `glfw3Config.cmake` and `glfw3Targets.cmake` files must be in a path that CMake searches by default or must be passed using `-DCMAKE_MODULE_PATH`.
The official GLFW3 binaries do not ship these files on Windows, and so GLFW must either be compiled manually or obtained from [vcpkg](https://vcpkg.io/), which does ship the required cmake files. + + - If the former approach is selected, CMake will be able to find GLFW on Windows if the environment variable `GLFW3_DIR` (or the cmake option `-DCMAKE_PREFIX_PATH`) is set to (contain) the folder owning `glfw3Config.cmake` and `glfw3Targets.cmake`. For instance, if GLFW was installed in `C:\Program Files (x86)\GLFW\`, this will most likely be something like `C:\Program Files (x86)\GLFW\lib\cmake\glfw3\`. + - If the latter, the vcpkg toolchain path should be passed to CMake using `-DCMAKE_TOOLCHAIN_FILE="/path/to/vcpkg/scripts/buildsystems/vcpkg.cmake"`. + + - If using Visual Studio, the easiest way to obtain GLFW is by installing `glfw3` from vcpkg. Alternatively, the appropriate path to the GLFW3 library and header directories can be set in `Properties->Linker->General->Additional Library Directories` and `Properties->C/C++->General->Additional Include Directories`. When using this method, the appropriate name for the GLFW library should also be updated under `Properties->Linker->Input->Additional Dependencies`. For instance, if the path to the root folder of the Windows binaries installation was `C:\glfw-3.3.8.bin.WIN64\` and we set `GLFW_DIR` to this path, the project configuration file (`.vcxproj`) should end up containing something similar to the following: + + ```xml + + + ... + $(GLFW_DIR)\include\;;%(AdditionalIncludeDirectories) + ... + + + ... + glfw3dll.lib;;%(AdditionalDependencies) + $(GLFW_DIR)\lib; + ... + + + ``` + - Vulkan headers.
On Linux, the Vulkan headers can be directly obtained from some package managers: - - Linux - - APT: `apt-get install -y libvulkan-dev` - - Pacman: `pacman -S vulkan-headers vulkan-icd-loader` - - DNF: `dnf install vulkan-headers vulkan-icd-loader` - But they may be as well obtained by installing the [LunarG Vulkan SDK](https://vulkan.lunarg.com/). CMake will be able to find the SDK using the `VULKAN_SDK` environment variable, which is set by default using the SDK activation script. + - Linux + + - APT: `apt-get install -y libvulkan-dev` + - Pacman: `pacman -S vulkan-headers vulkan-icd-loader` + - DNF: `dnf install vulkan-headers vulkan-icd-loader` + + They may also be obtained by installing the [LunarG Vulkan SDK](https://vulkan.lunarg.com/). CMake will be able to find the SDK using the `VULKAN_SDK` environment variable, which is set by default by the SDK activation script. - On Windows, on the other hand, the headers can only be obtained from the [LunarG Vulkan SDK](https://vulkan.lunarg.com/). Contrary to Unix-based OSs, the `VULKAN_SDK` environment variable is not automatically provided on Windows, and so it should be set to the appropriate path before invoking CMake. + On Windows, on the other hand, the headers can only be obtained from the [LunarG Vulkan SDK](https://vulkan.lunarg.com/). Contrary to Unix-based OSs, the `VULKAN_SDK` environment variable is not automatically provided on Windows, and so it should be set to the appropriate path before invoking CMake. - Note that `libvulkan` is _not_ required, as the example loads function pointers dynamically. + Note that `libvulkan` is _not_ required, as the example loads function pointers dynamically. - Validation layers.
The `VK_LAYER_KHRONOS_validation` layer is active by default to perform general checks on Vulkan, thus the [Khronos' Vulkan Validation Layers (VVL)](https://github.com/KhronosGroup/Vulkan-ValidationLayers/tree/main#vulkan-validation-layers-vvl) will need to be installed on the system if such checks are desirable. It can be either installed from a package manager (on Linux), built and configured from source or installed as part of the [LunarG Vulkan SDK](https://vulkan.lunarg.com/). -Package managers offering the validation layers package include: - - APT: `apt install vulkan-validationlayers-dev` - - Pacman: `pacman -S vulkan-validation-layers`. Note that with pacman both the validation layers and headers (among others) can be also installed with `pacman -S vulkan-devel`. - - DNF: `dnf install vulkan-validation-layers` - For the second approach, build instructions are provided on [Khronos Vulkan-ValidationLayers repository](https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/BUILD.md) and Vulkan's [Layers Overwiew and Configuration](https://vulkan.lunarg.com/doc/view/latest/windows/layer_configuration.html) document offers several approaches for its configuration. + Package managers offering the validation layers package include: + + - APT: `apt install vulkan-validationlayers-dev` + - Pacman: `pacman -S vulkan-validation-layers`. Note that with pacman both the validation layers and headers (among others) can also be installed with `pacman -S vulkan-devel`. + - DNF: `dnf install vulkan-validation-layers` + + For the second approach, build instructions are provided in the [Khronos Vulkan-ValidationLayers repository](https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/BUILD.md) and Vulkan's [Layers Overview and Configuration](https://vulkan.lunarg.com/doc/view/latest/windows/layer_configuration.html) document offers several approaches for their configuration. - `glslangValidator`. It is used in the example as a shader validation tool.
It may be installed via package manager (`sudo apt install glslang-tools`), by [building manually from source](https://github.com/KhronosGroup/glslang#building-cmake), by downloading the binaries for the corresponding platform directly from the [main-tot](https://github.com/KhronosGroup/glslang/releases/tag/main-tot) release on GitHub or installed as part of the [LunarG Vulkan SDK](https://vulkan.lunarg.com/). ## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `threadIdx`, `blockIdx`, `blockDim` #### Host symbols + - `hipComputeModeProhibited` - `hipCUDAErrorTohipError` - `hipDestroyExternalMemory` diff --git a/HIP-Basic/warp_shuffle/README.md b/HIP-Basic/warp_shuffle/README.md index a8bfa2601..2c36d5ce6 100644 --- a/HIP-Basic/warp_shuffle/README.md +++ b/HIP-Basic/warp_shuffle/README.md @@ -1,6 +1,7 @@ # HIP-Basic Warp Shuffle Example ## Description + Kernel code for a particular block is executed in groups of threads known as a _wavefronts_ (AMD) or _warps_ (NVIDIA). Each block is is divided into as many warps as the block's size allows. If the block size is less than the warp size, then part of the warp just stays idle (as happens in this example). AMD GPUs use 64 threads per wavefront for architectures prior to RDNA™ 1. RDNA architectures support both 32 and 64 wavefront sizes. Warps are executed in _lockstep_, i.e. all the threads in each warp execute the same instruction at the same time but with different data. This type of parallel processing is also known as Single Instruction, Multiple Data (SIMD). A block contains several warps and the warp size is dependent on the architecture, but the block size is not. Blocks and warps also differ in the way they are executed, and thus they may provide different results when used in the same piece of code. For instance, the kernel code of this example would not work as it is with block execution and shared memory access e.g. 
because some synchronization would be needed to ensure that every thread has written its corresponding value before trying to access it. @@ -10,6 +11,7 @@ Higher performance in the execution of kernels can be achieved with explicit war This example showcases how to use the above-mentioned operations by implementing a simple matrix transpose kernel. ### Application flow + 1. A number of constants are defined for the kernel launch parameters. 2. The input and output matrices are allocated and initialized in host memory. 3. The necessary amount of device memory for the input and output matrices is allocated and the input data is copied to the device. @@ -20,6 +22,7 @@ This example showcases how to use the above-mentioned operations by implementing 8. The expected transposed matrix is calculated with a CPU version of the transpose kernel and the transposed matrix obtained from the kernel execution is then compared with it. The result of the comparison is printed to the standard output. ## Key APIs and Concepts + Warp shuffle is a warp-level primitive that allows communication between the threads of a warp. Below is a simple example that shows how the value of the thread with index 2 is copied to all other threads within the warp. ![warp_shuffle_simple.svg](warp_shuffle_simple.svg) @@ -38,11 +41,13 @@ Warp shuffle is a warp-level primitive that allows for the communication between ### HIP runtime #### Device symbols + - `__global__` - `threadIdx` - `__shfl` #### Host symbols + - `hipFree` - `hipGetDeviceProperties` - `hipGetLastError` diff --git a/LICENSE.md b/LICENSE.md index cd4c8d43d..b1db69f2e 100644 --- a/LICENSE.md +++ b/LICENSE.md @@ -6,4 +6,4 @@ Permission is hereby granted, free of charge, to any person obtaining a copy of The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. \ No newline at end of file +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. diff --git a/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md b/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md index c0b9c77f5..cb8d31c44 100644 --- a/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md +++ b/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md @@ -3,7 +3,6 @@ Consider the following simple and short demo of using the Address Sanitizer with a HIP application: ```C++ - #include #include @@ -48,8 +47,8 @@ Switching to `--offload-arch=gfx90a:xnack+` in the command above results in a warning-free compilation and an instrumented application. 
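For reference, the environment setup referred to in the next step might look like the following sketch. The ROCm prefix and clang version here are illustrative (taken from the `ldd` output shown later) and should be adjusted to the actual installation:

```shell
# Illustrative environment setup for running the ASAN-instrumented binary.
# Substitute the real ROCm install prefix and clang resource directory version.
ROCM_PATH=/opt/rocm-5.7.0-99999
export PATH="$ROCM_PATH/llvm/bin:$PATH"  # so llvm-symbolizer is found
export LD_LIBRARY_PATH="$ROCM_PATH/llvm/lib/clang/17.0.0/lib/linux:$LD_LIBRARY_PATH"
export HSA_XNACK=1                       # XNACK is required for device-side ASAN
```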
After setting `PATH`, `LD_LIBRARY_PATH` and `HSA_XNACK` as described earlier, a check of the binary with `ldd` yields -``` +```shell $ ldd mini linux-vdso.so.1 (0x00007ffd1a5ae000) libclang_rt.asan-x86_64.so => /opt/rocm-5.7.0-99999/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so (0x00007fb9c14b6000) @@ -75,20 +74,16 @@ $ ldd mini This confirms that the address sanitizer runtime is linked in, and the ASAN instrumented version of the runtime libraries are used. Checking the `PATH` yields -``` - +```shell $ which llvm-symbolizer /opt/rocm-5.7.0-99999/llvm/bin/llvm-symbolizer - ``` Lastly, a check of the OS kernel version yields -``` - +```shell $ uname -rv 5.15.0-73-generic #80~20.04.1-Ubuntu SMP Wed May 17 14:58:14 UTC 2023 - ``` which indicates that the required HMM support (kernel version > 5.6) is available. @@ -96,8 +91,7 @@ This completes the necessary setup. Running with `m = 100`, `n1 = 11`, `n2 = 10` and `c = 100` should produce a report for an invalid access by the last 10 threads. -``` - +```gdb ================================================================= ==3141==ERROR: AddressSanitizer: heap-buffer-overflow on amdgpu device 0 at pc 0x7fb1410d2cc4 WRITE of size 4 in workgroup id (10,0,0) @@ -129,13 +123,11 @@ Shadow byte legend (one shadow byte represents 8 application bytes): Heap left redzone: fa ... ==3141==ABORTING - ``` Running with `m = 100`, `n1 = 10`, `n2 = 10` and `c = 99` should produce a report for an invalid copy. -``` - +```gdb ================================================================= ==2817==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x514000150dcc at pc 0x7f5509551aca bp 0x7ffc90a7ae50 sp 0x7ffc90a7a610 WRITE of size 400 at 0x514000150dcc thread T0 @@ -167,5 +159,4 @@ Shadow byte legend (one shadow byte represents 8 application bytes): Heap left redzone: fa ... 
==2817==ABORTING - ``` diff --git a/Libraries/hipBLAS/README.md b/Libraries/hipBLAS/README.md index 26c1bb399..330d58634 100644 --- a/Libraries/hipBLAS/README.md +++ b/Libraries/hipBLAS/README.md @@ -1,30 +1,36 @@ # hipBLAS Examples ## Summary + The examples in this subdirectory showcase the functionality of the [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS) library. The examples build on both Linux and Windows for the ROCm (AMD GPU) backend. ## Prerequisites + ### Linux + - [CMake](https://cmake.org/download/) (at least version 3.21) - OR GNU Make - available via the distribution's package manager - [ROCm](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.2/page/Overview_of_ROCm_Installation_Methods.html) (at least version 5.x.x) - [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS): `hipblas` package available from [repo.radeon.com](https://repo.radeon.com/rocm/). - ### Windows + - [Visual Studio](https://visualstudio.microsoft.com/) 2019 or 2022 with the "Desktop Development with C++" workload - ROCm toolchain for Windows (No public release yet) - The Visual Studio ROCm extension needs to be installed to build with the solution files. - [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS) - - Installed as part of the ROCm SDK on Windows for ROCm platform. + - Installed as part of the ROCm SDK on Windows for ROCm platform. - [CMake](https://cmake.org/download/) (optional, to build with CMake. Requires at least version 3.21) - [Ninja](https://ninja-build.org/) (optional, to build with CMake) ## Building + ### Linux + Make sure that the dependencies are installed, or use the [provided Dockerfiles](../../Dockerfiles/) to build and run the examples in a containerized environment that has all prerequisites installed. #### Using CMake + All examples in the `hipBLAS` subdirectory can either be built by a single CMake project or be built independently. 
- `$ cd Libraries/hipBLAS` @@ -32,16 +38,20 @@ All examples in the `hipBLAS` subdirectory can either be built by a single CMake - `$ cmake --build build` #### Using Make + All examples can be built by a single invocation to Make or be built independently. - `$ cd Libraries/hipBLAS` - `$ make` ### Windows + #### Visual Studio + Visual Studio solution files are available for the individual examples. To build all examples for hipBLAS open the top level solution file [ROCm-Examples-VS2019.sln](../../ROCm-Examples-VS2019.sln) and filter for hipBLAS. For more detailed build instructions refer to the top level [README.md](../../README.md#visual-studio). #### CMake + All examples in the `hipBLAS` subdirectory can either be built by a single CMake project or be built independently. For build instructions refer to the top-level [README.md](../../README.md#cmake-2). diff --git a/Libraries/hipBLAS/gemm_strided_batched/README.md b/Libraries/hipBLAS/gemm_strided_batched/README.md index b7bdc8eef..094bd2e78 100644 --- a/Libraries/hipBLAS/gemm_strided_batched/README.md +++ b/Libraries/hipBLAS/gemm_strided_batched/README.md @@ -1,6 +1,7 @@ # hipBLAS Level 3 Generalized Matrix Multiplication Strided Batched Example ## Description + This example illustrates the use of the hipBLAS Level 3 Strided Batched General Matrix Multiplication. The hipBLAS GEMM STRIDED BATCHED performs a matrix--matrix operation for a _batch_ of matrices as: $C[i] = \alpha \cdot A[i]' \cdot B[i]' + \beta \cdot (C[i])$ @@ -14,8 +15,8 @@ In this example the identity is used. $\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are the batches of matrices. For each $i$, $A[i]$, $B[i]$ and $C[i]$ are matrices such that $A_i'$ is an $m \times k$ matrix, $B_i'$ a $k \times n$ matrix and $C_i$ an $m \times n$ matrix. - ### Application flow + 1. Read in command-line parameters. 2. Set dimension variables of the matrices and get the batch count. 3. Allocate and initialize the host matrices. 
Set up the $B$ matrix as an identity matrix. @@ -30,6 +31,7 @@ $A_i'$ is an $m \times k$ matrix, $B_i'$ a $k \times n$ matrix and $C_i$ an $m \ 12. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: - `-a` or `--alpha`. The scalar value $\alpha$ used in the GEMM operation. Its default value is 1. - `-b` or `--beta`. The scalar value $\beta$ used in the GEMM operation. Its default value is 1. @@ -39,49 +41,53 @@ The application provides the following optional command line arguments: - `-k` or `--k`. The number of columns of matrix $A$ and rows of matrix $B$, which must be greater than 0. Its default value is 5. ## Key APIs and Concepts + - The performance of a numerical multi-linear algebra code can be heavily increased by using tensor contractions [ [Y. Shi et al., HiPC, pp 193, 2016.](https://doi.org/10.1109/HiPC.2016.031) ], and therefore most of the hipBLAS functions have `_batched` and `_strided_batched` [ [C. Jhurani and P. Mullowney, JPDP Vol 75, pp 133, 2015.](https://doi.org/10.1016/j.jpdc.2014.09.003) ] extensions.
We can apply the same multiplication operator for several matrices if we combine them into batched matrices. Batched matrix multiplication yields a performance improvement for a large number of small matrices. For a constant stride between matrices, further acceleration is available with strided batched GEMM. - hipBLAS is initialized by calling `hipblasCreate(hipblasHandle_t*)` and it is terminated by calling `hipblasDestroy(hipblasHandle_t)`. - The _pointer mode_ controls whether scalar parameters must be allocated on the host (`HIPBLAS_POINTER_MODE_HOST`) or on the device (`HIPBLAS_POINTER_MODE_DEVICE`). It is controlled by `hipblasSetPointerMode`. - The symbol $X'$ denotes the following operations, as defined in the Description section: - - `HIPBLAS_OP_N`: identity operator ($X' = X$), - - `HIPBLAS_OP_T`: transpose operator ($X' = X^T$) or - - `HIPBLAS_OP_C`: Hermitian (conjugate transpose) operator ($X' = X^H$). + + - `HIPBLAS_OP_N`: identity operator ($X' = X$), + - `HIPBLAS_OP_T`: transpose operator ($X' = X^T$) or + - `HIPBLAS_OP_C`: Hermitian (conjugate transpose) operator ($X' = X^H$). - `hipblasStride`: stride between matrices or vectors in `_strided_batched` functions. - `hipblas[HSDCZ]gemmStridedBatched` Depending on the character matched in `[HSDCZ]`, the operation can be performed with different precisions: - - `H`(half-precision: `hipblasHalf`) - - `S` (single-precision: `float`) - - `D` (double-precision: `double`) - - `C` (single-precision complex: `hipblasComplex`) - - `Z` (double-precision complex: `hipblasDoubleComplex`).
- - Input parameters for `hipblasSgemmStridedBatched`: - - `hipblasHandle_t handle` - - `hipblasOperation_t trans_a`: transformation operator on each $A_i$ matrix - - `hipblasOperation_t trans_b`: transformation operator on each $B_i$ matrix - - `int m`: number of rows in each $A_i'$ and $C$ matrices - - `int n`: number of columns in each $B_i'$ and $C$ matrices - - `int k`: number of columns in each $A_i'$ matrix and number of rows in each $B_i'$ matrix - - `const float *alpha`: scalar multiplier of each $C_i$ matrix addition - - `const float *A`: pointer to the each $A_i$ matrix - - `int lda`: leading dimension of each $A_i$ matrix - - `long long stride_a`: stride size for each $A_i$ matrix - - `const float *B`: pointer to each $B_i$ matrix - - `int ldb`: leading dimension of each $B_i$ matrix - - `const float *beta`: scalar multiplier of the $B \cdot C$ matrix product - - `long long stride_b`: stride size for each $B_i$ matrix - - `float *C`: pointer to each $C_i$ matrix - - `int ldc`: leading dimension of each $C_i$ matrix - - `long long stride_c`: stride size for each $C_i$ matrix - - `int batch_count`: number of matrices - - Return value: `hipblasStatus_t ` + - `H` (half-precision: `hipblasHalf`) + - `S` (single-precision: `float`) + - `D` (double-precision: `double`) + - `C` (single-precision complex: `hipblasComplex`) + - `Z` (double-precision complex: `hipblasDoubleComplex`).
+ + Input parameters for `hipblasSgemmStridedBatched`: + + - `hipblasHandle_t handle` + - `hipblasOperation_t trans_a`: transformation operator on each $A_i$ matrix + - `hipblasOperation_t trans_b`: transformation operator on each $B_i$ matrix + - `int m`: number of rows in each $A_i'$ and $C_i$ matrix + - `int n`: number of columns in each $B_i'$ and $C_i$ matrix + - `int k`: number of columns in each $A_i'$ matrix and number of rows in each $B_i'$ matrix + - `const float *alpha`: scalar multiplier of each $A_i' \cdot B_i'$ matrix product + - `const float *A`: pointer to each $A_i$ matrix + - `int lda`: leading dimension of each $A_i$ matrix + - `long long stride_a`: stride size for each $A_i$ matrix + - `const float *B`: pointer to each $B_i$ matrix + - `int ldb`: leading dimension of each $B_i$ matrix + - `long long stride_b`: stride size for each $B_i$ matrix + - `const float *beta`: scalar multiplier of each $C_i$ matrix + - `float *C`: pointer to each $C_i$ matrix + - `int ldc`: leading dimension of each $C_i$ matrix + - `long long stride_c`: stride size for each $C_i$ matrix + - `int batch_count`: number of matrices + + Return value: `hipblasStatus_t` ## Demonstrated API Calls ### hipBLAS + - `hipblasCreate` - `hipblasDestroy` - `hipblasHandle_t` @@ -93,6 +99,7 @@ We can apply the same multiplication operator for several matrices if we combine - `HIPBLAS_POINTER_MODE_HOST` ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/hipBLAS/her/README.md index 99ef1fff1..986c8a171 100644 --- a/Libraries/hipBLAS/her/README.md +++ b/Libraries/hipBLAS/her/README.md @@ -1,6 +1,7 @@ # hipBLAS Level 2 Hermitian Rank-2 Update Example ## Description +
The hipBLAS HER2 function performs a Hermitian rank-2 update operation, which is defined as follows: $A = A + \alpha\cdot x\cdot y^H + \bar\alpha \cdot y \cdot x^H$, @@ -8,36 +9,40 @@ $A = A + \alpha\cdot x\cdot y^H + \bar\alpha \cdot y \cdot x^H$, where $A$ is an $n \times n$ Hermitian complex matrix, $x$ and $y$ are complex vectors of $n$ elements, $\alpha$ is a complex scalar and $v^H$ is the _Hermitian transpose_ of a vector $v \in \mathbb{C}^n$. ### Application flow + 1. Read in command-line parameters. 2. Allocate and initialize the host vectors and matrix. 3. Compute CPU reference result. 4. Create a hipBLAS handle. 5. Allocate and initialize the device vectors and matrix. 6. Copy input vectors and matrix from host to device. -6. Invoke the hipBLAS HER2 function. -7. Copy the result from device to host. -8. Destroy the hipBLAS handle and release device memory. -9. Validate the output by comparing it to the CPU reference result. +7. Invoke the hipBLAS HER2 function. +8. Copy the result from device to host. +9. Destroy the hipBLAS handle and release device memory. +10. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: + - `-a` or `--alpha`. The scalar value $\alpha$ used in the HER2 operation. Its default value is 1. - `-x` or `--incx`. The stride between consecutive values in the data array that makes up vector $x$, which must be greater than 0. Its default value is 1. - `-y` or `--incy`. The stride between consecutive values in the data array that makes up vector $y$, which must be greater than 0. Its default value is 1. - `-n` or `--n`. The dimension of matrix $A$ and vectors $x$ and $y$, which must be greater than 0. Its default value is 5. ## Key APIs and Concepts + - hipBLAS is initialized by calling `hipblasCreate(hipblasHandle_t*)` and it is terminated by calling `hipblasDestroy(hipblasHandle_t)`. 
- The _pointer mode_ controls whether scalar parameters must be allocated on the host (`HIPBLAS_POINTER_MODE_HOST`) or on the device (`HIPBLAS_POINTER_MODE_DEVICE`). It is set with `hipblasSetPointerMode`.
- `hipblasSetVector` and `hipblasSetMatrix` are helper functions provided by the hipBLAS API for writing data to the GPU, whereas `hipblasGetVector` and `hipblasGetMatrix` are intended for retrieving data from the GPU. Note that `hipMemcpy` can also be used to copy/get data to/from the GPU in the usual way.
- `hipblas[CZ]her2(handle, uplo, n, *alpha, *x, incx, *y, incy, *AP, lda)` computes a Hermitian rank-2 update. The character matched in `[CZ]` denotes the data type of the operation, and can be either `C` (complex float: `hipblasComplex`) or `Z` (complex double: `hipblasDoubleComplex`). The required arguments are as follows:
- - `handle`, the hipBLAS API handle.
- - `uplo`. Because a Hermitian matrix is symmetric over the diagonal, except that the values in the upper triangle are the complex conjugate of the values in the lower triangle, the required work can be reduced by only updating a single half of the matrix. The part of the matrix to update is given by `uplo`: `HIPBLAS_FILL_MODE_UPPER` (used in this example) indicates that the upper triangle of $A$ should be updated, `HIPBLAS_FILL_MODE_LOWER` indicates that the lower triangle of $A$ should be updated and `HIPBLAS_FILL_MODE_FULL` indicates that the full matrix will be updated.
- - `n` gives the dimensions of the vector and matrix inputs.
- - `alpha` is the complex scalar.
- - `x` and `y` are the input vectors, and `incx` and `incy` are the increments in elements between items of $x$ and $y$, respectively.
- - `AP` is the device pointer to matrix $A$ in device memory.
- - `lda` is the _leading dimension_ of $A$, that is, the number of elements between the starts of the columns of $A$. Note that hipBLAS matrices are laid out in _column major_ ordering.
+ - `handle`, the hipBLAS API handle.
+ - `uplo`. Because a Hermitian matrix is symmetric over the diagonal, except that the values in the upper triangle are the complex conjugates of the values in the lower triangle, the required work can be reduced by only updating a single half of the matrix. The part of the matrix to update is given by `uplo`: `HIPBLAS_FILL_MODE_UPPER` (used in this example) indicates that the upper triangle of $A$ should be updated, `HIPBLAS_FILL_MODE_LOWER` indicates that the lower triangle of $A$ should be updated and `HIPBLAS_FILL_MODE_FULL` indicates that the full matrix will be updated.
+ - `n` gives the dimensions of the vector and matrix inputs.
+ - `alpha` is the complex scalar.
+ - `x` and `y` are the input vectors, and `incx` and `incy` are the increments in elements between items of $x$ and $y$, respectively.
+ - `AP` is the device pointer to matrix $A$ in device memory.
+ - `lda` is the _leading dimension_ of $A$, that is, the number of elements between the starts of the columns of $A$. Note that hipBLAS matrices are laid out in _column-major_ ordering.
- If `ROCM_MATHLIBS_API_USE_HIP_COMPLEX` is defined (adding `#define ROCM_MATHLIBS_API_USE_HIP_COMPLEX` before `#include `), the hipBLAS API is exposed as using the HIP-defined complex types. That is, `hipblasComplex` is a typedef of `hipFloatComplex` (also named `hipComplex`) and they can be used equivalently.
- `hipFloatComplex` and `std::complex<float>` have compatible memory layouts, and performing a memory copy between values of these types will correctly perform the expected copy.
@@ -46,6 +51,7 @@ The application provides the following optional command line arguments:

## Demonstrated API Calls

### hipBLAS
+
- `HIPBLAS_FILL_MODE_UPPER`
- `HIPBLAS_POINTER_MODE_HOST`
- `hipblasCher2`
@@ -59,6 +65,7 @@ The application provides the following optional command line arguments:
- `hipblasSetVector`

### HIP runtime
+
- `ROCM_MATHLIBS_API_USE_HIP_COMPLEX`
- `hipCaddf`
- `hipFloatComplex`
diff --git a/Libraries/hipBLAS/scal/README.md b/Libraries/hipBLAS/scal/README.md
index 4ed540dbe..401a643ff 100644
--- a/Libraries/hipBLAS/scal/README.md
+++ b/Libraries/hipBLAS/scal/README.md
@@ -1,9 +1,11 @@
# hipBLAS Level 1 Scal Example

## Description
+
This example showcases the usage of hipBLAS' Level 1 SCAL function. The Level 1 API defines operations between vectors. SCAL is a scaling operator for an $x$ vector, defined as $x_i := \alpha \cdot x_i$.

-### Application flow
+### Application flow
+
1. Read in and parse command line parameters.
2. Allocate and initialize host vector.
3. Compute CPU reference result.
@@ -13,26 +15,30 @@ This example showcases the usage of hipBLAS' Level 1 SCAL function. The Level 1
7. Call hipBLAS' SCAL function.
8. Copy the result from device to host.
9. Destroy the hipBLAS handle, release device memory.
-10. Validate the output by comparing it to the CPU reference result.
+10. Validate the output by comparing it to the CPU reference result.

### Command line interface
+
The application provides the following optional command line arguments:
+
- `-a` or `--alpha`. The scalar value $\alpha$ used in the SCAL operation. Its default value is 3.
- `-x` or `--incx`. The stride between consecutive values in the data array that makes up vector $x$, which must be greater than zero. Its default value is 1.
- `-n` or `--n`. The number of elements in vector $x$, which must be greater than zero. Its default value is 5.
## Key APIs and Concepts
+
- hipBLAS is initialized by calling `hipblasCreate(hipblasHandle_t *handle)` and it is terminated by calling `hipblasDestroy(hipblasHandle_t handle)`.
- The _pointer mode_ controls whether scalar parameters must be allocated on the host (`HIPBLAS_POINTER_MODE_HOST`) or on the device (`HIPBLAS_POINTER_MODE_DEVICE`). It is set with `hipblasSetPointerMode`.
- `hipblas[SDCZ]scal` multiplies each element of the vector by a scalar. Depending on the character matched in `[SDCZ]`, the scaling can be obtained with different precisions:
- - `S` (single-precision: `float`)
- - `D` (double-precision: `double`)
- - `C` (single-precision complex: `hipblasComplex`)
- - `Z` (double-precision complex: `hipblasDoubleComplex`).
+ - `S` (single-precision: `float`)
+ - `D` (double-precision: `double`)
+ - `C` (single-precision complex: `hipblasComplex`)
+ - `Z` (double-precision complex: `hipblasDoubleComplex`).

## Demonstrated API Calls

### hipBLAS
+
- `hipblasCreate`
- `hipblasDestroy`
- `hipblasHandle_t`
@@ -41,6 +47,7 @@ The application provides the following optional command line arguments:
- `hipblasSscal`

### HIP runtime
+
- `hipFree`
- `hipMalloc`
- `hipMemcpy`
diff --git a/README.md b/README.md
index 1e3f67b0b..f68687dd0 100644
--- a/README.md
+++ b/README.md
@@ -3,135 +3,139 @@
A collection of examples to enable new users to start using ROCm. Advanced users may learn about new functionality through our advanced examples.

## Repository Contents
+
- [AI](https://github.com/ROCm/rocm-examples/tree/develop/AI/MIGraphX/Quantization) Showcases the functionality for executing quantized models using Torch-MIGraphX.
- [Applications](https://github.com/ROCm/rocm-examples/tree/develop/Applications/) groups a number of examples ... .
- - [bitonic_sort](https://github.com/ROCm/rocm-examples/tree/develop/Applications/bitonic_sort/): Showcases how to order an array of $n$ elements using a GPU implementation of the bitonic sort.
- - [convolution](https://github.com/ROCm/rocm-examples/tree/develop/Applications/convolution/): A simple GPU implementation for the calculation of discrete convolutions. - - [floyd_warshall](https://github.com/ROCm/rocm-examples/tree/develop/Applications/floyd_warshall/): Showcases a GPU implementation of the Floyd-Warshall algorithm for finding shortest paths in certain types of graphs. - - [histogram](https://github.com/ROCm/rocm-examples/tree/develop/Applications/histogram/): Histogram over a byte array with memory bank optimization. - - [monte_carlo_pi](https://github.com/ROCm/rocm-examples/tree/develop/Applications/monte_carlo_pi/): Monte Carlo estimation of $\pi$ using hipRAND for random number generation and hipCUB for evaluation. - - [prefix_sum](https://github.com/ROCm/rocm-examples/tree/develop/Applications/prefix_sum/): Showcases a GPU implementation of a prefix sum with a 2-kernel scan algorithm. + - [bitonic_sort](https://github.com/ROCm/rocm-examples/tree/develop/Applications/bitonic_sort/): Showcases how to order an array of $n$ elements using a GPU implementation of the bitonic sort. + - [convolution](https://github.com/ROCm/rocm-examples/tree/develop/Applications/convolution/): A simple GPU implementation for the calculation of discrete convolutions. + - [floyd_warshall](https://github.com/ROCm/rocm-examples/tree/develop/Applications/floyd_warshall/): Showcases a GPU implementation of the Floyd-Warshall algorithm for finding shortest paths in certain types of graphs. + - [histogram](https://github.com/ROCm/rocm-examples/tree/develop/Applications/histogram/): Histogram over a byte array with memory bank optimization. + - [monte_carlo_pi](https://github.com/ROCm/rocm-examples/tree/develop/Applications/monte_carlo_pi/): Monte Carlo estimation of $\pi$ using hipRAND for random number generation and hipCUB for evaluation. 
+ - [prefix_sum](https://github.com/ROCm/rocm-examples/tree/develop/Applications/prefix_sum/): Showcases a GPU implementation of a prefix sum with a 2-kernel scan algorithm. - [Common](https://github.com/ROCm/rocm-examples/tree/develop/Common/) contains common utility functionality shared between the examples. - [HIP-Basic](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/) hosts self-contained recipes showcasing HIP runtime functionality. - - [assembly_to_executable](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/assembly_to_executable): Program and accompanying build systems that show how to manually compile and link a HIP application from host and device code. - - [bandwidth](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/bandwidth): Program that measures memory bandwidth from host to device, device to host, and device to device. - - [bit_extract](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/bit_extract): Program that showcases how to use HIP built-in bit extract. - - [device_globals](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/device_globals): Show cases how to set global variables on the device from the host. - - [device_query](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/device_query): Program that showcases how properties from the device may be queried. - - [dynamic_shared](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/dynamic_shared): Program that showcases how to use dynamic shared memory with the help of a simple matrix transpose kernel. - - [events](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/events/): Measuring execution time and synchronizing with HIP events. - - [gpu_arch](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/gpu_arch/): Program that showcases how to implement GPU architecture-specific code. 
- - [hello_world](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/hello_world): Simple program that showcases launching kernels and printing from the device. - - [hipify](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/hipify): Simple program and build definitions that showcase automatically converting a CUDA `.cu` source into portable HIP `.hip` source. - - [llvm_ir_to_executable](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/llvm_ir_to_executable): Shows how to create a HIP executable from LLVM IR. - - [inline_assembly](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/inline_assembly/): Program that showcases how to use inline assembly in a portable manner. - - [matrix_multiplication](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/matrix_multiplication/): Multiply two dynamically sized matrices utilizing shared memory. - - [module_api](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/module_api/): Shows how to load and execute a HIP module in runtime. - - [moving_average](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/moving_average/): Simple program that demonstrates parallel computation of a moving average of one-dimensional data. - - [multi_gpu_data_transfer](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/multi_gpu_data_transfer/): Performs two matrix transposes on two different devices (one on each) to showcase how to use peer-to-peer communication among devices. - - [occupancy](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/occupancy/): Shows how to find optimal configuration parameters for a kernel launch with maximum occupancy. - - [opengl_interop](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/opengl_interop): Showcases how to share resources and computation between HIP and OpenGL. 
- - [runtime_compilation](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/runtime_compilation/): Simple program that showcases how to use HIP runtime compilation (hipRTC) to compile a kernel and launch it on a device. - - [saxpy](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/saxpy/): Implements the $y_i=ax_i+y_i$ kernel and explains basic HIP functionality. - - [shared_memory](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/shared_memory/): Showcases how to use static shared memory by implementing a simple matrix transpose kernel. - - [static_device_library](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_device_library): Shows how to create a static library containing device functions, and how to link it with an executable. - - [static_host_library](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_host_library): Shows how to create a static library containing HIP host functions, and how to link it with an executable. - - [streams](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/streams/): Program that showcases usage of multiple streams each with their own tasks. - - [texture_management](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/texture_management/): Shows the usage of texture memory. - - [vulkan_interop](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/vulkan_interop): Showcases how to share resources and computation between HIP and Vulkan. - - [warp_shuffle](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/warp_shuffle/): Uses a simple matrix transpose kernel to showcase how to use warp shuffle operations. + - [assembly_to_executable](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/assembly_to_executable): Program and accompanying build systems that show how to manually compile and link a HIP application from host and device code. 
+ - [bandwidth](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/bandwidth): Program that measures memory bandwidth from host to device, device to host, and device to device.
+ - [bit_extract](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/bit_extract): Program that showcases how to use HIP built-in bit extract.
+ - [device_globals](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/device_globals): Showcases how to set global variables on the device from the host.
+ - [device_query](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/device_query): Program that showcases how properties from the device may be queried.
+ - [dynamic_shared](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/dynamic_shared): Program that showcases how to use dynamic shared memory with the help of a simple matrix transpose kernel.
+ - [events](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/events/): Measuring execution time and synchronizing with HIP events.
+ - [gpu_arch](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/gpu_arch/): Program that showcases how to implement GPU architecture-specific code.
+ - [hello_world](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/hello_world): Simple program that showcases launching kernels and printing from the device.
+ - [hipify](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/hipify): Simple program and build definitions that showcase automatically converting a CUDA `.cu` source into portable HIP `.hip` source.
+ - [llvm_ir_to_executable](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/llvm_ir_to_executable): Shows how to create a HIP executable from LLVM IR.
+ - [inline_assembly](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/inline_assembly/): Program that showcases how to use inline assembly in a portable manner.
+ - [matrix_multiplication](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/matrix_multiplication/): Multiply two dynamically sized matrices utilizing shared memory.
+ - [module_api](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/module_api/): Shows how to load and execute a HIP module at runtime.
+ - [moving_average](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/moving_average/): Simple program that demonstrates parallel computation of a moving average of one-dimensional data.
+ - [multi_gpu_data_transfer](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/multi_gpu_data_transfer/): Performs two matrix transposes on two different devices (one on each) to showcase how to use peer-to-peer communication among devices.
+ - [occupancy](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/occupancy/): Shows how to find optimal configuration parameters for a kernel launch with maximum occupancy.
+ - [opengl_interop](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/opengl_interop): Showcases how to share resources and computation between HIP and OpenGL.
+ - [runtime_compilation](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/runtime_compilation/): Simple program that showcases how to use HIP runtime compilation (hipRTC) to compile a kernel and launch it on a device.
+ - [saxpy](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/saxpy/): Implements the $y_i=ax_i+y_i$ kernel and explains basic HIP functionality.
+ - [shared_memory](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/shared_memory/): Showcases how to use static shared memory by implementing a simple matrix transpose kernel.
+ - [static_device_library](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_device_library): Shows how to create a static library containing device functions, and how to link it with an executable.
+ - [static_host_library](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_host_library): Shows how to create a static library containing HIP host functions, and how to link it with an executable. + - [streams](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/streams/): Program that showcases usage of multiple streams each with their own tasks. + - [texture_management](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/texture_management/): Shows the usage of texture memory. + - [vulkan_interop](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/vulkan_interop): Showcases how to share resources and computation between HIP and Vulkan. + - [warp_shuffle](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/warp_shuffle/): Uses a simple matrix transpose kernel to showcase how to use warp shuffle operations. - [Dockerfiles](https://github.com/ROCm/rocm-examples/tree/develop/Dockerfiles/) hosts Dockerfiles with ready-to-use environments for the various samples. See [Dockerfiles/README.md](https://github.com/ROCm/rocm-examples/tree/develop/Dockerfiles/README.md) for details. - [Docs](https://github.com/ROCm/rocm-examples/tree/develop/Docs/) - - [CONTRIBUTING.md](https://github.com/ROCm/rocm-examples/tree/develop/Docs/CONTRIBUTING.md) contains information on how to contribute to the examples. + - [CONTRIBUTING.md](https://github.com/ROCm/rocm-examples/tree/develop/Docs/CONTRIBUTING.md) contains information on how to contribute to the examples. - [Libraries](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/) - - [hipBLAS](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/) - - [gemm_strided_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/gemm_strided_batched/): Showcases the general matrix product operation with strided and batched matrices. 
- - [her](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/her/): Showcases a rank-2 update of a Hermitian matrix with complex values. - - [scal](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/scal/): Simple program that showcases vector scaling (SCAL) operation. - - [hipCUB](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipCUB/) - - [device_radix_sort](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipCUB/device_radix_sort/): Simple program that showcases `hipcub::DeviceRadixSort::SortPairs`. - - [device_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipCUB/device_sum/): Simple program that showcases `hipcub::DeviceReduce::Sum`. - - [hipSOLVER](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/) - - [gels](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/gels/): Solve a linear system of the form $A\times X=B$. - - [geqrf](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/geqrf/): Program that showcases how to obtain a QR decomposition with the hipSOLVER API. - - [gesvd](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/gesvd/): Program that showcases how to obtain a singular value decomposition with the hipSOLVER API. - - [getrf](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/getrf): Program that showcases how to perform a LU factorization with hipSOLVER. - - [potrf](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/potrf/): Perform Cholesky factorization and solve linear system with result. - - [syevd](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevd/): Program that showcases how to calculate the eigenvalues of a matrix using a divide-and-conquer algorithm in hipSOLVER. 
- - [syevdx](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevdx/): Shows how to compute a subset of the eigenvalues and the corresponding eigenvectors of a real symmetric matrix A using the Compatibility API of hipSOLVER. - - [sygvd](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/sygvd/): Showcases how to obtain a solution $(X, \Lambda)$ for a generalized symmetric-definite eigenvalue problem of the form $A \cdot X = B\cdot X \cdot \Lambda$. - - [syevj](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevj): Calculates the eigenvalues and eigenvectors from a real symmetric matrix using the Jacobi method. - - [syevj_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevj_batched): Showcases how to compute the eigenvalues and eigenvectors (via Jacobi method) of each matrix in a batch of real symmetric matrices. - - [sygvj](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/sygvj): Calculates the generalized eigenvalues and eigenvectors from a pair of real symmetric matrices using the Jacobi method. - - [rocBLAS](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/) - - [level_1](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/): Operations between vectors and vectors. - - [axpy](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/axpy/): Simple program that showcases the AXPY operation. - - [dot](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/dot/): Simple program that showcases dot product. - - [nrm2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/nrm2/): Simple program that showcases Euclidean norm of a vector. - - [scal](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/scal/): Simple program that showcases vector scaling (SCAL) operation. 
- - [swap](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/swap/): Showcases exchanging elements between two vectors. - - [level_2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_2/): Operations between vectors and matrices. - - [her](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_2/her/): Showcases a rank-1 update of a Hermitian matrix with complex values. - - [gemv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_2/gemv/): Showcases the general matrix-vector product operation. - - [level_3](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_3/): Operations between matrices and matrices. - - [gemm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_3/gemm/): Showcases the general matrix product operation. - - [gemm_strided_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_3/gemm_strided_batched/): Showcases the general matrix product operation with strided and batched matrices. - - [rocPRIM](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocPRIM/) - - [block_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocPRIM/block_sum/): Simple program that showcases `rocprim::block_reduce` with an addition operator. - - [device_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocPRIM/device_sum/): Simple program that showcases `rocprim::reduce` with an addition operator. - - [rocRAND](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocRAND/) - - [simple_distributions_cpp](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocRAND/simple_distributions_cpp/): A command-line app to compare random number generation on the CPU and on the GPU with rocRAND. 
- - [rocSOLVER](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/) - - [getf2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/getf2): Program that showcases how to perform a LU factorization with rocSOLVER. - - [getri](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/getri): Program that showcases matrix inversion by LU-decomposition using rocSOLVER. - - [syev](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/syev): Shows how to compute the eigenvalues and eigenvectors from a symmetrical real matrix. - - [syev_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/syev_batched): Shows how to compute the eigenvalues and eigenvectors for each matrix in a batch of real symmetric matrices. - - [syev_strided_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/syev_strided_batched): Shows how to compute the eigenvalues and eigenvectors for multiple symmetrical real matrices, that are stored with an arbitrary stride. - - [rocSPARSE](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/) - - [level_2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/): Operations between sparse matrices and dense vectors. - - [bsrmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/bsrmv/): Showcases a sparse matrix-vector multiplication using BSR storage format. - - [bsrxmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/bsrxmv/): Showcases a masked sparse matrix-vector multiplication using BSR storage format. - - [bsrsv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/bsrsv/): Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix. 
- - [coomv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/coomv/): Showcases a sparse matrix-vector multiplication using COO storage format. - - [csrmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/csrmv/): Showcases a sparse matrix-vector multiplication using CSR storage format. - - [csrsv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/csrsv/): Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix. - - [ellmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/ellmv/): Showcases a sparse matrix-vector multiplication using ELL storage format. - - [gebsrmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/gebsrmv/): Showcases a sparse matrix-dense vector multiplication using GEBSR storage format. - - [gemvi](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/gemvi/): Showcases a dense matrix-sparse vector multiplication. - - [spmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/spmv/): Showcases a general sparse matrix-dense vector multiplication. - - [spsv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/spsv/): Showcases how to solve a linear system of equations whose coefficients are stored in a sparse triangular matrix. - - [level_3](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/): Operations between sparse and dense matrices. - - [bsrmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/bsrmm/): Showcases a sparse matrix-matrix multiplication using BSR storage format. 
- - [bsrsm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/bsrsm): Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix, with solution and right-hand side stored in dense matrices. - - [csrmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/csrmm/): Showcases a sparse matrix-matrix multiplication using CSR storage format. - - [csrsm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/csrsm): Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix, with solution and right-hand side stored in dense matrices. - - [gebsrmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/gebsrmm/): Showcases a sparse matrix-matrix multiplication using GEBSR storage format. - - [gemmi](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/gemmi/): Showcases a dense matrix sparse matrix multiplication using CSR storage format. - - [sddmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/sddmm/): Showcases a sampled dense-dense matrix multiplication using CSR storage format. - - [spmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/spmm/): Showcases a sparse matrix-dense matrix multiplication. - - [spsm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/spsm/): Showcases a sparse triangular linear system solver using CSR storage format. - - [preconditioner](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/): Manipulations on sparse matrices to obtain sparse preconditioner matrices. - - [bsric0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/bsric0/): Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse BSR matrix. 
- - [bsrilu0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/bsrilu0/): Showcases how to obtain the incomplete LU decomposition of a sparse BSR square matrix. - - [csric0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/csric0/): Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse CSR matrix. - - [csrilu0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/csrilu0/): Showcases how to obtain the incomplete LU decomposition of a sparse CSR square matrix. - - [csritilu0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/csritilu0/): Showcases how to obtain iteratively the incomplete LU decomposition of a sparse CSR square matrix. - - [rocThrust](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/) - - [device_ptr](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/device_ptr/): Simple program that showcases the usage of the `thrust::device_ptr` template. - - [norm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/norm/): An example that computes the Euclidean norm of a `thrust::device_vector`. - - [reduce_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/reduce_sum/): An example that computes the sum of a `thrust::device_vector` integer vector using the `thrust::reduce()` generalized summation and the `thrust::plus` operator. - - [remove_points](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/remove_points/): Simple program that demonstrates the usage of the `thrust` random number generation, host vector, generation, tuple, zip iterator, and conditional removal templates. It generates a number of random points in a unit square and then removes all of them outside the unit circle. 
- - [saxpy](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/saxpy/): Simple program that implements the SAXPY operation (`y[i] = a * x[i] + y[i]`) using rocThrust and showcases the usage of the vector and functor templates and of `thrust::fill` and `thrust::transform` operations. - - [vectors](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/vectors/): Simple program that showcases the `host_vector` and the `device_vector` of rocThrust. + - [hipBLAS](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/) + - [gemm_strided_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/gemm_strided_batched/): Showcases the general matrix product operation with strided and batched matrices. + - [her](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/her/): Showcases a rank-2 update of a Hermitian matrix with complex values. + - [scal](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/scal/): Simple program that showcases the vector scaling (SCAL) operation. + - [hipCUB](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipCUB/) + - [device_radix_sort](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipCUB/device_radix_sort/): Simple program that showcases `hipcub::DeviceRadixSort::SortPairs`. + - [device_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipCUB/device_sum/): Simple program that showcases `hipcub::DeviceReduce::Sum`. + - [hipSOLVER](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/) + - [gels](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/gels/): Solve a linear system of the form $A\times X=B$. + - [geqrf](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/geqrf/): Program that showcases how to obtain a QR decomposition with the hipSOLVER API.
+ - [gesvd](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/gesvd/): Program that showcases how to obtain a singular value decomposition with the hipSOLVER API. + - [getrf](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/getrf): Program that showcases how to perform an LU factorization with hipSOLVER. + - [potrf](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/potrf/): Performs a Cholesky factorization and solves a linear system with the result. + - [syevd](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevd/): Program that showcases how to calculate the eigenvalues of a matrix using a divide-and-conquer algorithm in hipSOLVER. + - [syevdx](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevdx/): Shows how to compute a subset of the eigenvalues and the corresponding eigenvectors of a real symmetric matrix A using the Compatibility API of hipSOLVER. + - [sygvd](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/sygvd/): Showcases how to obtain a solution $(X, \Lambda)$ for a generalized symmetric-definite eigenvalue problem of the form $A \cdot X = B \cdot X \cdot \Lambda$. + - [syevj](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevj): Calculates the eigenvalues and eigenvectors of a real symmetric matrix using the Jacobi method. + - [syevj_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevj_batched): Showcases how to compute the eigenvalues and eigenvectors (via Jacobi method) of each matrix in a batch of real symmetric matrices. + - [sygvj](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/sygvj): Calculates the generalized eigenvalues and eigenvectors from a pair of real symmetric matrices using the Jacobi method.
+ - [rocBLAS](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/) + - [level_1](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/): Operations between vectors and vectors. + - [axpy](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/axpy/): Simple program that showcases the AXPY operation. + - [dot](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/dot/): Simple program that showcases the dot product. + - [nrm2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/nrm2/): Simple program that showcases the Euclidean norm of a vector. + - [scal](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/scal/): Simple program that showcases the vector scaling (SCAL) operation. + - [swap](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/swap/): Showcases exchanging elements between two vectors. + - [level_2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_2/): Operations between vectors and matrices. + - [her](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_2/her/): Showcases a rank-1 update of a Hermitian matrix with complex values. + - [gemv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_2/gemv/): Showcases the general matrix-vector product operation. + - [level_3](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_3/): Operations between matrices and matrices. + - [gemm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_3/gemm/): Showcases the general matrix product operation. + - [gemm_strided_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_3/gemm_strided_batched/): Showcases the general matrix product operation with strided and batched matrices.
+ - [rocPRIM](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocPRIM/) + - [block_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocPRIM/block_sum/): Simple program that showcases `rocprim::block_reduce` with an addition operator. + - [device_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocPRIM/device_sum/): Simple program that showcases `rocprim::reduce` with an addition operator. + - [rocRAND](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocRAND/) + - [simple_distributions_cpp](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocRAND/simple_distributions_cpp/): A command-line app to compare random number generation on the CPU and on the GPU with rocRAND. + - [rocSOLVER](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/) + - [getf2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/getf2): Program that showcases how to perform an LU factorization with rocSOLVER. + - [getri](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/getri): Program that showcases matrix inversion by LU decomposition using rocSOLVER. + - [syev](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/syev): Shows how to compute the eigenvalues and eigenvectors of a real symmetric matrix. + - [syev_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/syev_batched): Shows how to compute the eigenvalues and eigenvectors for each matrix in a batch of real symmetric matrices. + - [syev_strided_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/syev_strided_batched): Shows how to compute the eigenvalues and eigenvectors for multiple real symmetric matrices that are stored with an arbitrary stride.
+ - [rocSPARSE](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/) + - [level_2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/): Operations between sparse matrices and dense vectors. + - [bsrmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/bsrmv/): Showcases a sparse matrix-vector multiplication using BSR storage format. + - [bsrxmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/bsrxmv/): Showcases a masked sparse matrix-vector multiplication using BSR storage format. + - [bsrsv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/bsrsv/): Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix. + - [coomv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/coomv/): Showcases a sparse matrix-vector multiplication using COO storage format. + - [csrmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/csrmv/): Showcases a sparse matrix-vector multiplication using CSR storage format. + - [csrsv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/csrsv/): Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix. + - [ellmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/ellmv/): Showcases a sparse matrix-vector multiplication using ELL storage format. + - [gebsrmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/gebsrmv/): Showcases a sparse matrix-dense vector multiplication using GEBSR storage format. + - [gemvi](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/gemvi/): Showcases a dense matrix-sparse vector multiplication. 
+ - [spmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/spmv/): Showcases a general sparse matrix-dense vector multiplication. + - [spsv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/spsv/): Showcases how to solve a linear system of equations whose coefficients are stored in a sparse triangular matrix. + - [level_3](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/): Operations between sparse and dense matrices. + - [bsrmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/bsrmm/): Showcases a sparse matrix-matrix multiplication using BSR storage format. + - [bsrsm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/bsrsm): Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix, with solution and right-hand side stored in dense matrices. + - [csrmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/csrmm/): Showcases a sparse matrix-matrix multiplication using CSR storage format. + - [csrsm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/csrsm): Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix, with solution and right-hand side stored in dense matrices. + - [gebsrmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/gebsrmm/): Showcases a sparse matrix-matrix multiplication using GEBSR storage format. + - [gemmi](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/gemmi/): Showcases a dense matrix-sparse matrix multiplication using CSR storage format. + - [sddmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/sddmm/): Showcases a sampled dense-dense matrix multiplication using CSR storage format.
+ - [spmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/spmm/): Showcases a sparse matrix-dense matrix multiplication. + - [spsm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/spsm/): Showcases a sparse triangular linear system solver using CSR storage format. + - [preconditioner](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/): Manipulations on sparse matrices to obtain sparse preconditioner matrices. + - [bsric0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/bsric0/): Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse BSR matrix. + - [bsrilu0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/bsrilu0/): Showcases how to obtain the incomplete LU decomposition of a sparse BSR square matrix. + - [csric0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/csric0/): Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse CSR matrix. + - [csrilu0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/csrilu0/): Showcases how to obtain the incomplete LU decomposition of a sparse CSR square matrix. + - [csritilu0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/csritilu0/): Showcases how to iteratively obtain the incomplete LU decomposition of a sparse CSR square matrix. + - [rocThrust](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/) + - [device_ptr](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/device_ptr/): Simple program that showcases the usage of the `thrust::device_ptr` template. + - [norm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/norm/): An example that computes the Euclidean norm of a `thrust::device_vector`.
+ - [reduce_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/reduce_sum/): An example that computes the sum of a `thrust::device_vector` integer vector using the `thrust::reduce()` generalized summation and the `thrust::plus` operator. + - [remove_points](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/remove_points/): Simple program that demonstrates the usage of the `thrust` random number generation, host vector, generation, tuple, zip iterator, and conditional removal templates. It generates a number of random points in a unit square and then removes all points outside the unit circle. + - [saxpy](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/saxpy/): Simple program that implements the SAXPY operation (`y[i] = a * x[i] + y[i]`) using rocThrust and showcases the usage of the vector and functor templates and of `thrust::fill` and `thrust::transform` operations. + - [vectors](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/vectors/): Simple program that showcases the `host_vector` and the `device_vector` of rocThrust. ## Prerequisites + ### Linux + - [CMake](https://cmake.org/download/) (at least version 3.21) - A number of examples also support building via GNU Make - available through the distribution's package manager - [ROCm](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.1.3/page/Overview_of_ROCm_Installation_Methods.html) (at least version 5.x.x) - For example-specific prerequisites, see the example subdirectories. ### Windows + - [Visual Studio](https://visualstudio.microsoft.com/) 2019 or 2022 with the "Desktop Development with C++" workload - ROCm toolchain for Windows (No public release yet) - The Visual Studio ROCm extension needs to be installed to build with the solution files. @@ -139,11 +143,15 @@ A collection of examples to enable new users to start using ROCm.
Advanced users - [Ninja](https://ninja-build.org/) (optional, to build with CMake) ## Building the example suite + ### Linux + These instructions assume that the prerequisites for every example are installed on the system. #### CMake + See [CMake build options](#cmake-build-options) for an overview of build options. + - `$ git clone https://github.com/ROCm/rocm-examples.git` - `$ cd rocm-examples` - `$ cmake -S . -B build` (on ROCm) or `$ cmake -S . -B build -D GPU_RUNTIME=CUDA` (on CUDA) @@ -151,15 +159,19 @@ See [CMake build options](#cmake-build-options) for an overview of build options - `$ cmake --install build --prefix install` #### Make + Beware that only a subset of the examples support building via Make. + - `$ git clone https://github.com/ROCm/rocm-examples.git` - `$ cd rocm-examples` - `$ make` (on ROCm) or `$ make GPU_RUNTIME=CUDA` (on CUDA) ### Linux with Docker + Alternatively, instead of installing the prerequisites on the system, the [Dockerfiles](https://github.com/ROCm/rocm-examples/tree/develop/Dockerfiles/) in this repository can be used to build images that provide all required prerequisites. Note that the ROCm kernel GPU driver still needs to be installed on the host system. The following instructions showcase building the Docker image and full example suite inside the container using CMake: + - `$ git clone https://github.com/ROCm/rocm-examples.git` - `$ cd rocm-examples/Dockerfiles` - `$ docker build . -t rocm-examples -f hip-libraries-rocm-ubuntu.Dockerfile` (on ROCm) or `$ docker build .
-t rocm-examples -f hip-libraries-cuda-ubuntu.Dockerfile` (on CUDA) @@ -170,11 +182,15 @@ The following instructions showcase building the Docker image and full example s - `# cmake --build build` The built executables can be found and run in the `build` directory: + - `# ./build/Libraries/rocRAND/simple_distributions_cpp/simple_distributions_cpp` ### Windows + #### Visual Studio + The repository has Visual Studio project files for each example individually, as well as top level solution files. + - Project files for Visual Studio are named after the example with a `_vs` suffix, e.g. `device_sum_vs2019.sln` for the device sum example. - The project files can be built from Visual Studio or from the command line using MSBuild. - Use the build solution command in Visual Studio to build. @@ -185,6 +201,7 @@ The repository has Visual Studio project files for all examples and individually - The top level solution files come in two flavors: `ROCm-Examples-VS.sln` and `ROCm-Examples-Portable-VS.sln`. The former contains all examples, while the latter contains the examples that support both ROCm and CUDA. #### CMake + First, clone the repository and go to the source directory. ```shell @@ -195,6 +212,7 @@ cd rocm-examples ``` There are two ways to build the project using CMake: with the Visual Studio Developer Command Prompt (recommended) or with a standard Command Prompt. See [CMake build options](#cmake-build-options) for an overview of build options. ##### Visual Studio Developer Command Prompt + Select Start, search for "x64 Native Tools Command Prompt for VS 2019", and open the resulting Command Prompt. Ninja must be selected as the generator, and Clang as the C++ compiler. ```shell @@ -203,6 +221,7 @@ cmake --build build ``` ##### Standard Command Prompt + Run the standard Command Prompt. When using the standard Command Prompt to build the project, the Resource Compiler (RC) path must be specified.
The RC is a tool used to build Windows-based applications; its default path is `C:/Program Files (x86)/Windows Kits/10/bin//x64/rc.exe`. Finally, the generator must be set to Ninja. ```shell @@ -211,6 +230,7 @@ cmake --build build ``` ### CMake build options + The following options are available when building with CMake. | Option | Relevant to | Default value | Description | |:---------------------------|:------------|:-----------------|:--------------------------------------------------------------------------------------------------------|
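As a minimal sketch of how the options in this table are used (a configuration fragment only, assuming the repository has been cloned and the relevant toolchain is installed), a value such as `GPU_RUNTIME` is passed at configure time with `-D`, exactly as in the Linux instructions above:

```shell
# Illustrative only: configure the example suite for the CUDA backend,
# build it, and install it to a local prefix. Run from the rocm-examples
# source directory; omit -D GPU_RUNTIME=CUDA to build for ROCm instead.
cmake -S . -B build -D GPU_RUNTIME=CUDA
cmake --build build
cmake --install build --prefix install
```

The same `-D NAME=VALUE` pattern applies to the other options in the table.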