Add more descriptive image captions.
dgaliffiAMD committed May 22, 2024
1 parent 818e3a1 commit 64cd534
Showing 8 changed files with 9 additions and 9 deletions.
2 changes: 1 addition & 1 deletion Applications/bitonic_sort/README.md
@@ -13,7 +13,7 @@ Each step $i$ of this bitonic sort implementation yields bitonic subsequences of

Below is an example of how an array of length 8 would be sorted in increasing order. An arrow from one element to another means that those two elements are compared in the stage and step indicated in the left columns. The resulting order is such that the lesser element is placed at the position from which the arrow starts and the greater element at the position pointed to by the end of the arrow. For easier understanding, black arrows correspond to an increasing order and grey arrows to a decreasing order of the elements.

- ![bitonic_sort.svg](bitonic_sort.svg)
+ ![A visual representation of sorting an array.](bitonic_sort.svg)
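
The compare-exchange network described above can be modeled on the CPU as a minimal sketch. The function name and layout are illustrative assumptions, not the example's HIP kernel:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU model of the bitonic sorting network: stage k merges bitonic
// subsequences of length k, and each step j halves the comparison
// distance. The `increasing` flag plays the role of the black
// (ascending) versus grey (descending) arrows in the diagram.
std::vector<int> bitonic_sort(std::vector<int> a)
{
    const std::size_t n = a.size(); // assumed to be a power of two
    for(std::size_t k = 2; k <= n; k *= 2) // stage
    {
        for(std::size_t j = k / 2; j > 0; j /= 2) // step
        {
            for(std::size_t i = 0; i < n; ++i)
            {
                const std::size_t partner = i ^ j;
                if(partner <= i)
                    continue;
                const bool increasing  = (i & k) == 0;
                const bool wrong_order = increasing ? a[i] > a[partner] : a[i] < a[partner];
                if(wrong_order)
                    std::swap(a[i], a[partner]);
            }
        }
    }
    return a;
}
```

On the GPU, the innermost loop over `i` is what the threads execute in parallel, one compare-exchange per thread pair.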

### Application flow

2 changes: 1 addition & 1 deletion Applications/prefix_sum/README.md
@@ -18,7 +18,7 @@ The algorithm used has two phases which are repeated:
Below is an example where the number of threads per block is 2.
In the first iteration ($\text{offset}=1$), 4 threads combine 8 items.

- ![prefix_sum_diagram.svg](prefix_sum_diagram.svg)
+ ![A diagram illustrating a GPU implementation of a prefix sum via a scan algorithm.](prefix_sum_diagram.svg)
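
The doubling-offset idea can be modeled sequentially with a minimal sketch. This is a simplified host-side model of an inclusive scan, not the example's block-wise two-phase kernel:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential model of a doubling-offset inclusive scan: in each
// iteration, element i adds the element `offset` positions to its
// left, and the offset doubles until it spans the array. On the GPU,
// the inner loop is performed by threads working in parallel.
std::vector<int> inclusive_scan(std::vector<int> a)
{
    std::vector<int> next(a.size());
    for(std::size_t offset = 1; offset < a.size(); offset *= 2)
    {
        for(std::size_t i = 0; i < a.size(); ++i)
            next[i] = i >= offset ? a[i - offset] + a[i] : a[i];
        a.swap(next);
    }
    return a;
}
```

For the input `{1, 1, 1, 1, 1, 1, 1, 1}` this produces the running totals `1` through `8`.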

### Application flow

2 changes: 1 addition & 1 deletion HIP-Basic/matrix_multiplication/README.md
@@ -14,7 +14,7 @@ This example showcases the multiplication of two dynamically sized two-dimension
6. The elements of the resulting matrix $\mathrm{C}$ are copied to the host and all device memory is freed.
7. The elements of $\mathrm{C}$ are compared with the expected result. The result of the comparison is printed to the standard output.
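
The comparison in step 7 can be modeled with a host-side reference multiplication like the following sketch. The row-major layout and the function name are assumptions for illustration, not necessarily the example's own layout:

```cpp
#include <cassert>
#include <vector>

// Host-side reference multiplication of an m x k matrix A with a
// k x n matrix B, both stored row-major, used to validate the
// device-computed result C.
std::vector<float> reference_multiply(const std::vector<float>& A,
                                      const std::vector<float>& B,
                                      int m, int k, int n)
{
    std::vector<float> C(m * n, 0.f);
    for(int i = 0; i < m; ++i)
        for(int j = 0; j < n; ++j)
            for(int p = 0; p < k; ++p)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
    return C;
}
```

The device result is then compared element by element against this reference, typically within a small tolerance for floating-point rounding.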

- ### Command line interface
+ ## Command line interface

- If no command line argument is provided, the default matrix sizes are used.

2 changes: 1 addition & 1 deletion HIP-Basic/runtime_compilation/README.md
@@ -28,7 +28,7 @@ The diagram below summarizes the runtime compilation part of the example.
16. The module is unloaded from the current context and freed.
17. The first few elements of the result vector $y$ are printed to the standard output.

- ![hiprtc.svg](hiprtc.svg)
+ ![A diagram to visualize the runtime compilation and launch of this example.](hiprtc.svg)

## Key APIs and Concepts

4 changes: 2 additions & 2 deletions HIP-Basic/warp_shuffle/README.md
@@ -24,11 +24,11 @@ This example showcases how to use the above-mentioned operations by implementing
## Key APIs and Concepts

Warp shuffle is a warp-level primitive that allows communication between the threads of a warp. Below is a simple example that shows how the value of the thread with index 2 is copied to all other threads within the warp.
- ![warp_shuffle_simple.svg](warp_shuffle_simple.svg)
+ ![An illustration of a single value being copied to other threads within the warp.](warp_shuffle_simple.svg)

`__shfl(var, src_lane, width = warp_size)` copies the value of `var` from the thread `src_lane` within the warp. The operation takes an optional third parameter, `width` (not used in this example), which defaults to the warp size and restricts the number of threads of the warp from which values are read. Values are copied from threads with an ID in the range $[0, width-1]$. If the ID of the thread specified in the call to `__shfl` is outside that range, the thread accessed is the one with that ID modulo `width`. The `src_lane` may also vary per thread, as shown below.

- ![warp_shuffle.svg](warp_shuffle.svg)
+ ![A more complex illustration of warp shuffle, which includes a variable source.](warp_shuffle.svg)
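
The indexing rule above can be modeled on the host with a minimal sketch. This is a simplified model of the modulo-`width` behavior, not the hardware primitive itself:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of __shfl's source-lane selection: the warp is
// partitioned into groups of `width` lanes, and each lane reads the
// value held by lane (src_lane % width) within its own group.
std::vector<int> shfl_model(const std::vector<int>& var, int src_lane, int width)
{
    std::vector<int> out(var.size());
    for(std::size_t lane = 0; lane < var.size(); ++lane)
    {
        const std::size_t group_base = (lane / width) * width;
        out[lane]                    = var[group_base + (src_lane % width)];
    }
    return out;
}
```

With `width` equal to the warp size, every lane reads from `src_lane`; with a smaller `width`, each group of `width` lanes reads independently.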

- `hipGetDeviceProperties` gets the properties of the specified device. In this example, it is used to get the warp size of the device (GPU) used.
- `hipMalloc` allocates memory in the global memory of the device, and with `hipMemcpy` data bytes can be transferred from host to device (using `hipMemcpyHostToDevice`) or from device to host (using `hipMemcpyDeviceToHost`), among others.
2 changes: 1 addition & 1 deletion Libraries/rocBLAS/level_1/swap/README.md
@@ -32,7 +32,7 @@ The application provides the following optional command line arguments:

- `rocblas_set_vector(n, elem_size, *x, incx, *y, incy)` is used to copy vectors from host to device memory. `n` is the total number of elements that should be copied, and `elem_size` is the size of a single element in bytes. The elements are copied from `x` to `y`, where the step size between consecutive elements of `x` and `y` is given respectively by `incx` and `incy`. Note that the increments are given in elements, not bytes. Additionally, the step size of either `x`, `y`, or both may be negative. In this case, care must be taken that the correct pointer is passed to `rocblas_set_vector`, as it is not automatically adjusted to the end of the input vector. When `incx` and `incy` are both 1, calling this function is equivalent to `hipMemcpy(y, x, n * elem_size, hipMemcpyHostToDevice)`. See the following diagram, which illustrates `rocblas_set_vector(3, sizeof(T), x, incx, y, incy)`:

- ![set_get_vector.svg](set_get_vector.svg)
+ ![An illustration of rocblas_set_vector execution.](set_get_vector.svg)
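
The access pattern can be modeled with a plain strided copy. This host-side sketch assumes positive increments; it illustrates the indexing only and is not the library's implementation:

```cpp
#include <cassert>
#include <vector>

// Host-side model of the strided copy performed by rocblas_set_vector:
// n elements are read from x with step incx and written to y with step
// incy. Increments are measured in elements, not bytes.
void strided_copy(int n, const int* x, int incx, int* y, int incy)
{
    for(int i = 0; i < n; ++i)
        y[i * incy] = x[i * incx];
}
```

With `incx == incy == 1` this degenerates to a contiguous copy, matching the `hipMemcpy` equivalence noted above.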

- `rocblas_get_vector(n, elem_size, *x, incx, *y, incy)` is used to copy vectors from device to host memory. Its arguments are similar to `rocblas_set_vector`. Elements are also copied from `x` to `y`.

2 changes: 1 addition & 1 deletion Libraries/rocBLAS/level_2/her/README.md
@@ -32,7 +32,7 @@ The application provides the following optional command line arguments:

- `rocblas_[cz]her(handle, uplo, n, *alpha, *x, incx, *A, lda)` computes a Hermitian rank-1 update, defined as $A = A + \alpha \cdot x \cdot x ^ H$, where $A$ is an $n \times n$ Hermitian matrix, and $x$ is a complex vector of $n$ elements. The character matched in `[cz]` denotes the data type of the operation, and can either be `c` (complex float: `rocblas_complex_float`), or `z` (complex double: `rocblas_complex_double`). Because a Hermitian matrix equals its own conjugate transpose, the values in the upper triangle are the complex conjugates of the values in the lower triangle, so the required work is reduced by updating only one half of the matrix. The part of the matrix to update is given by `uplo`: `rocblas_fill_upper` indicates that the upper triangle of $A$ should be updated, and `rocblas_fill_lower` indicates that the lower triangle should be updated. Values in the other triangle are not altered. `n` gives the dimensions of $x$ and $A$, and `incx` the increment in elements between items of $x$. `lda` is the _leading dimension_ of $A$: the number of elements between the starts of consecutive columns of $A$. The elements of each column of $A$ are packed in memory. Note that rocBLAS matrices are laid out in _column major_ ordering. See the following figure, which illustrates the memory layout of a matrix with 3 rows and 2 columns:<br>

- ![matrix-layout.svg](matrix-layout.svg)
+ ![Memory layout of a 3 x 2 matrix.](matrix-layout.svg)
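
The lower-triangle update and the column-major indexing can be sketched on the host as follows. This is an illustrative model, not the rocBLAS implementation:

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Host-side model of the lower-triangle Hermitian rank-1 update
// A = A + alpha * x * x^H, with column-major storage: element (i, j)
// lives at A[j * lda + i]. Note that alpha is real for *her.
void her_lower(int n, float alpha, const std::vector<std::complex<float>>& x,
               std::vector<std::complex<float>>& A, int lda)
{
    for(int j = 0; j < n; ++j)
        for(int i = j; i < n; ++i) // only the lower triangle is touched
            A[j * lda + i] += alpha * x[i] * std::conj(x[j]);
}
```

Swapping the bounds of the inner loop (`i` from `0` to `j`) would give the `rocblas_fill_upper` variant instead.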

- `hipFloatComplex`, `std::complex<float>`, and `rocblas_float_complex` share a compatible memory layout, so a plain memory copy between values of these types correctly copies the data.

2 changes: 1 addition & 1 deletion Libraries/rocBLAS/level_3/gemm_strided_batched/README.md
@@ -49,7 +49,7 @@ The application provides the following optional command line arguments:

We can apply the same multiplication operator to several matrices if we combine them into batched matrices. Batched matrix multiplication improves performance for a large number of small matrices. When the stride between consecutive matrices is constant, strided batched GEMM provides further acceleration.

- ![strided-matrix-layout.svg](strided-matrix-layout.svg)
+ ![Layout of two batched matrices, each with 3 x 2 elements, and a stride of 9.](strided-matrix-layout.svg)
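
The resulting addressing can be expressed as a small helper (the function name is an illustrative assumption): element $(i, j)$ of batch $b$ lives at offset $b \cdot \mathrm{stride} + j \cdot \mathrm{lda} + i$.

```cpp
#include <cassert>

// Offset of element (i, j) in batch b, for column-major matrices with
// leading dimension lda, packed at a constant batch stride. The figure
// above uses 3 x 2 matrices (lda = 3) with a stride of 9, leaving 3
// padding elements between consecutive matrices.
long batched_index(long b, long i, long j, long lda, long stride)
{
    return b * stride + j * lda + i;
}
```

A stride equal to `lda * n` packs the batch with no padding; a larger stride, as in the figure, leaves a gap between matrices.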

- rocBLAS is initialized by calling `rocblas_create_handle(rocblas_handle*)` and it is terminated by calling `rocblas_destroy_handle(rocblas_handle)`.

