Commit 35d0c73

SWDEV-354898 - HIP documents patch for 5.4 release.
Change-Id: I52a20f69775ad06672321fdaa114dbee815a9838
1 parent b6ec0c8 commit 35d0c73

4 files changed

+130
-128
lines changed

docs/markdown/hip_debugging.md

+1-1
@@ -262,7 +262,7 @@ The following is the summary of the most useful environment variables in HIP.
262262
| AMD_SERIALIZE_COPY <br><sub> Serialize copies. </sub> | 0 | 1: Wait for completion before enqueue. <br> 2: Wait for completion after enqueue. <br> 3: Both. |
263263
| HIP_HOST_COHERENT <br><sub> Coherent memory in hipHostMalloc. </sub> | 0 | 0: memory is not coherent between host and GPU. <br> 1: memory is coherent with host. |
264264
| AMD_DIRECT_DISPATCH <br><sub> Enable direct kernel dispatch. </sub> | 1 | 0: Disable. <br> 1: Enable. |
265-
265+
| GPU_MAX_HW_QUEUES <br><sub> The maximum number of hardware queues allocated per device. </sub> | 4 | Controls how many independent hardware queues the HIP runtime can create per process, per device. If an application allocates more HIP streams than this number, the HIP runtime reuses the same hardware queues for the new streams in a round-robin manner. Note that this maximum does not apply to hardware queues created for CU-masked HIP streams, or to the cooperative queue for HIP Cooperative Groups (there is only one such queue per device). |
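
The round-robin reuse described for GPU_MAX_HW_QUEUES can be pictured with a small host-side sketch. This is an illustration only, assuming streams are numbered in creation order and mapped to queues by a simple modulo; the helper name `queue_for_stream` is hypothetical and is not part of the HIP runtime, whose actual assignment policy may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a HIP API): returns the index of the hardware
// queue that the stream_index-th stream would share, assuming a plain
// round-robin assignment over max_hw_queues queues.
std::size_t queue_for_stream(std::size_t stream_index, std::size_t max_hw_queues) {
    assert(max_hw_queues > 0);  // at least one hardware queue must exist
    return stream_index % max_hw_queues;
}
```

Under this assumption and the default value of 4, the fifth stream created (index 4) would share a hardware queue with the first (index 0).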
266266

267267
## General Debugging Tips
268268
- 'gdb --args' can be used to conveniently pass the executable and arguments to gdb.

docs/markdown/hip_kernel_language.md

+14-2
@@ -455,9 +455,9 @@ Following is the list of supported integer intrinsics. Note that intrinsics are
455455
| unsigned int __popcll ( unsigned long long int x )<br><sub>Count the number of bits that are set to 1 in a 64 bit integer.</sub> |
456456
| int __mul24 ( int x, int y )<br><sub>Multiply two 24bit integers.</sub> |
457457
| unsigned int __umul24 ( unsigned int x, unsigned int y )<br><sub>Multiply two 24bit unsigned integers.</sub> |
458-
<sub><b id="f3"><sup>[1]</sup></b>
458+
<sub><b id="f3"><sup>[1]</sup></b>
459459
The HIP-Clang implementation of __ffs() and __ffsll() contains code to add a constant +1 to produce the ffs result format.
460-
For the cases where this overhead is not acceptable and programmer is willing to specialize for the platform,
460+
For the cases where this overhead is not acceptable and programmer is willing to specialize for the platform,
461461
HIP-Clang provides __lastbit_u32_u32(unsigned int input) and __lastbit_u32_u64(unsigned long long int input).
462462
The index returned by __lastbit_ instructions starts at -1, while for ffs the index starts at 0.
463463

@@ -496,6 +496,18 @@ long long int clock64()
496496
```
497497
Returns the value of counter that is incremented every clock cycle on device. Difference in values returned provides the cycles used.
498498

499+
```
500+
long long int wall_clock64()
501+
```
502+
Returns the wall clock count on the device. The counter runs at a constant frequency, which can be queried from HIP application code through the hipDeviceAttributeWallClockRate device attribute, for example,
503+
```
504+
int wallClkRate = 0; //in kilohertz
505+
HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
506+
```
507+
Where hipDeviceAttributeWallClockRate is a device attribute.
508+
Note that, wall clock frequency is a per-device attribute.
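
Because the rate is reported in kilohertz, that is, ticks per millisecond, the difference between two wall_clock64() readings can be converted to elapsed time on the host. The following is a minimal sketch under that assumption; the helper name `wall_ticks_to_ms` is hypothetical and not part of the HIP API, and the tick values are assumed to have been copied back from the device.

```cpp
#include <cassert>

// Hypothetical helper (not a HIP API): converts the difference between two
// wall_clock64() readings into milliseconds.
// rate_khz is the value queried with hipDeviceAttributeWallClockRate,
// in kilohertz, so it equals the number of ticks per millisecond.
double wall_ticks_to_ms(long long start_ticks, long long end_ticks, int rate_khz) {
    assert(rate_khz > 0);  // a zero or negative rate means the query failed
    return static_cast<double>(end_ticks - start_ticks) / rate_khz;
}
```

Since the frequency is per device, the rate used for the conversion should come from the same device that produced the readings.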
509+
510+
499511
## Atomic Functions
500512

501513
Atomic functions execute as read-modify-write operations residing in global or shared memory. No other device or thread can observe or modify the memory location during an atomic operation. If multiple instructions from different devices or threads target the same memory location, the instructions are serialized in an undefined order.

docs/markdown/hip_programming_guide.md

-3
@@ -102,9 +102,6 @@ A stronger system-level fence can be specified when the event is created with hi
102102
- hipEventReleaseToSystem : Perform a system-scope release operation when the event is recorded.  This will make both Coherent and Non-Coherent host memory visible to other agents in the system, but may involve heavyweight operations such as cache flushing.  Coherent memory will typically use lighter-weight in-kernel synchronization mechanisms such as an atomic operation and thus does not need to use hipEventReleaseToSystem.
103103
- hipEventDisableTiming: Events created with this flag would not record profiling data and provide best performance if used for synchronization.
104104

105-
Note, for HIP Events used in kernel dispatch using hipExtLaunchKernelGGL/hipExtLaunchKernel, events passed in the API are not explicitly recorded and should only be used to get elapsed time for that specific launch.
106-
In case events are used across multiple dispatches, for example, start and stop events from different hipExtLaunchKernelGGL/hipExtLaunchKernel calls, they will be treated as invalid unrecorded events, HIP will throw error "hipErrorInvalidHandle" from hipEventElapsedTime.
107-
108105
### Summary and Recommendations:
109106

110107
- Coherent host memory is the default and is the easiest to use since the memory is visible to the CPU at typical synchronization points. This memory allows in-kernel synchronization commands such as threadfence_system to work transparently.
