Commit 233b8a8

SWDEV-340007 - Add per-thread stream support in hip documents
Change-Id: Ib32d768b296966fc28e6e30875e8fda366e6eff7
1 parent ab1c67c commit 233b8a8

3 files changed: +33 -10 lines changed

docs/markdown/hip_faq.md

+14-1
@@ -33,6 +33,7 @@
 - [Why _OpenMP is undefined when compiling with -fopenmp?](#why-_openmp-is-undefined-when-compiling-with--fopenmp)
 - [Does the HIP-Clang compiler support extern shared declarations?](#does-the-hip-clang-compiler-support-extern-shared-declarations)
 - [I have multiple HIP enabled devices and I am getting an error message hipErrorNoBinaryForGpu: Unable to find code object for all current devices?](#i-have-multiple-hip-enabled-devices-and-i-am-getting-an-error-message-hipErrorNoBinaryForGpu-unable-to-find-code-object-for-all-current-devices)
+- [How to use per-thread default stream in HIP?](#how-to-use-per-thread-default-stream-in-hip)
 - [How can I know the version of HIP?](#how-can-I-know-the-version-of-hip)
 <!-- tocstop -->

@@ -94,7 +95,7 @@ However, we can provide a rough summary of the features included in each CUDA SD
 - CUDA 6.5 :
 - __shfl intriniscs (supported)
 - CUDA 7.0 :
-- Per-thread-streams (under development)
+- Per-thread default streams (supported)
 - C++11 (Hip-Clang supports all of C++11, all of C++14 and some C++17 features)
 - CUDA 7.5 :
 - float16 (supported)
@@ -260,6 +261,18 @@ If you have a precompiled application/library (like rocblas, tensorflow etc) whi
 - The application/library does not ship code object bundles for *all* of your device(s): in this case you need to recompile the application/library yourself with correct `--offload-arch`.
 - The application/library does not ship code object bundles for *some* of your device(s), for example you have a system with an APU + GPU and the library does not ship code objects for your APU. For this you can set the environment variable `HIP_VISIBLE_DEVICES` to only enable GPUs for which code object is available. This will limit the GPUs visible to your application and allow it to run.
 
+### How to use per-thread default stream in HIP?
+
+The per-thread default stream is an implicit stream local to both the thread and the current device. It does not do any implicit synchronization with other streams (like explicitly created streams), or default per-thread stream on other threads.
+
+The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program.
+
+In ROCm, a compilation option should be added in order to compile the translation unit with per-thread default stream enabled.
+“-fgpu-default-stream=per-thread”.
+Once source is compiled with per-thread default stream enabled, all APIs will be executed on per thread default stream, hence there will not be any implicit synchronization with other streams.
+
+Besides, per-thread default stream be enabled per translation unit, users can compile some files with feature enabled and some with feature disabled. Feature enabled translation unit will have default stream as per thread and there will not be any implicit synchronization done but other modules will have legacy default stream which will do implicit synchronization.
+
 ### How can I know the version of HIP?
 
 HIP version definition has been updated since ROCm 4.2 release as the following:
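
Note on the FAQ entry added in this file: the sketch below is not part of the commit. It illustrates, under stated assumptions, a translation unit that never names a stream and so relies entirely on the default stream chosen at compile time. The file name, the kernel `fill`, and the helper `check` are hypothetical, and the `hipcc` lines in the comments assume the option quoted in the FAQ is passed directly to the compiler driver.

```cpp
// per_thread_default.cpp (hypothetical file name)
// Assumed build commands:
//   hipcc -fgpu-default-stream=per-thread per_thread_default.cpp -o demo
//   hipcc per_thread_default.cpp -o demo    (legacy default stream)
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <thread>

// Hypothetical helper: abort on any HIP error.
static void check(hipError_t err, const char* what) {
  if (err != hipSuccess) {
    std::fprintf(stderr, "%s failed: %s\n", what, hipGetErrorString(err));
    std::exit(EXIT_FAILURE);
  }
}

// Hypothetical kernel: writes `value` into every element.
__global__ void fill(int* out, int value, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = value;
}

void worker(int value) {
  const int n = 1024;
  int* dev = nullptr;
  check(hipMalloc(reinterpret_cast<void**>(&dev), n * sizeof(int)), "hipMalloc");

  // No stream is named, so the launch goes to the default stream.
  // Built with the per-thread option, that is this thread's own stream;
  // built without it, both threads share the legacy default stream.
  fill<<<dim3(n / 256), dim3(256)>>>(dev, value, n);
  check(hipGetLastError(), "kernel launch");

  // Blocking copy on the same default stream, ordered after the kernel.
  int first = 0;
  check(hipMemcpy(&first, dev, sizeof(int), hipMemcpyDeviceToHost), "hipMemcpy");
  std::printf("worker %d read %d\n", value, first);

  check(hipFree(dev), "hipFree");
}

int main() {
  std::thread a(worker, 1), b(worker, 2);
  a.join();
  b.join();
  return 0;
}
```

The program is written to be correct under either build. Only the synchronization behaviour between the two host threads should differ, which is the per-translation-unit choice the FAQ text describes.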

docs/markdown/hip_programming_guide.md

+9
@@ -139,6 +139,15 @@ This implementation does not require the use of `hipDeviceSetLimit(hipLimitMallo
 
 The test codes in the link (https://github.com/ROCm-Developer-Tools/HIP/blob/develop/tests/src/deviceLib/hipDeviceMalloc.cpp) show how to implement application using malloc and free functions in device kernels.
 
+## Use of Per-thread default stream
+
+The per-thread default stream is supported in HIP. It is an implicit stream local to both the thread and the current device. This means that the command issued to the per-thread default stream by the thread does not implicitly synchronize with other streams (like explicitly created streams), or default per-thread stream on other threads.
+The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program.
+The per-thread default stream can be enabled via adding a compilation option,
+“-fgpu-default-stream=per-thread”.
+
+And users can explicitly use "hipStreamPerThread" as per-thread default stream handle as input in API commands. There are test codes as examples in the link (https://github.com/ROCm-Developer-Tools/HIP/tree/develop/tests/catch/unit/streamperthread).
+
 ## Use of Long Double Type
 
 In HIP-Clang, long double type is 80-bit extended precision format for x86_64, which is not supported by AMDGPU. HIP-Clang treats long double type as IEEE double type for AMDGPU. Using long double type in HIP source code will not cause issue as long as data of long double type is not transferred between host and device. However, long double type should not be used as kernel argument type.
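
Note on the "Use of Per-thread default stream" section added in this file: the sketch below is not part of the commit and is not copied from the linked tests. It illustrates passing `hipStreamPerThread` explicitly as the stream handle, which should not depend on the compilation option. The kernel `add_one`, the helper `check`, and the buffer size are hypothetical.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper: abort on any HIP error.
static void check(hipError_t err, const char* what) {
  if (err != hipSuccess) {
    std::fprintf(stderr, "%s failed: %s\n", what, hipGetErrorString(err));
    std::exit(EXIT_FAILURE);
  }
}

// Hypothetical kernel: adds 1.0f to every element.
__global__ void add_one(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

void worker(int id) {
  const int n = 1 << 16;
  const size_t bytes = n * sizeof(float);
  std::vector<float> host(n, static_cast<float>(id));
  float* dev = nullptr;
  check(hipMalloc(reinterpret_cast<void**>(&dev), bytes), "hipMalloc");

  // Every command names hipStreamPerThread, so the copy, the kernel and the
  // copy back are ordered on the calling thread's own default stream.
  check(hipMemcpyAsync(dev, host.data(), bytes, hipMemcpyHostToDevice,
                       hipStreamPerThread), "copy to device");
  add_one<<<dim3((n + 255) / 256), dim3(256), 0, hipStreamPerThread>>>(dev, n);
  check(hipGetLastError(), "kernel launch");
  check(hipMemcpyAsync(host.data(), dev, bytes, hipMemcpyDeviceToHost,
                       hipStreamPerThread), "copy to host");

  // Wait only for this thread's stream before reading the result.
  check(hipStreamSynchronize(hipStreamPerThread), "hipStreamSynchronize");
  std::printf("worker %d: host[0] = %f\n", id, host[0]);

  check(hipFree(dev), "hipFree");
}

int main() {
  std::thread a(worker, 0), b(worker, 1);
  a.join();
  b.join();
  return 0;
}
```

Each host thread synchronizes only its own per-thread stream here. Per the text above, mixing in work on the default null stream would reintroduce synchronization between the two.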

include/hip/hip_runtime_api.h

+10-9
@@ -1246,7 +1246,7 @@ hipError_t hipInit(unsigned int flags);
  *
  * @param [out] driverVersion
  *
- * @returns #hipSuccess, #hipErrorInavlidValue
+ * @returns #hipSuccess, #hipErrorInvalidValue
  *
  * @warning The HIP feature set does not correspond to an exact CUDA SDK driver revision.
  * This function always set *driverVersion to 4 as an approximation though HIP supports
@@ -1262,7 +1262,8 @@ hipError_t hipDriverGetVersion(int* driverVersion);
  *
  * @param [out] runtimeVersion
  *
- * @returns #hipSuccess, #hipErrorInavlidValue
+ * @returns #hipSuccess, #hipErrorInvalidValue
+ *
  *
  * @warning The version definition of HIP runtime is different from CUDA.
  * On AMD platform, the function returns HIP runtime version,
@@ -1277,7 +1278,7 @@ hipError_t hipRuntimeGetVersion(int* runtimeVersion);
  * @param [out] device
  * @param [in] ordinal
  *
- * @returns #hipSuccess, #hipErrorInavlidDevice
+ * @returns #hipSuccess, #hipErrorInvalidDevice
  */
 hipError_t hipDeviceGet(hipDevice_t* device, int ordinal);

@@ -1287,7 +1288,7 @@ hipError_t hipDeviceGet(hipDevice_t* device, int ordinal);
  * @param [out] minor
  * @param [in] device
  *
- * @returns #hipSuccess, #hipErrorInavlidDevice
+ * @returns #hipSuccess, #hipErrorInvalidDevice
  */
 hipError_t hipDeviceComputeCapability(int* major, int* minor, hipDevice_t device);
 /**
@@ -1296,7 +1297,7 @@ hipError_t hipDeviceComputeCapability(int* major, int* minor, hipDevice_t device
  * @param [in] len
  * @param [in] device
  *
- * @returns #hipSuccess, #hipErrorInavlidDevice
+ * @returns #hipSuccess, #hipErrorInvalidDevice
  */
 hipError_t hipDeviceGetName(char* name, int len, hipDevice_t device);
 /**
@@ -1318,7 +1319,7 @@ hipError_t hipDeviceGetUuid(hipUUID* uuid, hipDevice_t device);
  * @param [in] srcDevice
  * @param [in] dstDevice
  *
- * @returns #hipSuccess, #hipErrorInavlidDevice
+ * @returns #hipSuccess, #hipErrorInvalidDevice
  */
 hipError_t hipDeviceGetP2PAttribute(int* value, hipDeviceP2PAttr attr,
 int srcDevice, int dstDevice);
@@ -1328,23 +1329,23 @@ hipError_t hipDeviceGetP2PAttribute(int* value, hipDeviceP2PAttr attr,
  * @param [in] len
  * @param [in] device
  *
- * @returns #hipSuccess, #hipErrorInavlidDevice
+ * @returns #hipSuccess, #hipErrorInvalidDevice
  */
 hipError_t hipDeviceGetPCIBusId(char* pciBusId, int len, int device);
 /**
  * @brief Returns a handle to a compute device.
  * @param [out] device handle
  * @param [in] PCI Bus ID
  *
- * @returns #hipSuccess, #hipErrorInavlidDevice, #hipErrorInvalidValue
+ * @returns #hipSuccess, #hipErrorInvalidDevice, #hipErrorInvalidValue
  */
 hipError_t hipDeviceGetByPCIBusId(int* device, const char* pciBusId);
 /**
  * @brief Returns the total amount of memory on the device.
  * @param [out] bytes
  * @param [in] device
  *
- * @returns #hipSuccess, #hipErrorInavlidDevice
+ * @returns #hipSuccess, #hipErrorInvalidDevice
  */
 hipError_t hipDeviceTotalMem(size_t* bytes, hipDevice_t device);
 // doxygen end initialization
