
Commit c10cb6d
Threading clarification
1 parent 83e9f07

File tree: 1 file changed (+7, −1 lines)


llm_bench/python/README.md

@@ -138,4 +138,10 @@ For example, --load_config config.json as following in OpenVINO 2024.0.0 will re
 > If you encounter any errors, please check **[NOTES.md](./doc/NOTES.md)** which provides solutions to the known errors.
 ### 2. Image generation
 > To configure more parameters for image generation models, reference to **[IMAGE_GEN.md](./doc/IMAGE_GEN.md)**
-### 3. Threading
+### 3. CPU Threading
+
+OpenVINO uses [oneTBB](https://github.com/oneapi-src/oneTBB/) as its default threading library, while Torch uses [OpenMP](https://www.openmp.org/). Both threading libraries use ['busy-wait' spinning](https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fSPINCOUNT.html) by default. As a result, in an LLM pipeline that runs inference on CPU with OpenVINO and postprocessing with Torch (for example, greedy search or beam search), there is threading overhead when switching between inference (OpenVINO with oneTBB) and postprocessing (Torch with OpenMP).
+
+**Alternative solutions**
+1. Use the --genai option, which uses the OpenVINO GenAI APIs instead of the optimum-intel APIs and executes postprocessing with the OpenVINO GenAI APIs as well.
+2. Without the --genai option (i.e. using the optimum-intel APIs), set the environment variable [OMP_WAIT_POLICY](https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html) to PASSIVE, which disables the OpenMP 'busy-wait'; benchmark.py will also limit the number of Torch threads to avoid using CPU cores that are busy-waiting in OpenVINO inference.
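The second workaround above can be sketched as a shell snippet (a minimal illustration of exporting the wait policy before launching the benchmark; the commented-out invocation is a placeholder, not an exact command from this diff):

```shell
# Disable OpenMP busy-wait spinning for the Torch postprocessing threads
# before starting the benchmark via the optimum-intel path (no --genai).
export OMP_WAIT_POLICY=PASSIVE

# Any child process started from this shell inherits the policy,
# e.g.:  python benchmark.py <args>
# Verify the variable is set as expected:
echo "$OMP_WAIT_POLICY"
```

Exporting the variable in the launching shell is enough, since Torch/OpenMP reads OMP_WAIT_POLICY once at thread-pool startup.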
