Commit d6d4f00

Merge pull request #306 from pavel-esir/update_readme
remove disable-statefull from speculative decoding Readme
2 parents 1b8f68b + a3b6b39 commit d6d4f00

File tree

1 file changed (+2 -5 lines)

text_generation/causal_lm/cpp/README.md (+2 -5)
````diff
@@ -44,11 +44,8 @@ Speculative decoding works the following way. The draft model predicts the next
 
 This approach reduces the need for multiple infer requests to the main model, enhancing performance. For instance, in more predictable parts of text generation, the draft model can, in best-case scenarios, generate the next K tokens that exactly match the target. In that case they are validated in a single inference request to the main model (which is bigger, more accurate, but slower) instead of running K subsequent requests. More details can be found in the original papers https://arxiv.org/pdf/2211.17192.pdf and https://arxiv.org/pdf/2302.01318.pdf.
 
-Important note: models should belong to the same familiy and have same tokenizers, and they both should be converted with `--disable-stateful`, e.g.:
-
-```sh
-python3 ../../../llm_bench/python/convert.py --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --output_dir ./TinyLlama-1.1B-Chat-v1.0/ --precision FP16
-```
+> [!NOTE]
+> Models should belong to the same family and have same tokenizers.
 
 ## Install OpenVINO
 
````
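The draft/validate loop described in the README text above can be sketched in a few lines. This is a minimal illustration, not the repository's C++ implementation: the two "models" are hypothetical deterministic stand-ins (not OpenVINO infer requests), each mapping a token sequence to the next token id, and `speculative_decode`, `draft_model`, and `main_model` are names invented for this sketch.

```python
def draft_model(tokens):
    """Toy cheap draft model: next id is previous id + 1 (mod 50)."""
    return (tokens[-1] + 1) % 50

def main_model(tokens):
    """Toy accurate main model: agrees with the draft except when the
    next id would be a multiple of 10, where it emits 0 instead."""
    t = (tokens[-1] + 1) % 50
    return 0 if t % 10 == 0 else t

def speculative_decode(prompt, num_tokens, k=4):
    """Generate num_tokens ids, validating up to K draft tokens per
    main-model pass instead of one main-model call per token."""
    tokens = list(prompt)
    main_calls = 0
    while len(tokens) - len(prompt) < num_tokens:
        # 1) Autoregressively draft K candidate tokens with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2) Validate the candidates left to right. A real implementation
        #    scores all K positions in one batched infer request; here we
        #    replay them but still count it as a single main-model pass.
        main_calls += 1
        for t in draft:
            expected = main_model(tokens)
            if t == expected:
                tokens.append(t)          # draft token accepted
            else:
                tokens.append(expected)   # first mismatch: keep the main
                break                     # model's token, discard the rest
            if len(tokens) - len(prompt) >= num_tokens:
                break
    return tokens[len(prompt):], main_calls

out, calls = speculative_decode([7], num_tokens=8, k=4)
```

In this toy run the draft mispredicts once, so 8 tokens cost only 3 main-model passes rather than 8, which is the saving the paragraph above describes.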
