Commit ccc7610

Updating references to Generative AI guide
Signed-off-by: sgolebiewski-intel <sebastianx.golebiewski@intel.com>
1 parent e4f0cb6 commit ccc7610

10 files changed: +19 -19 lines changed

notebooks/image-to-image-genai/README.md (+6 -6)

@@ -5,7 +5,7 @@ Image-to-image is the task of transforming an input image through a variety of p
 One of the most popular use cases of image-to-image is style transfer. With style transfer models:
 * a regular photo can be transformed into a variety of artistic styles or genres, such as a watercolor painting, a comic book illustration and more.
 * new images can be generated using a text prompt, in the style of a reference input image.
-
+
 Latent diffusion models can be used for performing image-to-image generation. Diffusion-based Image-to-image is similar to [text-to-image](../text-to-image-genai/text-to-image-genai.ipynb), but in addition to a prompt, you can also pass an initial image as a starting point for the diffusion process. The initial image is encoded to latent space and noise is added to it. Then the latent diffusion model takes a prompt and the noisy latent image, predicts the added noise, and removes the predicted noise from the initial latent image to get the new latent image. Lastly, a decoder decodes the new latent image back into an image.

 ![pipe.png](https://user-images.githubusercontent.com/29454499/260981188-c112dd0a-5752-4515-adca-8b09bea5d14a.png)
@@ -18,15 +18,15 @@ In this tutorial, we consider how to use OpenVINO GenAI for performing image-to-

 This library is friendly to PC and laptop execution, and optimized for resource consumption. It requires no external dependencies to run generative models as it already includes all the core functionality (e.g. tokenization via openvino-tokenizers).

-OpenVINO GenAI supports popular diffusion models like Stable Diffusion or SDXL for performing image generation. You can find supported models list in [OpenVINO GenAI documentation](https://github.com/openvinotoolkit/openvino.genai/blob/master/SUPPORTED_MODELS.md#image-generation-models). Previously, we considered how to run text-to-image generation with OpenVINO GenAI and apply multiple LoRA adapters, mow is image-to-image.
+OpenVINO GenAI supports popular diffusion models like Stable Diffusion or SDXL for performing image generation. You can find supported models list in [OpenVINO GenAI documentation](https://github.com/openvinotoolkit/openvino.genai/blob/master/SUPPORTED_MODELS.md#image-generation-models). Previously, we considered how to run text-to-image generation with OpenVINO GenAI and apply multiple LoRA adapters, mow is image-to-image.

 ## Notebook Contents

-In this notebook we will demonstrate how to use Latent Diffusion models like Stable Diffusion 1.5, 2.1, LCM, SDXL for image to image generation using OpenVINO GenAI Image2ImagePipeline.
-All it takes is two steps:
+In this notebook we will demonstrate how to use Latent Diffusion models like Stable Diffusion 1.5, 2.1, LCM, SDXL for image to image generation using OpenVINO GenAI Image2ImagePipeline.
+All it takes is two steps:
 1. Export OpenVINO IR format model using the [Hugging Face Optimum](https://huggingface.co/docs/optimum/installation) library accelerated by OpenVINO integration.
 The Hugging Face Optimum Intel API is a high-level API that enables us to convert and quantize models from the Hugging Face Transformers library to the OpenVINO™ IR format. For more details, refer to the [Hugging Face Optimum Intel documentation](https://huggingface.co/docs/optimum/intel/inference).
-1. Run inference using the standard [Image-to-Image Generation pipeline](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide.html) from OpenVINO GenAI.
+1. Run inference using the standard [Image-to-Image Generation pipeline](https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai.html) from OpenVINO GenAI.

 The tutorial consists of following steps:
 - Install prerequisites
@@ -38,7 +38,7 @@ The tutorial consists of following steps:
 - Explore advanced options for generation results improvement
 - Launch interactive demo

-![](https://github.com/user-attachments/assets/280736ea-d51a-43f3-a1ae-21af5831005f)
+![](https://github.com/user-attachments/assets/280736ea-d51a-43f3-a1ae-21af5831005f)


 ## Installation Instructions
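For context on the two steps this README describes, here is a minimal, hedged sketch of export followed by inference with the OpenVINO GenAI Image2ImagePipeline referenced by the updated link. The model id, directory names, input file, and parameter values are illustrative placeholders only:

```python
# Step 1 (run in a shell): export a diffusion model to OpenVINO IR with Hugging Face Optimum, e.g.
#   optimum-cli export openvino --model dreamlike-art/dreamlike-anime-1.0 --weight-format fp16 model_dir

# Step 2: run image-to-image generation with OpenVINO GenAI.
import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image

pipe = ov_genai.Image2ImagePipeline("model_dir", "CPU")  # device could also be "GPU"

# Wrap the initial image in an ov.Tensor with a leading batch dimension.
init_image = np.array(Image.open("input.png").convert("RGB"), dtype=np.uint8)[None]
image_tensor = ov.Tensor(init_image)

result = pipe.generate(
    "watercolor painting of a mountain lake",
    image_tensor,
    strength=0.7,            # how far generation may move away from the initial image
    num_inference_steps=25,
)
Image.fromarray(result.data[0]).save("output.png")
```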

notebooks/llm-agent-functioncall/llm-agent-functioncall-qwen.ipynb (+1 -1)

@@ -305,7 +305,7 @@
 "id": "d70905e2",
 "metadata": {},
 "source": [
-"You can get additional inference speed improvement with [Dynamic Quantization of activations and KV-cache quantization on CPU](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/llm-inference-hf.html#enabling-openvino-runtime-optimizations). These options can be enabled with `ov_config` as follows:"
+"You can get additional inference speed improvement with [Dynamic Quantization of activations and KV-cache quantization on CPU](https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-optimum-intel.html#enabling-openvino-runtime-optimizations). These options can be enabled with `ov_config` as follows:"
 ]
 },
 {
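The `ov_config` mentioned in that cell is a plain dictionary of OpenVINO runtime properties passed to Optimum Intel. A minimal sketch of the pattern, with illustrative option values and a placeholder model directory:

```python
from optimum.intel.openvino import OVModelForCausalLM

# Enable dynamic quantization of activations and 8-bit KV-cache on CPU
# (illustrative values; see the linked guide for the supported options).
ov_config = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
    "PERFORMANCE_HINT": "LATENCY",
}

model = OVModelForCausalLM.from_pretrained(
    "qwen-ov-model-dir",   # path to an already exported OpenVINO IR model
    device="CPU",
    ov_config=ov_config,
)
```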

notebooks/llm-agent-react/llm-agent-react-langchain.ipynb (+1 -1)

@@ -556,7 +556,7 @@
 "id": "d70905e2",
 "metadata": {},
 "source": [
-"You can get additional inference speed improvement with [Dynamic Quantization of activations and KV-cache quantization on CPU](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/llm-inference-hf.html#enabling-openvino-runtime-optimizations). These options can be enabled with `ov_config` as follows:"
+"You can get additional inference speed improvement with [Dynamic Quantization of activations and KV-cache quantization on CPU](https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-optimum-intel.html#enabling-openvino-runtime-optimizations). These options can be enabled with `ov_config` as follows:"
 ]
 },
 {

notebooks/llm-agent-react/llm-agent-react.ipynb (+1 -1)

@@ -240,7 +240,7 @@
 "Model class initialization starts with calling `from_pretrained` method. When downloading and converting Transformers model, the parameter `export=True` should be added (as we already converted model before, we do not need to provide this parameter). We can save the converted model for the next usage with the `save_pretrained` method.\n",
 "Tokenizer class and pipelines API are compatible with Optimum models.\n",
 "\n",
-"You can find more details about OpenVINO LLM inference using HuggingFace Optimum API in [LLM inference guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html)."
+"You can find more details about OpenVINO LLM inference using HuggingFace Optimum API in [LLM inference guide](https://docs.openvino.ai/2025/openvino-workflow-generative.html)."
 ]
 },
 {
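As a reminder of the Optimum Intel pattern that cell refers to, a short sketch (the model id and save directory are placeholders):

```python
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder model id

# export=True converts the Transformers model to OpenVINO IR on the fly;
# omit it when loading a model that has already been converted.
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
model.save_pretrained("qwen2.5-1.5b-ov")  # reuse the converted model later

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("qwen2.5-1.5b-ov")
```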

notebooks/llm-chatbot/llm-chatbot-generate-api.ipynb (+2 -2)

@@ -193,7 +193,7 @@
 "* **INT8** is an 8-bit weight-only quantization provided by [NNCF](https://github.com/openvinotoolkit/nncf): This method compresses weights to an 8-bit integer data type, which balances model size reduction and accuracy, making it a versatile option for a broad range of applications.\n",
 "* **INT4** is an 4-bit weight-only quantization provided by [NNCF](https://github.com/openvinotoolkit/nncf). involves quantizing weights to an unsigned 4-bit integer symmetrically around a fixed zero point of eight (i.e., the midpoint between zero and 15). in case of **symmetric quantization** or asymmetrically with a non-fixed zero point, in case of **asymmetric quantization** respectively. Compared to INT8 compression, INT4 compression improves performance even more, but introduces a minor drop in prediction quality. INT4 it ideal for situations where speed is prioritized over an acceptable trade-off against accuracy.\n",
 "* **INT4 AWQ** is an 4-bit activation-aware weight quantization. [Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978) (AWQ) is an algorithm that tunes model weights for more accurate INT4 compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time for tuning weights on a calibration dataset. We will use `wikitext-2-raw-v1/train` subset of the [Wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset for calibration.\n",
-"* **INT4 NPU-friendly** is an 4-bit channel-wise quantization. This approach is [recommended](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html) for LLM inference using NPU.\n",
+"* **INT4 NPU-friendly** is an 4-bit channel-wise quantization. This approach is [recommended](https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai-on-npu.html) for LLM inference using NPU.\n",
 "\n",
 "<details>\n",
 " <summary><b>Click here to see available models options</b></summary>\n",
@@ -624,7 +624,7 @@
 "\n",
 "The difference between chatbot and instruction-following pipelines is that the model should have \"memory\" to find correct answers on the chain of connected questions. OpenVINO GenAI uses `KVCache` representation for maintain a history of conversation. By default, `LLMPipeline` resets `KVCache` after each `generate` call. To keep conversational history, we should move LLMPipeline to chat mode using `start_chat()` method.\n",
 "\n",
-"More info about OpenVINO LLM inference can be found in [LLM Inference Guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html)\n",
+"More info about OpenVINO LLM inference can be found in [LLM Inference Guide](https://docs.openvino.ai/2025/openvino-workflow-generative.html)\n",
 "</details>"
 ]
 },
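A minimal sketch of the chat mode described in that cell, using the OpenVINO GenAI `LLMPipeline` (the model directory is a placeholder):

```python
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("llama-3.2-1b-int4-ov", "CPU")

config = ov_genai.GenerationConfig()
config.max_new_tokens = 128

# start_chat() keeps the KV-cache between generate() calls,
# so follow-up questions can refer to earlier turns.
pipe.start_chat()
print(pipe.generate("What is OpenVINO?", config))
print(pipe.generate("How does it differ from ONNX Runtime?", config))
pipe.finish_chat()  # resets the conversation history
```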

notebooks/llm-chatbot/llm-chatbot.ipynb (+3 -3)

@@ -904,7 +904,7 @@
 "Model class initialization starts with calling `from_pretrained` method. When downloading and converting Transformers model, the parameter `export=True` should be added (as we already converted model before, we do not need to provide this parameter). We can save the converted model for the next usage with the `save_pretrained` method.\n",
 "Tokenizer class and pipelines API are compatible with Optimum models.\n",
 "\n",
-"You can find more details about OpenVINO LLM inference using HuggingFace Optimum API in [LLM inference guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html)."
+"You can find more details about OpenVINO LLM inference using HuggingFace Optimum API in [LLM inference guide](https://docs.openvino.ai/2025/openvino-workflow-generative.html)."
 ]
 },
 {
@@ -1010,7 +1010,7 @@
 "\n",
 "![generation pipeline](https://user-images.githubusercontent.com/29454499/255523209-d9336491-c7ba-4dc1-98f0-07f23743ce89.png)\n",
 "\n",
-"As can be seen, the pipeline very similar to instruction-following with only changes that previous conversation history additionally passed as input with next user question for getting wider input context. On the first iteration, the user provided instructions joined to conversation history (if exists) converted to token ids using a tokenizer, then prepared input provided to the model. The model generates probabilities for all tokens in logits format The way the next token will be selected over predicted probabilities is driven by the selected decoding methodology. You can find more information about the most popular decoding methods in this [blog](https://huggingface.co/blog/how-to-generate). The result generation updates conversation history for next conversation step. it makes stronger connection of next question with previously provided and allows user to make clarifications regarding previously provided answers.https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html"
+"As can be seen, the pipeline very similar to instruction-following with only changes that previous conversation history additionally passed as input with next user question for getting wider input context. On the first iteration, the user provided instructions joined to conversation history (if exists) converted to token ids using a tokenizer, then prepared input provided to the model. The model generates probabilities for all tokens in logits format The way the next token will be selected over predicted probabilities is driven by the selected decoding methodology. You can find more information about the most popular decoding methods in this [blog](https://huggingface.co/blog/how-to-generate). The result generation updates conversation history for next conversation step. it makes stronger connection of next question with previously provided and allows user to make clarifications regarding previously provided answers.https://docs.openvino.ai/2025/openvino-workflow-generative.html"
 ]
 },
 {
@@ -1038,7 +1038,7 @@
 " - **Medium top_p** (e.g., 0.8): The AI model considers tokens with a higher cumulative probability, such as \"playing,\" \"sleeping,\" and \"eating.\" \n",
 " - **High top_p** (e.g., 1.0): The AI model considers all tokens, including those with lower probabilities, such as \"driving\" and \"flying.\" \n",
 " * `Top-k` is an another popular sampling strategy. In comparison with Top-P, which chooses from the smallest possible set of words whose cumulative probability exceeds the probability P, in Top-K sampling K most likely next words are filtered and the probability mass is redistributed among only those K next words. In our example with cat, if k=3, then only \"playing\", \"sleeping\" and \"eating\" will be taken into account as possible next word.\n",
-" * `Repetition Penalty` This parameter can help penalize tokens based on how frequently they occur in the text, including the input prompt. A token that has already appeared five times is penalized more heavily than a token that has appeared only one time. A value of 1 means that there is no penalty and values larger than 1 discourage repeated tokens.https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html"
+" * `Repetition Penalty` This parameter can help penalize tokens based on how frequently they occur in the text, including the input prompt. A token that has already appeared five times is penalized more heavily than a token that has appeared only one time. A value of 1 means that there is no penalty and values larger than 1 discourage repeated tokens.https://docs.openvino.ai/2025/openvino-workflow-generative.html"
 ]
 },
 {
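The sampling parameters discussed in that last hunk map onto OpenVINO GenAI generation settings. A hedged sketch with illustrative values and a placeholder model directory:

```python
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("llama-3.2-1b-int4-ov", "CPU")

config = ov_genai.GenerationConfig()
config.do_sample = True          # sample instead of greedy decoding
config.temperature = 0.7
config.top_p = 0.8               # nucleus sampling threshold
config.top_k = 50                # keep only the 50 most likely tokens
config.repetition_penalty = 1.1  # values > 1 discourage repeated tokens
config.max_new_tokens = 256

print(pipe.generate("Write a short story about a cat.", config))
```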

notebooks/multilora-image-generation/multilora-image-generation.ipynb (+1 -1)

@@ -104,7 +104,7 @@
 "LoRA can be easily added to [Diffusers pipeline](https://huggingface.co/docs/diffusers/main/en/using-diffusers/loading_adapters#lora) before export. At the export stage, LoRA weights will be fused to original model weights and converted model will preserve LoRA provided behavior. This approach is suitable when you need model with adapter capabilities by default and it does not required configuration at inference time (e.g. changing weight coefficient for adapter).\n",
 "For example, we can use this method for speedup generation process with integration [LCM LoRA](https://huggingface.co/blog/lcm_lora). Previously, we already considered with approach in this [tutorial](../latent-consistency-models-image-generation/lcm-lora-controlnet.ipynb).\n",
 "\n",
-"Using `optimum-cli` for exporting models requires to provide model id on HuggingFace Hub or local directory with saved model. In case, if model stored in multiple separated repositories or directories (e.g. you want to replace VAE component or add LoRA), it should be merged and saved on disk before export. For avoiding this, we will use `export_from_model` function that accepts initialized model. Additionally, for using model with OpenVINO GenAI, we need to export tokenizers to OpenVINO format using [OpenVINO Tokenizers](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/ov-tokenizers.html) library.\n",
+"Using `optimum-cli` for exporting models requires to provide model id on HuggingFace Hub or local directory with saved model. In case, if model stored in multiple separated repositories or directories (e.g. you want to replace VAE component or add LoRA), it should be merged and saved on disk before export. For avoiding this, we will use `export_from_model` function that accepts initialized model. Additionally, for using model with OpenVINO GenAI, we need to export tokenizers to OpenVINO format using [OpenVINO Tokenizers](https://docs.openvino.ai/2025/openvino-workflow-generative/ov-tokenizers.html) library.\n",
 "\n",
 "In this tutorial we will use [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model, but the same steps are also applicable to other models of Stable Diffusion family."
 ]
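A rough sketch of the export path that cell describes: fuse a LoRA adapter into an in-memory Diffusers pipeline, export it with `export_from_model`, and convert the tokenizer with OpenVINO Tokenizers. The adapter id, output layout, and exact helper signatures are assumptions for illustration only:

```python
from pathlib import Path

import openvino as ov
from diffusers import StableDiffusionXLPipeline
from openvino_tokenizers import convert_tokenizer
from optimum.exporters.openvino import export_from_model

output_dir = Path("sdxl-lora-ov")

# Attach and fuse a LoRA adapter in memory, so no merged checkpoint
# needs to be written to disk before the export step.
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")  # placeholder adapter id
pipe.fuse_lora()

# Export the already-initialized pipeline straight to OpenVINO IR.
export_from_model(pipe, output_dir)

# OpenVINO GenAI also needs the tokenizer in OpenVINO IR format.
ov_tokenizer = convert_tokenizer(pipe.tokenizer)
ov.save_model(ov_tokenizer, output_dir / "openvino_tokenizer.xml")
```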

notebooks/speculative-sampling/speculative-sampling.ipynb (+1 -1)

@@ -90,7 +90,7 @@
 "As example, we will use already converted LLMs from [OpenVINO collection](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd).\n",
 "You can find OpenVINO optimized FastDraft models can be found in this [collection](https://huggingface.co/collections/OpenVINO/speculative-decoding-draft-models-673f5d944d58b29ba6e94161). As example we will use [Phi-3-mini-4k-instruct-int4-ov](https://huggingface.co/OpenVINO/Phi-3-mini-4k-instruct-int4-ov) as target model and [Phi-3-mini-FastDraft-50M-int8-ov](https://huggingface.co/OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov) as draft.\n",
 "\n",
-"In case, if you want run own models, you should convert them using [Hugging Face Optimum](https://huggingface.co/docs/optimum/intel/openvino/export) library accelerated by OpenVINO integration. More details about model preparation can be found in [OpenVINO LLM inference guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/llm-inference-native-ov.html#convert-hugging-face-tokenizer-and-model-to-openvino-ir-format)"
+"In case, if you want run own models, you should convert them using [Hugging Face Optimum](https://huggingface.co/docs/optimum/intel/openvino/export) library accelerated by OpenVINO integration. More details about model preparation can be found in [OpenVINO LLM inference guide](https://docs.openvino.ai/2025/openvino-workflow-generative/genai-model-preparation.html)"
 ]
 },
 {
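For context, a minimal sketch of pairing the target and draft models mentioned above for speculative decoding with OpenVINO GenAI (the paths point to locally downloaded copies of the two models; parameter values are illustrative):

```python
import openvino_genai as ov_genai

main_model_path = "Phi-3-mini-4k-instruct-int4-ov"     # target model directory
draft_model_path = "Phi-3-mini-FastDraft-50M-int8-ov"  # draft model directory

# The draft model proposes candidate tokens; the target model verifies them.
pipe = ov_genai.LLMPipeline(
    main_model_path,
    "CPU",
    draft_model=ov_genai.draft_model(draft_model_path, "CPU"),
)

config = ov_genai.GenerationConfig()
config.max_new_tokens = 128
config.num_assistant_tokens = 5  # tokens drafted per verification step

print(pipe.generate("Explain speculative decoding in one paragraph.", config))
```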

notebooks/text-to-image-genai/README.md (+2 -2)

@@ -8,10 +8,10 @@ In this tutorial we consider how to use OpenVINO GenAI for image generation scen

 ## Notebook Contents

-In this notebook we will demonstrate how to use text to image models like Stable Diffusion 1.5, 2.1, LCM using [Dreamlike Anime 1.0](https://huggingface.co/dreamlike-art/dreamlike-anime-1.0) as an example. All it takes is two steps:
+In this notebook we will demonstrate how to use text to image models like Stable Diffusion 1.5, 2.1, LCM using [Dreamlike Anime 1.0](https://huggingface.co/dreamlike-art/dreamlike-anime-1.0) as an example. All it takes is two steps:
 1. Export OpenVINO IR format model using the [Hugging Face Optimum](https://huggingface.co/docs/optimum/installation) library accelerated by OpenVINO integration.
 The Hugging Face Optimum Intel API is a high-level API that enables us to convert and quantize models from the Hugging Face Transformers library to the OpenVINO™ IR format. For more details, refer to the [Hugging Face Optimum Intel documentation](https://huggingface.co/docs/optimum/intel/inference).
-2. Run inference using the standard [Text-to-Image Generation pipeline](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide.html) from OpenVINO GenAI.
+2. Run inference using the standard [Text-to-Image Generation pipeline](https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai.html) from OpenVINO GenAI.

 The tutorial consists of following steps:
 - Prerequisites
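A brief sketch of the two steps that README lists, using the OpenVINO GenAI Text2ImagePipeline (the export command is shown as a comment; model id, output directory, and prompt are placeholders):

```python
# Step 1 (run in a shell): export the model to OpenVINO IR, e.g.
#   optimum-cli export openvino --model dreamlike-art/dreamlike-anime-1.0 --weight-format fp16 dreamlike_anime_1_0_ov

# Step 2: generate an image with OpenVINO GenAI.
import openvino_genai as ov_genai
from PIL import Image

pipe = ov_genai.Text2ImagePipeline("dreamlike_anime_1_0_ov", "CPU")
image_tensor = pipe.generate(
    "anime portrait of a girl with silver hair, cherry blossoms",
    width=512,
    height=512,
    num_inference_steps=20,
)
Image.fromarray(image_tensor.data[0]).save("result.png")
```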

notebooks/text-to-image-genai/text-to-image-genai.ipynb (+1 -1)

@@ -15,7 +15,7 @@
 "In this notebook we will demonstrate how to use text to image models like Stable Diffusion 1.5, 2.1, LCM using [Dreamlike Anime 1.0](https://huggingface.co/dreamlike-art/dreamlike-anime-1.0) as an example. All it takes is two steps: \n",
 "1. Export OpenVINO IR format model using the [Hugging Face Optimum](https://huggingface.co/docs/optimum/installation) library accelerated by OpenVINO integration.\n",
 "The Hugging Face Optimum Intel API is a high-level API that enables us to convert and quantize models from the Hugging Face Transformers library to the OpenVINO™ IR format. For more details, refer to the [Hugging Face Optimum Intel documentation](https://huggingface.co/docs/optimum/intel/inference).\n",
-"2. Run inference using the [Text-to-Image Generation pipeline](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide.html) from OpenVINO GenAI.\n",
+"2. Run inference using the [Text-to-Image Generation pipeline](https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai.html) from OpenVINO GenAI.\n",
 "\n",
 "\n",
 "\n",
