
Agentic kit: Fix issues, improve performance, and add shopping cart feature #212

Merged · 28 commits · Mar 14, 2025

Commits
a9adb66
Moved files, updated readme
adrianboguszewski Jan 20, 2025
9b340b3
Moved code to main function
adrianboguszewski Jan 20, 2025
09cc821
Agentic LLM RAG: Fix issues, improve performance, and add shopping ca…
antoniomtz Feb 27, 2025
73c3b9e
Removing config files
antoniomtz Feb 27, 2025
0ed855b
solve merge conflicts
antoniomtz Feb 27, 2025
a680b2d
Adding missing requirements.txt file
antoniomtz Feb 27, 2025
5bcf062
Changing app.py to main.py for comply with github actions
antoniomtz Feb 27, 2025
599645a
Changing app.py to main.py for comply with github actions
antoniomtz Feb 27, 2025
4a21c52
remove personality arg
antoniomtz Feb 28, 2025
1cd7643
change default model for convert and optimize
antoniomtz Feb 28, 2025
cd72f88
Improving documentation and public arg in main.py
antoniomtz Mar 5, 2025
47ca638
Address PR feedback
antoniomtz Mar 6, 2025
643483a
Merge branch 'master' into agentic-kit
adrianboguszewski Mar 6, 2025
326b05e
fix hf_token arg
antoniomtz Mar 6, 2025
0028021
Merge branch 'agentic-kit' of https://github.com/openvinotoolkit/open…
antoniomtz Mar 6, 2025
92fcc62
Adding main.py wrapper for CI/CDs
antoniomtz Mar 10, 2025
414dc0c
Rename app.py
antoniomtz Mar 10, 2025
466ea9d
Adding device arg
antoniomtz Mar 10, 2025
1e1861c
Add model mapping for bge models
antoniomtz Mar 10, 2025
a0a7258
Update args in main.py
antoniomtz Mar 10, 2025
cbf9da4
fixing main.py
antoniomtz Mar 10, 2025
92667ac
Set AUTO:GPU,CPU to main.py
antoniomtz Mar 10, 2025
4bb4705
Update README
antoniomtz Mar 12, 2025
11b3f89
Update gif in README and add queue() to Gradio demo
antoniomtz Mar 12, 2025
746d9d7
Update README
antoniomtz Mar 12, 2025
ce0a65f
Update README
antoniomtz Mar 12, 2025
72d3136
test llama 3B for CICDs
antoniomtz Mar 13, 2025
99c782a
update image
antoniomtz Mar 14, 2025
201 changes: 201 additions & 0 deletions ai_ref_kits/agentic_llm_rag/README.md
@@ -0,0 +1,201 @@
<div id="top" align="center">
<h1>AI Insight Agent with RAG</h1>
<h4>
<a href="https://www.intel.com/content/www/us/en/developer/topic-technology/edge-5g/open-potential.html">🏠&nbsp;About&nbsp;the&nbsp;Kits&nbsp;</a>
<!-- <a href="">👨‍💻&nbsp;Code&nbsp;Demo&nbsp;Video</a> -->
</h4>
</div>

[![Apache License Version 2.0](https://img.shields.io/badge/license-Apache_2.0-green.svg)](https://github.com/openvinotoolkit/openvino_build_deploy/blob/master/LICENSE.txt)

<p align="center">
<img src="https://github.com/user-attachments/assets/dd626685-7aa6-4e67-a929-5e9be2982800" width="500">
</p>

The AI Insight Agent with RAG uses Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to interpret user prompts, engage in meaningful dialogue, perform calculations, ground its answers in a supplied knowledge base, and interact with the user to add items to a virtual shopping cart. This solution uses the OpenVINO™ toolkit to power the AI models at the edge. Designed for both consumers and employees, it functions as a smart, personalized retail assistant, offering an interactive and user-friendly experience similar to an advanced digital kiosk.

This kit uses the following technology stack:
- [OpenVINO Toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html) ([docs](https://docs.openvino.ai/))
- [Qwen2-7B-Instruct](https://huggingface.co/Qwen)
- [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5)
- [Gradio interface](https://www.gradio.app/docs/gradio/chatinterface)

Check out our [AI Reference Kits repository](/) for other kits.

![ai-insight-agent-with-rag](https://github.com/user-attachments/assets/da97bea7-29e8-497f-b7ba-4e00c79773f1)

<details open><summary><b>Table of Contents</b></summary>

- [Getting Started](#getting-started)
- [Installing Prerequisites](#installing-prerequisites)
- [Setting Up Your Environment](#setting-up-your-environment)
- [Converting and Optimizing the Model](#converting-and-optimizing-the-model)
- [Running the Application](#running-the-application-gradio-interface)
- [Additional Resources](#additional-resources)

</details>

# Getting Started

To get started with the AI Insight Agent with RAG, install Python, set up your environment, and run the application. We recommend Ubuntu 24.04 for setting up and running this project.

## Installing Prerequisites

This project requires Python 3.8 or higher and a few libraries. If you don't already have Python installed on your machine, go to [https://www.python.org/downloads/](https://www.python.org/downloads/) and download the latest version for your operating system. Follow the prompts to install Python, and make sure to select the option to add Python to your PATH environment variable.

To install the required system build tools and Python development packages (on Ubuntu), run this command:

```shell
sudo apt install git gcc python3-venv python3-dev
```

_NOTE: If you are using Windows, you might also have to install [Microsoft Visual C++ Redistributable](https://aka.ms/vs/16/release/vc_redist.x64.exe)._

## Setting Up Your Environment

To set up your environment, you first clone the repository, then create a virtual environment, activate the environment, and install the packages.

### Clone the Repository

To clone the repository, run this command:

```shell
git clone https://github.com/openvinotoolkit/openvino_build_deploy.git
```

This command clones the repository into a directory named "openvino_build_deploy" in the current directory. After the repository is cloned, run the following command to go to the kit's directory:


```shell
cd openvino_build_deploy/ai_ref_kits/agentic_llm_rag
```

### Create a Virtual Environment

To create a virtual environment, open your terminal or command prompt, and go to the directory where you want to create the environment.

Run the following command:

```shell
python3 -m venv venv
```
This creates a new virtual environment named "venv" in the current directory.

### Activate the Environment

The command to activate the virtual environment depends on your operating system: Unix-based (Linux or macOS) or Windows.

To activate the virtual environment for a **Unix-based** operating system, run:

```shell
source venv/bin/activate # For Unix-based operating systems such as Linux or macOS
```

To activate the virtual environment for a **Windows** operating system, run:

```shell
venv\Scripts\activate # This command is for Windows operating systems
```
This activates the virtual environment and changes your shell's prompt to indicate that you are now working in that environment.

### Install the Packages

To install the required packages, run the following commands:

```shell
python -m pip install --upgrade pip
pip install -r requirements.txt
```
## Converting and Optimizing the Model

The application uses two separate models. Each model requires conversion and optimization for use with OpenVINO™. The following process converts and optimizes both models.

_NOTE: This reference kit downloads more than 8 GB of model data and requires a similar amount of disk space. Because of the large model size, the first run's conversion can take more than two hours and require more than 32 GB of memory. Subsequent runs finish much faster._

### Chat Model and Embedding Model Conversion

The _chat model_ is the core of the chatbot's ability to generate meaningful and context-aware responses.

The _embedding model_ represents text data (both user queries and potential responses or knowledge base entries) as numerical vectors. These vectors are essential for tasks such as semantic search and similarity matching.

This conversion script handles the conversion and optimization of:

- The chat model (`qwen2-7B`) with `int4` precision.
- The embedding model (`bge-large`) with `FP32` precision.

After the models are converted, they're saved to the model directory you specify when you run the script (with the default arguments, `model/qwen2-7B-INT4` and `model/bge-large-FP32`).

_Downloading and converting the models can take up to an hour, depending on your hardware and network connection._

To convert the chat and embedding models, run:
```shell
python convert_and_optimize_llm.py --chat_model_type qwen2-7B --embedding_model_type bge-large --precision int4 --model_dir model
```

After you run the conversion script, you can run `main.py` to launch the application.
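
If you want to sanity-check the converted embedding model before launching the full application, a minimal sketch like the following works (assuming the default `model/bge-large-FP32` output directory; the example sentences are arbitrary):

```python
# Minimal sanity check for the converted embedding model (illustrative sketch).
# Assumes the default output directory model/bge-large-FP32 produced by
# convert_and_optimize_llm.py; adjust the path if you used a different one.
import torch
from optimum.intel import OVModelForFeatureExtraction
from transformers import AutoTokenizer

model_dir = "model/bge-large-FP32"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForFeatureExtraction.from_pretrained(model_dir)

def embed(text: str) -> torch.Tensor:
    # The exported model is reshaped to a static [1, 512] input for NPU support,
    # so pad each input to that length and embed one text at a time.
    inputs = tokenizer(text, padding="max_length", max_length=512,
                       truncation=True, return_tensors="pt")
    outputs = model(**inputs)
    # bge models use the [CLS] token embedding, L2-normalized
    return torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], dim=-1)

a, b = embed("acrylic paint"), embed("oil paint")
print("cosine similarity:", float(a @ b.T))  # related texts score close to 1
```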

## Running the Application (Gradio Interface)

To run the AI Insight Agent with RAG application, you execute the `main.py` Python script.

_NOTE: This application requires more than 16 GB of memory because the models are very large (especially the chat model). If you have a less powerful device, the application might also run slowly._

With the default argument values, you can run the application as follows:

```shell
python main.py
```

For more settings, you can change the argument values:

- `--chat_model`: The path to your chat model directory (for example, `model/qwen2-7B-INT4`) that drives conversation flow and response generation.

- `--rag_pdf`: The path to the document (for example, `data/test_painting_llm_rag.pdf`) that contains additional knowledge for Retrieval-Augmented Generation (RAG).

- `--embedding_model`: The path to your embedding model directory (for example, `model/bge-large-FP32`) for understanding and matching text inputs.

- `--device`: The inference device for both models (for example, `CPU`). If you have access to a dedicated GPU (Arc, Flex), you can change the value to `GPU.1`. Possible values: `CPU`, `GPU`, `GPU.1`, `NPU`. To see which devices are available on your machine, see the snippet after the example command below.

- `--public`: Include this flag to make the Gradio interface publicly accessible over the network. Without this flag, the interface is available only on your local machine (the sketch after this list shows how such a flag typically maps to Gradio's `share` option).
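
As a rough illustration (a sketch, not the exact code in `main.py`), a Gradio app usually enables request queuing and passes such a flag to `launch`:

```python
# Illustrative sketch of how a --public flag is commonly wired into Gradio;
# main.py's actual argument handling may differ.
import argparse
import gradio as gr

parser = argparse.ArgumentParser()
parser.add_argument("--public", action="store_true",
                    help="Expose the interface via a public Gradio link")
args = parser.parse_args()

demo = gr.ChatInterface(fn=lambda message, history: f"Echo: {message}")
# queue() serializes concurrent requests, which matters for a single LLM pipeline;
# share=True asks Gradio to create a temporary public URL.
demo.queue().launch(share=args.public)
```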

To run the application with your own argument values, execute the `main.py` script with the relevant flags. Make sure to include all necessary model directory arguments.
```shell
python main.py \
--chat_model model/qwen2-7B-INT4 \
--embedding_model model/bge-large-FP32 \
--rag_pdf data/test_painting_llm_rag.pdf \
--device GPU.1 \
--public
```
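
If you are unsure which devices OpenVINO can use on your machine, you can list them with the standard OpenVINO Python API:

```python
# List the inference devices OpenVINO detects on this machine.
import openvino as ov

core = ov.Core()
for device in core.available_devices:
    print(device, "-", core.get_property(device, "FULL_DEVICE_NAME"))
```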

### System Prompt Usage in LlamaIndex ReActAgent

The LlamaIndex `ReActAgent` relies on a default system prompt that provides essential instructions to the LLM for correctly interacting with the available tools. This prompt is fundamental for enabling both tool usage and RAG (Retrieval-Augmented Generation) queries.

#### Important:
Do not override or modify the default system prompt. Altering it may prevent the LLM from using the tools or executing RAG queries properly.

#### Customizing the Prompt:
If you need to add extra rules or custom behavior, modify the _Additional Rules_ section in the `system_prompt.py` file.
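
For example, an extra rule might look like the following (a hypothetical sketch; the exact structure and variable names in `system_prompt.py` may differ):

```python
# system_prompt.py (hypothetical excerpt): extend only the additional rules;
# leave the default ReActAgent instructions untouched.
ADDITIONAL_RULES = """
Additional Rules:
- Always confirm the item and quantity with the user before adding it to the cart.
- If a question is unrelated to the store's products, politely decline to answer.
"""
```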

### Use the Web Interface
After the script runs, Gradio provides a local URL (typically `http://127.0.0.1:XXXX`) that you can open in your web browser to interact with the assistant. If you configured the application to be accessible publicly, Gradio also provides a public URL.

#### Test the Application
When you test the AI Insight Agent with RAG application, you can test both the interaction with the agent and the product selection capabilities.

1. Open a web browser and go to the Gradio-provided URL.
_For example, `http://127.0.0.1:XXXX`._
2. Test text interaction with the application.
- Type your question in the text box and press **Enter**.
_The assistant responds to your question in text form._

For further testing of the AI Insight Agent with RAG application, you can engage with the chatbot assistant by asking it questions or giving it commands that align with the assistant's capabilities. This hands-on experience helps you understand the assistant's interactive quality and performance.

Enjoy exploring the capabilities of your AI Insight Agent with RAG application!

# Additional Resources
- Learn more about [OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
- Explore [OpenVINO’s documentation](https://docs.openvino.ai/2024/home.html)

<p align="right"><a href="#top">Back to top ⬆️</a></p>
138 changes: 138 additions & 0 deletions ai_ref_kits/agentic_llm_rag/convert_and_optimize_llm.py
@@ -0,0 +1,138 @@
import argparse
from pathlib import Path

import numpy as np
import openvino as ov
from openvino.runtime import opset10 as ops
from openvino.runtime import passes
from optimum.intel import OVModelForCausalLM, OVModelForFeatureExtraction, OVWeightQuantizationConfig, OVConfig, OVQuantizer
from transformers import AutoTokenizer

MODEL_MAPPING = {
    "qwen2-7B": "Qwen/Qwen2-7B-Instruct",
    "bge-large": "BAAI/bge-large-en-v1.5",
}


def optimize_model_for_npu(model: OVModelForFeatureExtraction):
    """
    Fix some tensors in the model graph to support NPU inference

    Params:
        model: model to fix
    """
    class ReplaceTensor(passes.MatcherPass):
        def __init__(self, packed_layer_name_tensor_dict_list):
            super().__init__()
            self.model_changed = False

            # match every Multiply node in the graph
            param = passes.WrapType("opset10.Multiply")

            def callback(matcher: passes.Matcher) -> bool:
                root = matcher.get_match_root()
                if root is None:
                    return False
                for y in packed_layer_name_tensor_dict_list:
                    root_name = root.get_friendly_name()
                    if root_name.find(y["name"]) != -1:
                        # replace the node's constant operand with the largest
                        # negative fp16 value to avoid overflow on NPU
                        max_fp16 = np.array([[[[-np.finfo(np.float16).max]]]]).astype(np.float32)
                        new_tensor = ops.constant(max_fp16, ov.Type.f32, name="Constant_4431")
                        root.set_arguments([root.input_value(0).node, new_tensor])
                        packed_layer_name_tensor_dict_list.remove(y)

                return True

            self.register_matcher(passes.Matcher(param, "ReplaceTensor"), callback)

    packed_layer_tensor_dict_list = [{"name": "aten::mul/Multiply"}]

    manager = passes.Manager()
    manager.register_pass(ReplaceTensor(packed_layer_tensor_dict_list))
    manager.run_passes(model.model)
    # NPU requires static shapes, so fix the input to batch 1, sequence length 512
    model.reshape(1, 512)


def convert_chat_model(model_type: str, precision: str, model_dir: Path) -> Path:
    """
    Convert chat model

    Params:
        model_type: selected model type and size
        precision: model precision
        model_dir: dir to export model
    Returns:
        Path to exported model
    """
    output_dir = model_dir / model_type
    model_name = MODEL_MAPPING[model_type]

    # if access_token is not None:
    #     os.environ["HUGGING_FACE_HUB_TOKEN"] = access_token

    # load model and convert it to OpenVINO
    model = OVModelForCausalLM.from_pretrained(model_name, export=True, compile=False, load_in_8bit=False)
    # change precision to FP16
    model.half()

    if precision != "fp16":
        # select quantization mode
        quant_config = OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8) if precision == "int4" else OVWeightQuantizationConfig(bits=8, sym=False)
        config = OVConfig(quantization_config=quant_config)

        suffix = "-INT4" if precision == "int4" else "-INT8"
        output_dir = output_dir.with_name(output_dir.name + suffix)

        # create a quantizer
        quantizer = OVQuantizer.from_pretrained(model, task="text-generation")
        # quantize weights and save the model to the output dir
        quantizer.quantize(save_directory=output_dir, weights_only=True, ov_config=config)
    else:
        output_dir = output_dir.with_name(output_dir.name + "-FP16")
        # save converted model
        model.save_pretrained(output_dir)

    # export the tokenizer as well
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.save_pretrained(output_dir)

    return Path(output_dir) / "openvino_model.xml"


def convert_embedding_model(model_type: str, model_dir: Path) -> Path:
    """
    Convert embedding model

    Params:
        model_type: selected model type and size
        model_dir: dir to export model
    Returns:
        Path to exported model
    """
    output_dir = model_dir / model_type
    output_dir = output_dir.with_name(output_dir.name + "-FP32")
    model_name = MODEL_MAPPING[model_type]

    # load model and convert it to OpenVINO
    model = OVModelForFeatureExtraction.from_pretrained(model_name, export=True, compile=False)
    optimize_model_for_npu(model)
    model.save_pretrained(output_dir)

    # export tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.save_pretrained(output_dir)

    return Path(output_dir) / "openvino_model.xml"


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--chat_model_type", type=str, choices=["qwen2-7B"],
                        default="qwen2-7B", help="Chat model to be converted")
    parser.add_argument("--embedding_model_type", type=str, choices=["bge-large"],
                        default="bge-large", help="Embedding model to be converted")
    parser.add_argument("--precision", type=str, default="int4", choices=["fp16", "int8", "int4"], help="Model precision")
    # parser.add_argument("--hf_token", type=str, help="HuggingFace access token")
    parser.add_argument("--model_dir", type=str, default="model", help="Directory to place the model in")

    args = parser.parse_args()
    convert_embedding_model(args.embedding_model_type, Path(args.model_dir))
    convert_chat_model(args.chat_model_type, args.precision, Path(args.model_dir))
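
# Example invocation (matches the command documented in the kit's README):
#   python convert_and_optimize_llm.py --chat_model_type qwen2-7B \
#       --embedding_model_type bge-large --precision int4 --model_dir model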