
Commit b6103d7

adrianboguszewski, RyanMetcalfeInt8, and riacheruvu authored Feb 11, 2025
Made multimodal kit consistent with the template (openvinotoolkit#174)
* Renamed directory for data
* Addition of AI Adventure Experience kit (openvinotoolkit#180)
* add initial ai_adventure_experience kit
* Update ai_adventure_experience README.md
* Update ai_adventure_experience README.md with GIF
* Update README.md
* Use channel-wise quantization for int4 llama3 model
* README.md: Add portaudio19-dev to apt install
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Pinning dependencies for AI Visual Gaming kit
* Pinning dependencies for AI Visual Gaming kit (openvinotoolkit#181)
* Fixed path
* Consolidating requirements.txt files for Multimodal AI kit
* Removing extraneous files from Multimodal AI kit (openvinotoolkit#191)
* Updated the kit
* Added new screenshot to main readme
* Use llama3.2
* Update action.yml
* Added new qt group for CI
* Update action.yml
* Update action.yml
* Fixed no GPU device
* Update action.yml

---------

Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: riacheruvu <ria.cheruvu@intel.com>
Co-authored-by: Ria Cheruvu <riasgenius@gmail.com>
1 parent ad555ec commit b6103d7

30 files changed: +1147 −3994 lines
 

.github/reusable-steps/categorize-projects/action.yml (+8)

@@ -12,6 +12,8 @@ outputs:
     value: ${{ steps.group-subprojects.outputs.gradio }}
   webcam:
     value: ${{ steps.group-subprojects.outputs.webcam }}
+  qt:
+    value: ${{ steps.group-subprojects.outputs.qt }}
   js:
     value: ${{ steps.group-subprojects.outputs.js }}

@@ -26,6 +28,7 @@ runs:
         python=()
         gradio=()
         webcam=()
+        qt=()
         js=()

         for dir in ${{ inputs.subprojects }}; do
@@ -35,6 +38,8 @@ runs:
             notebook+=("$dir")
           elif [ -f "$dir/requirements.txt" ] && grep -q "gradio" "$dir/requirements.txt"; then
             gradio+=("$dir")
+          elif [ -f "$dir/requirements.txt" ] && grep -iq "pyside" "$dir/requirements.txt"; then
+            qt+=("$dir")
           elif [ -f "$dir/main.py" ] && grep -q -- "--stream" "$dir/main.py"; then
             webcam+=("$dir")
           else
@@ -46,12 +51,14 @@ runs:
         python_json=$(printf '%s\n' "${python[@]}" | jq -R -s -c 'split("\n") | map(select(length > 0))')
         gradio_json=$(printf '%s\n' "${gradio[@]}" | jq -R -s -c 'split("\n") | map(select(length > 0))')
         webcam_json=$(printf '%s\n' "${webcam[@]}" | jq -R -s -c 'split("\n") | map(select(length > 0))')
+        qt_json=$(printf '%s\n' "${qt[@]}" | jq -R -s -c 'split("\n") | map(select(length > 0))')
         js_json=$(printf '%s\n' "${js[@]}" | jq -R -s -c 'split("\n") | map(select(length > 0))')

         echo "notebook=$notebook_json" >> $GITHUB_OUTPUT
         echo "python=$python_json" >> $GITHUB_OUTPUT
         echo "gradio=$gradio_json" >> $GITHUB_OUTPUT
         echo "webcam=$webcam_json" >> $GITHUB_OUTPUT
+        echo "qt=$qt_json" >> $GITHUB_OUTPUT
         echo "js=$js_json" >> $GITHUB_OUTPUT
     - name: Print subprojects to test
       shell: bash
@@ -60,4 +67,5 @@ runs:
         echo "Python subprojects: ${{ steps.group-subprojects.outputs.python }}"
         echo "Gradio subprojects: ${{ steps.group-subprojects.outputs.gradio }}"
         echo "Webcam subprojects: ${{ steps.group-subprojects.outputs.webcam }}"
+        echo "Qt subprojects: ${{ steps.group-subprojects.outputs.qt }}"
         echo "JS subprojects: ${{ steps.group-subprojects.outputs.js }}"

.github/reusable-steps/gradio-action/action.yml (+7 −3)

@@ -18,9 +18,13 @@ runs:
       run: |
         cd ${{ inputs.project }}

-        # Start the Gradio app in the background
-        python ${{ inputs.script }} 2>&1 | tee gradio_log.txt &
-
+        if [ "${{ runner.os }}" == "Linux" ]; then
+          # Start the Gradio app in the background
+          xvfb-run python ${{ inputs.script }} 2>&1 | tee gradio_log.txt &
+        else
+          python ${{ inputs.script }} 2>&1 | tee gradio_log.txt &
+        fi
+
         # Assign process ID
         app_pid=$(ps aux | grep -i '[p]ython main.py' | awk '{print $2}')
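On Linux runners the app is now launched under `xvfb-run`, so PySide/Qt windows render into a virtual framebuffer instead of requiring a physical display. A rough local equivalent for smoke-testing a Qt kit headlessly on Ubuntu (the 120-second window and the log file name are arbitrary choices, not part of this action):

```shell
sudo apt-get install -y xvfb                     # provides xvfb-run
xvfb-run -a python main.py > app_log.txt 2>&1 &  # -a picks a free display number
sleep 120                                        # let the app start and run briefly
pkill -f "python main.py" || true                # stop it, mirroring the action's PID lookup
```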

.github/reusable-steps/setup-os/action.yml (+2 −2)

@@ -15,12 +15,12 @@ runs:
       dotnet: true
       haskell: true
       docker-images: true
-  - name: Install OpenCL (Ubuntu only)
+  - name: Install OpenCL and EGL (Ubuntu only)
     if: runner.os == 'Linux'
     shell: bash
     run: |
       sudo apt-get update
-      sudo apt-get install -y ocl-icd-opencl-dev
+      sudo apt-get install -y ocl-icd-opencl-dev libegl1 libgles2 mesa-utils libxcb-cursor0 libxcb-xinerama0 libxcb-util1 libxcb-keysyms1 libxcb-randr0 libxkbcommon-x11-0 libegl1-mesa-dev
   - name: Install coreutils (macOS only)
     if: runner.os == 'macOS'
     shell: bash
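The extra packages cover the EGL/GLES and xcb libraries that Qt applications typically load at runtime on Linux. A quick diagnostic sketch (not part of the workflow) to confirm they are resolvable on an Ubuntu machine:

```shell
# List the relevant shared libraries known to the dynamic linker.
ldconfig -p | grep -E 'libEGL|libGLES|libxkbcommon-x11|libxcb-cursor'
```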

.github/workflows/sanity-check-kits.yml (+32)

@@ -20,6 +20,7 @@ jobs:
   find-subprojects:
     runs-on: ubuntu-latest
     outputs:
+      qt: ${{ steps.categorize-subprojects.outputs.qt }}
       gradio: ${{ steps.categorize-subprojects.outputs.gradio }}
       webcam: ${{ steps.categorize-subprojects.outputs.webcam }}
       python: ${{ steps.categorize-subprojects.outputs.python }}
@@ -41,6 +42,37 @@ jobs:
       with:
         subprojects: ${{ steps.find-updates.outputs.subproject_dirs }}

+  qt:
+    needs: find-subprojects
+    if: ${{ needs.find-subprojects.outputs.qt != '[]' }}
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ubuntu-latest, windows-latest, macos-latest]
+        python: ["3.10", "3.12"]
+        subproject: ${{ fromJson(needs.find-subprojects.outputs.qt) }}
+    steps:
+      - uses: actions/checkout@v4
+      - uses: ./.github/reusable-steps/setup-os
+      - name: Set up Python ${{ matrix.python }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python }}
+      - uses: ./.github/reusable-steps/setup-python
+        with:
+          python: ${{ matrix.python }}
+          project: ${{ matrix.subproject }}
+      - name: Login to HF
+        shell: bash
+        run: |
+          huggingface-cli login --token ${{ secrets.HF_TOKEN }}
+      - uses: ./.github/reusable-steps/gradio-action
+        with:
+          script: main.py
+          project: ${{ matrix.subproject }}
+          timeout: 3600
+
   gradio:
     needs: find-subprojects
     if: ${{ needs.find-subprojects.outputs.gradio != '[]' }}

ai_ref_kits/README.md (+1 −1)

@@ -96,7 +96,7 @@ The Custom AI Assistant, powered by the OpenVINO™ toolkit, integrates voice-ac
 Understanding why computer vision models make certain predictions using data and model explainability can help us refine our models to be more efficient and performant. This solution demonstrates how to leverage the OpenVINO™ toolkit, Datumaro, and Ultralytics to generate data quality measurements and saliency maps to understand the predictions and performance of computer vision models during inference.

 ### 🖼️ Multimodal AI Visual Generator
-![multimodal-ai-visual-generator](https://github.com/user-attachments/assets/f113a126-4b44-4488-be4e-e4bf52a6cebc)
+![multimodal-ai-visual-generator](https://github.com/user-attachments/assets/2144ae33-9e41-4e48-9992-ddec17ef5579)

 | [Multimodal AI Visual Generator](multimodal_ai_visual_generator) | |
 | - | - |
ai_ref_kits/multimodal_ai_visual_generator/README.md

@@ -1,18 +1,17 @@
 <div id="top" align="center">
-  <h1>Multimodal AI Visual Generator with OpenVINO™ Toolkit</h1>
+  <h1>AI Adventure Experience with OpenVINO™ GenAI</h1>
   <h4>
     <a href="https://www.intel.com/content/www/us/en/developer/topic-technology/edge-5g/open-potential.html">🏠&nbsp;About&nbsp;the&nbsp;Kits&nbsp;·</a>
-    <a href="https://www.youtube.com/watch?v=kn1jZ2nLFMY">👨‍💻&nbsp;Code&nbsp;Demo&nbsp;Video&nbsp;·</a>
   </h4>
 </div>

 [![Apache License Version 2.0](https://img.shields.io/badge/license-Apache_2.0-green.svg)](https://github.com/openvinotoolkit/openvino_build_deploy/blob/master/LICENSE.txt)

-The Multimodal AI Visual Generator is designed for rapid prototyping, instant iteration, and seamless visualization of complex concepts. The kit integrates image creation with generative AI, automatic speech recognition (ASR), speech synthesis, large language models (LLMs), and natural language processing (NLP). It processes multimodal inputs from sources such as cameras, voice commands, or typed text to generate AI-driven visual outputs. Utilizing the Intel OpenVINO™ toolkit, the system enables seamless deployment of deep learning models across hardware platforms. Explore the demo to see its real-time visual generative AI workflow in action.
+The kit integrates image creation with generative AI, voice activity detection (VAD), automatic speech recognition (ASR), large language models (LLMs), and natural language processing (NLP). A live voice transcription pipeline is connected to an LLM, which makes intelligent decisions about whether the user is describing a scene for an adventure game. When the LLM detects a new scene, it produces a detailed text prompt suitable for stable diffusion, which the application uses to generate the illustration. Utilizing the OpenVINO™ GenAI framework, this kit demonstrates the use of the text2image, LLM pipeline, and Whisper speech2text APIs.

 This kit uses the following technology stack:
 - [OpenVINO Toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html) ([docs](https://docs.openvino.ai/))
-- [nanoLLaVA (multimodal)](https://huggingface.co/qnguyen3/nanoLLaVA)
+- [OpenVINO GenAI](https://github.com/openvinotoolkit/openvino.genai)
 - [Whisper](https://github.com/openai/whisper)
 - [Llama3-8b-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
 - [Single Image Super Resolution](https://arxiv.org/abs/1807.06779)
@@ -21,9 +20,9 @@ This kit uses the following technology stack:

 Check out our [AI Reference Kits repository](/) for other kits.

-![kit-gif](https://github.com/user-attachments/assets/f113a126-4b44-4488-be4e-e4bf52a6cebc)
+![ai_adventure_experience_desert](https://github.com/user-attachments/assets/2144ae33-9e41-4e48-9992-ddec17ef5579)

-Contributors: Ria Cheruvu, Garth Long, Arisha Kumar, Paula Ramos, Dmitriy Pastushenkov, Zhuo Wu, and Raymond Lo.
+Contributors: Ryan Metcalfe, Garth Long, Arisha Kumar, Ria Cheruvu, Paula Ramos, Dmitriy Pastushenkov, Zhuo Wu, and Raymond Lo.

 ### What's New

@@ -44,18 +43,18 @@ Now, let's dive into the steps starting with installing Python.

 ## Installing Prerequisites

-Now, let's dive into the steps starting with installing Python. We recommend using Ubuntu to set up and run this project. This project requires Python 3.8 or higher and a few libraries. If you don't have Python installed on your machine, go to https://www.python.org/downloads/ and download the latest version for your operating system. Follow the prompts to install Python, making sure to check the option to add Python to your PATH environment variable.
+Now, let's dive into the steps starting with installing Python. This project requires Python 3.10 or higher and a few libraries. If you don't have Python installed on your machine, go to https://www.python.org/downloads/ and download the latest version for your operating system. Follow the prompts to install Python, making sure to check the option to add Python to your PATH environment variable.

 Install libraries and tools:

+If you're using Ubuntu, install the required dependencies like this:
 ```shell
-sudo apt install git git-lfs gcc python3-venv python3-dev
+sudo apt install git git-lfs gcc python3-venv python3-dev portaudio19-dev
 ```
-
 _NOTE: If you are using Windows, you will probably need to install [Microsoft Visual C++ Redistributable](https://aka.ms/vs/16/release/vc_redist.x64.exe) also._

 ## Setting Up Your Environment
-### Cloning the Repository
+### Cloning the Repository and Installing Dependencies

 To clone the repository, run the following command:

@@ -69,58 +68,37 @@ The above will clone the repository into a directory named "openvino_build_deplo
 cd openvino_build_deploy/ai_ref_kits/multimodal_ai_visual_generator
 ```

-Next, you’ll download and optimize the required models. This will involve the creation of a temporary virtual environment and the running of a download script. Your requirements.txt file will depend on the Python version you're using (3.11 or 3.12).
+Next, the commands below will create a virtual environment, activate it, and install the dependencies required to set up and run the project.

-- nanoLLaVA (multimodal): Image recognition/captioning from webcam
-- Whisper: Speech recognition
-- Llama3-8b-instruct: Prompt refinement
-- Latent Consistency Models: Image generation
-
-**Note:** If you would like to run Latent Consistency Models on the NPU, as shown in the demo above, please follow the following steps: Download the model from this location "https://huggingface.co/Intel/sd-1.5-lcm-openvino" and compile it via the steps located at https://github.com/intel/openvino-ai-plugins-gimp/blob/v2.99-R3-staging/model_setup.py.
-
-- AI Super Resolution: Increase the resolution of the generated image
-- Depth Anything v2: Create 3d parallax animations
-
+Linux:
 ```shell
-python3 -m venv model_installation_venv
-source model_installation_venv/bin/activate
+python3 -m venv run_env
+source run_env/bin/activate
 pip install -r requirements.txt
-python3 download_and_prepare_models.py
-```
-After model installation, you can remove the `model_installation_venv` virtual environment as it is no longer needed.
-
-### Creating a Virtual Environment
-
-To create a virtual environment, open your terminal or command prompt and navigate to the directory where you want to create the environment. Then, run the following command:
-
-```shell
-python3 -m dnd_env
 ```
-This will create a new virtual environment named "dnd_env" in the current directory.
-
-### Activating the Environment
-
-Activate the virtual environment using the following command:

+Windows:
 ```shell
-source dnd_env/bin/activate # For Unix-based operating systems such as Linux or macOS
-```
-
-_NOTE: If you are using Windows, use the `dnd_env\Scripts\activate` command instead._
-
-This will activate the virtual environment and change your shell's prompt to indicate that you are now working within that environment.
+python -m venv run_env
+run_env/Scripts/activate
+pip install -r requirements.txt
+```

-### Installing the Packages
+### Downloading and Preparing Models
+Next, you’ll download and optimize the required models by running a download script.

-To install the required packages, run the following commands:
+- Whisper: Speech recognition
+- Llama3-8b-instruct: Intelligent LLM helper
+- Latent Consistency Models: Image generation
+- Super Resolution: Increase the resolution of the generated image
+- Depth Anything v2: Create 3D parallax animations

+To run the download script:
 ```shell
-pip install -r requirements.txt
-pip install "openai-whisper==20231117" --extra-index-url https://download.pytorch.org/whl/cpu
-```
-
+python3 download_and_prepare_models.py
+cd ..
+```
 ## Running the Application
-![SIGGRAPH Drawing](https://github.com/user-attachments/assets/3ce58b50-4ee9-4dae-aeb6-0af5368a3ddd)

 To interact with the animated GIF outputs, host a simple web server on your system as the final output. To do so, please install Node.js via [its Download page](https://nodejs.org/en/download/package-manager) and [http-server](https://www.npmjs.com/package/http-server).

@@ -130,38 +108,62 @@ Run the following command to start an HTTP server within the repository. You can
 http-server -c10
 ```

-Open a terminal or you can use the existing one with `dnd_env` environment activated and start the Gradio GUI - <br>
+Open a terminal, or use the existing one with the `run_env` environment activated, and start the GUI - <br>

 ```shell
-python3 gradio_ui.py
+python app.py
 ```

-Click on the web link to open the GUI in the web browser.
+![UI Drawing](https://github.com/user-attachments/assets/4f37f4d1-31c1-4534-82eb-d370fe29873a)

-![demo screenshot](https://github.com/user-attachments/assets/ddfea7f0-3f1d-4d1c-b356-3bc959a23837)

-### 📷 Submit a picture
-Take or upload a picture of any object via the Gradio image interface. Your "theme" will become the image description, if the object in the image is clearly captured.
+### ➕ Set the theme for your story
+This theme is passed as part of the system message to the LLM, and helps the LLM make a more educated decision about whether or not you are describing a scene of your story.

-### 🗣 Speak your prompt
-Start or upload a recording, wait for the server to listen, and speak your prompt to life. Click the “Stop” button to stop the generation.
+### ➕ Click the Start Button
+The start button will activate the listening state (Voice Activity Detection & Whisper transcription pipelines) on the system's default input device (microphone).

-### ➕ Add a theme to prompt
-Now, your prompt is transcribed! Click the "Add Theme to Prompt" button to combine your prompt and theme.
+### 🗣 Describe a scene to your story
+Go ahead and describe a scene of your story. For example, "You find yourself at the gates of a large, abandoned castle."

-### ⚙️ Refine it with an LLM
-You can optionally ask an LLM model to refine your model by clicking the LLM button. It will try its best to generate a prompt infusing the elements.
+### 🖼️ Wait for your illustration
+The scene that you just described will be passed to the LLM, which should detect it as a new scene of your story. The detailed prompt generated by the LLM will show up in real time in the UI caption box, followed soon after by the illustration generated by the stable diffusion pipeline.

-### 🖼️ Generate your image and depth map
-Click "Generate Image" to see your image come to life. A depth map will automatically be generated for the image as well. Feel free to adjust the advanced parameters to control the image generation model.
+### 🗣 Talk about something not relevant to your story
+You can test the intelligence of the LLM helper and say something not relevant to the story. For example, "Hey guys, do you think we should order a pizza?" You should find that the LLM decides to disregard this and does not try to illustrate anything.

 ### 🪄🖼️ Interact with the animated GIF
 To interact with the 3D hoverable animation created with depth maps, start an HTTP server as explained above, and you will be able to interact with the parallax.

-<p align="right"><a href="#top">Back to top ⬆️</a></p>
+## :bulb: Additional Tips
+* Feel free to modify `main.py` to select different OpenVINO devices for the LLM, stable diffusion pipeline, Whisper, etc.
+Look toward the bottom of the script for a section that looks like this:
+```
+if __name__ == "__main__":
+    app = QApplication(sys.argv)
+
+    llm_device = 'GPU'
+    sd_device = 'GPU'
+    whisper_device = 'CPU'
+    super_res_device = 'GPU'
+    depth_anything_device = 'GPU'
+```
+If you're running on an Intel Core Ultra Series 2 laptop and you want to set `llm_device = 'NPU'`, be sure to have the latest NPU driver installed, from [here](https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html).
+
+* Based on the resolution of your display, you may want to tweak the default resolution of the illustrated image, as well as the caption font size.
+To adjust the resolution of the illustrated image, look for and modify this line:
+```
+self.image_label.setFixedSize(1216, 684)
+```
+It's recommended to choose a 16:9 ratio resolution. You can find a convenient list [here](https://pacoup.com/2011/06/12/list-of-true-169-resolutions/).
+
+The caption font size can be adjusted by modifying this line:
+```
+fantasy_font = QFont("Papyrus", 18, QFont.Bold)
+```

 # Additional Resources
 - Learn more about [OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
 - Explore [OpenVINO’s documentation](https://docs.openvino.ai/2023.0/home.html)

-<p align="right"><a href="#top">Back to top ⬆️</a></p>
+<p align="right"><a href="#top">Back to top ⬆️</a></p>
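For the device assignments shown in the Additional Tips above, it can help to confirm which OpenVINO devices are actually visible on the machine before editing `main.py`. A minimal check, assuming the `run_env` environment with the kit's requirements installed:

```shell
# Print the devices OpenVINO can see on this machine, e.g. ['CPU', 'GPU', 'NPU'].
python -c "import openvino as ov; print(ov.Core().available_devices)"
```

Only device names that appear in this list can be assigned to `llm_device`, `sd_device`, `whisper_device`, and the other variables.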
