Commit 7e95076

Add notebook for llava multimodal chatbot (openvinotoolkit#1353)
* Add notebook for llava multimodal chatbot
* readme, text, requirements
* spell check
* gradio
* code style
* fix readme
* refactoring and better explanation
* Update README.md
1 parent 00e566e commit 7e95076

9 files changed, +1301 -1 lines changed

.ci/ignore_convert_execution.txt

+1
@@ -28,4 +28,5 @@ notebooks/248-stable-diffusion-xl/248-stable-diffusion-xl.ipynb
 notebooks/249-oneformer-segmentation/249-oneformer-segmentation.ipynb
 notebooks/251-tiny-sd-image-generation/251-tiny-sd-image-generation.ipynb
 notebooks/254-llm-chatbot/254-llm-chatbot.ipynb
+notebooks/257-llava-multimodal-chatbot/257-llava-multimodal-chatbot.ipynb
 notebooks/404-style-transfer-webcam/404-style-transfer.ipynb

.ci/ignore_treon_docker.txt

+1
@@ -18,6 +18,7 @@
 254-llm-chatbot
 255-mms-massively-multilingual-speech
 256-bark-text-to-audio
+257-llava-multimodal-chatbot
 301-tensorflow-training-openvino
 305-tensorflow-quantization-aware-training
 404-style-transfer-webcam

.ci/ignore_treon_linux.txt

+1
@@ -20,4 +20,5 @@
 254-llm-chatbot
 255-mms-massively-multilingual-speech
 256-bark-text-to-audio
+257-llava-multimodal-chatbot
 404-style-transfer-webcam

.ci/ignore_treon_mac.txt

+1
@@ -20,4 +20,5 @@
 254-llm-chatbot
 255-mms-massively-multilingual-speech
 256-bark-text-to-audio
+257-llava-multimodal-chatbot
 404-style-transfer-webcam

.ci/ignore_treon_win.txt

+1
@@ -22,3 +22,4 @@
 254-llm-chatbot
 255-mms-massively-multilingual-speech
 256-bark-text-to-audio
+257-llava-multimodal-chatbot

.ci/spellcheck/.pyspelling.wordlist.txt

+3-1
@@ -240,13 +240,15 @@ LaBSE
 LAION
 Lasinger
 latents
-LLaMa
 LeViT
 LibriSpeech
 librispeech
 Lim
 Liu
 LLama
+LLaMa
+LLaVA
+llava
 llm
 LLM
 LLMs

README.md

+2
@@ -32,6 +32,7 @@ Check out the latest notebooks that show how to optimize and deploy popular mode
 | [ZeroScope Text-to-video synthesis](notebooks/253-zeroscope-text2video)<br> | Text-to-video synthesis with ZeroScope and OpenVINO™ | A panda eating bamboo on a rock <img src="https://github.com/itrushkin/openvino_notebooks/assets/76161256/500956d5-4aac-4710-a77c-4df34bcda3be" width=300> |
 | [LLM chatbot](notebooks/254-llm-chatbot)<br> | Create LLM-powered Chatbot using OpenVINO™ | <img src=https://user-images.githubusercontent.com/29454499/255799218-611e7189-8979-4ef5-8a80-5a75e0136b50.png width=600> |
 | [Bark Text-to-Speech](notebooks/256-bark-text-to-audio/)<br> | Text-to-Speech generation using Bark and OpenVINO™ | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/9a770279-0045-480e-95f2-1a2f2d0a5115 width=300>
+| [LLaVA Multimodal Chatbot](notebooks/257-llava-multimodal-chatbot/)<br> | Visual-language assistant with LLaVA and OpenVINO™ | <img src=https://raw.githubusercontent.com/haotian-liu/LLaVA/main/images/llava_logo.png width=300>
 
 ## Table of Contents
 
@@ -186,6 +187,7 @@ Demos that demonstrate inference on a particular model.
 | [254-llm-chatbot](notebooks/254-llm-chatbot)<br> | Create LLM-powered Chatbot using OpenVINO™ | <img src=https://user-images.githubusercontent.com/29454499/255799218-611e7189-8979-4ef5-8a80-5a75e0136b50.png width=600> |
 | [255-mms-massively-multilingual-speech](notebooks/255-mms-massively-multilingual-speech/)<br> | MMS: Scaling Speech Technology to 1000+ languages with OpenVINO™ | |
 | [256-bark-text-to-audio](notebooks/256-bark-text-to-audio)<br> | Text-to-Speech generation using Bark and OpenVINO™ | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/9a770279-0045-480e-95f2-1a2f2d0a5115 width=225> |
+| [257-llava-multimodal-chatbot](notebooks/257-llava-multimodal-chatbot)<br> | Visual-language assistant with LLaVA and OpenVINO™ | <img src=https://raw.githubusercontent.com/haotian-liu/LLaVA/main/images/llava_logo.png width=225> |
 
 <div id='-model-training'></div>
 
notebooks/257-llava-multimodal-chatbot/257-llava-multimodal-chatbot.ipynb

+1,264
Large diffs are not rendered by default.
@@ -0,0 +1,27 @@
+# Visual-language assistant with LLaVA and OpenVINO
+
+![llava_logo.png](https://raw.githubusercontent.com/haotian-liu/LLaVA/main/images/llava_logo.png)
+
+*image source: [LLaVA repository](https://github.com/haotian-liu/LLaVA/blob/main/images/llava_logo.png)*
+
+[LLaVA](https://llava-vl.github.io) (Large Language and Vision Assistant) is a large multimodal model that aims to serve as a general-purpose visual assistant that can follow both language and image instructions to complete various real-world tasks. The idea is to combine the power of large language models (LLMs) with vision encoders like CLIP to create an end-to-end trained neural assistant that understands and acts upon multimodal instructions.
+
+In the field of artificial intelligence, the goal is to create a versatile assistant capable of understanding and executing tasks based on both visual and language inputs. Current approaches often rely on large vision models that solve tasks independently, with language used only to describe image content. While effective, these models have fixed interfaces with limited interactivity and adaptability to user instructions. On the other hand, large language models (LLMs) have shown promise as a universal interface for general-purpose assistants. By explicitly representing various task instructions in language, these models can be guided to switch between and solve different tasks. To extend this capability to the multimodal domain, the [LLaVA paper](https://arxiv.org/abs/2304.08485) introduces `visual instruction-tuning`, a novel approach to building a general-purpose visual assistant.
+
+This tutorial shows how to use the LLaVA model to build a multimodal chatbot with OpenVINO.
+
+## Notebook contents
+The tutorial consists of the following steps:
+
+- Install prerequisites
+- Prepare input processor and tokenizer
+- Download original model
+- Compress model weights to INT8 using NNCF
+- Convert model to OpenVINO Intermediate Representation (IR) format
+- Prepare OpenVINO-based inference pipeline
+- Run OpenVINO model
+
+## Installation instructions
+This is a self-contained example that relies solely on its own code.<br>
+We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
+For details, refer to the [Installation Guide](../../README.md).
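
The "Compress model weights to INT8" step listed in the notebook contents can be illustrated with a small, dependency-free sketch. It is not NNCF's algorithm and does not call NNCF; it only shows the general idea of storing each weight row as 8-bit integers plus one floating-point scale. The function names (`compress_row_int8`, `decompress_row`, `compress_weights_int8`) and the sample matrix are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

Row = List[float]


def compress_row_int8(row: Row) -> Tuple[List[int], float]:
    """Quantize one row of float weights to integers in [-127, 127] with a single scale."""
    max_abs = max((abs(w) for w in row), default=0.0)
    # An all-zero row gets a dummy scale of 1.0 to avoid division by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in row]
    return quantized, scale


def decompress_row(quantized: List[int], scale: float) -> Row:
    """Recover approximate float weights from the integers and the row scale."""
    return [q * scale for q in quantized]


def compress_weights_int8(matrix: List[Row]) -> List[Tuple[List[int], float]]:
    """Apply per-row symmetric quantization to a weight matrix (illustrative only)."""
    return [compress_row_int8(row) for row in matrix]


# Hypothetical weights, not taken from any real model.
weights = [[0.5, -1.0, 0.25], [0.0, 0.0, 0.0], [2.0, 0.1, -0.3]]
compressed = compress_weights_int8(weights)
restored = [decompress_row(q, s) for q, s in compressed]

# Largest absolute difference between original and restored weights.
max_error = max(
    abs(a - b)
    for row_a, row_b in zip(weights, restored)
    for a, b in zip(row_a, row_b)
)
print(f"max reconstruction error: {max_error:.6f}")
```

Each weight then takes one byte instead of four (for FP32), at the cost of a rounding error of at most half a scale step per weight. Production tools typically add refinements such as asymmetric ranges or finer-grained scales, so treat this only as a mental model of the step.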
