Commit b9ade6a

add llava notebook with optimum inference (openvinotoolkit#2461)
1 parent a5ce26b commit b9ade6a

10 files changed: +859 -1697 lines

.ci/ignore_convert_execution.txt

+1 -1
@@ -37,7 +37,7 @@ notebooks/llm-rag-langchain/llm-rag-langchain.ipynb
 notebooks/mms-massively-multilingual-speech/mms-massively-multilingual-speech.ipynb
 notebooks/bark-text-to-audio/bark-text-to-audio.ipynb
 notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-genai.ipynb
-notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb
+notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-optimum.ipynb
 notebooks/pix2struct-docvqa/pix2struct-docvqa.ipynb
 notebooks/softvc-voice-conversion/softvc-voice-conversion.ipynb
 notebooks/latent-consistency-models-image-generation/latent-consistency-models-image-generation.ipynb

.ci/ignore_pip_conflicts.txt

-4
@@ -6,9 +6,6 @@ notebooks/yolov8-optimization/yolov8-object-detection.ipynb # ultralytics==8.0.
 notebooks/yolov8-optimization/yolov8-obb.ipynb # ultralytics==8.1.24
 notebooks/llm-chatbot/llm-chatbot.ipynb # nncf@https://github.com/openvinotoolkit/nncf/tree/release_v280
 notebooks/llm-rag-langchain/llm-rag-langchain.ipynb # nncf@https://github.com/openvinotoolkit/nncf/tree/release_v280
-notebooks/bark-text-to-audio/bark-text-to-audio.ipynb # torch==1.13
-notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot.ipynb # transformers<4.35
-notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb # transformers<4.35
 notebooks/paint-by-example/paint-by-example.ipynb # gradio==3.44.1
 notebooks/mobilevlm-language-assistant/mobilevlm-language-assistant.ipynb # transformers<4.35
 notebooks/depth-anything/depth-anything.ipynb # install requirements.txt after clone repo
@@ -22,6 +19,5 @@ notebooks/stable-diffusion-torchdynamo-backend/stable-diffusion-torchdynamo-back
 notebooks/sketch-to-image-pix2pix-turbo/sketch-to-image-pix2pix-turbo.ipynb
 notebooks/yolov10-optimization/yolov10-optimization.ipynb # nncf from git
 notebooks/person-counting-webcam/person-counting.ipynb # numpy should be installed first
-notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb # torchvision < 0.17.0
 notebooks/parler-tts-text-to-speech/parler-tts-text-to-speech.ipynb # torch >= 2.2
 notebooks/stable-diffusion-v3/stable-diffusion-v3.ipynb # diffusers from git

.ci/ignore_treon_docker.txt

+2 -2
@@ -28,8 +28,8 @@ notebooks/tiny-sd-image-generation/tiny-sd-image-generation.ipynb
 notebooks/zeroscope-text2video/zeroscope-text2video.ipynb
 notebooks/mms-massively-multilingual-speech/mms-massively-multilingual-speech.ipynb
 notebooks/bark-text-to-audio/bark-text-to-audio.ipynb
-notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb
-notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot.ipynb
+notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-genai.ipynb
+notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-optimum.ipynb
 notebooks/decidiffusion-image-generation/decidiffusion-image-generation.ipynb
 notebooks/pix2struct-docvqa/pix2struct-docvqa.ipynb
 notebooks/fast-segment-anything/fast-segment-anything.ipynb

.ci/skipped_notebooks.yml

+1 -1
@@ -218,7 +218,7 @@
         - ubuntu-20.04
         - ubuntu-22.04
         - windows-2019
-- notebook: notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb
+- notebook: notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-optimum.ipynb
   skips:
     - os:
         - macos-12

notebooks/README.md

-2
@@ -64,7 +64,6 @@
 - [Create an Agentic RAG using OpenVINO and LlamaIndex](./llm-agent-react/llm-agent-rag-llamaindex.ipynb)
 - [Create Function-calling Agent using OpenVINO and Qwen-Agent](./llm-agent-functioncall/llm-agent-functioncall-qwen.ipynb)
 - [Visual-language assistant with LLaVA Next and OpenVINO](./llava-next-multimodal-chatbot/llava-next-multimodal-chatbot.ipynb)
-- [Visual-language assistant with Video-LLaVA and OpenVINO](./llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb)
 - [Visual-language assistant with LLaVA and OpenVINO Generative API](./llava-multimodal-chatbot/llava-multimodal-chatbot-genai.ipynb)
 - [Text-to-Image Generation with LCM LoRA and ControlNet Conditioning](./latent-consistency-models-image-generation/lcm-lora-controlnet.ipynb)
 - [Latent Consistency Model using Optimum-Intel OpenVINO](./latent-consistency-models-image-generation/latent-consistency-models-optimum-demo.ipynb)
@@ -244,7 +243,6 @@
 - [Create an Agentic RAG using OpenVINO and LlamaIndex](./llm-agent-react/llm-agent-rag-llamaindex.ipynb)
 - [Create Function-calling Agent using OpenVINO and Qwen-Agent](./llm-agent-functioncall/llm-agent-functioncall-qwen.ipynb)
 - [Visual-language assistant with LLaVA Next and OpenVINO](./llava-next-multimodal-chatbot/llava-next-multimodal-chatbot.ipynb)
-- [Visual-language assistant with Video-LLaVA and OpenVINO](./llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb)
 - [Visual-language assistant with LLaVA and OpenVINO Generative API](./llava-multimodal-chatbot/llava-multimodal-chatbot-genai.ipynb)
 - [Text-to-Image Generation with LCM LoRA and ControlNet Conditioning](./latent-consistency-models-image-generation/lcm-lora-controlnet.ipynb)
 - [Latent Consistency Model using Optimum-Intel OpenVINO](./latent-consistency-models-image-generation/latent-consistency-models-optimum-demo.ipynb)

notebooks/llava-multimodal-chatbot/README.md

+8 -24
@@ -10,30 +10,14 @@ While LLaVA excels at image-based tasks, Video-LLaVA expands this fluency to the

 In the field of artificial intelligence, the goal is to create a versatile assistant capable of understanding and executing tasks based on both visual and language inputs. Current approaches often rely on large vision models that solve tasks independently, with language only used to describe image content. While effective, these models have fixed interfaces with limited interactivity and adaptability to user instructions. On the other hand, large language models (LLMs) have shown promise as a universal interface for general-purpose assistants. By explicitly representing various task instructions in language, these models can be guided to switch and solve different tasks. To extend this capability to the multimodal domain, the [LLaVA paper](https://arxiv.org/abs/2304.08485) introduces `visual instruction-tuning`, a novel approach to building a general-purpose visual assistant.

-In this tutorial series we consider how to use LLaVA and Video-LLaVA model to build multimodal chatbot with OpenVINO help.
-
-## LLaVA
-### Notebook contents
-The tutorial consists from following steps:
-
-- Install prerequisites
-- Prepare input processor and tokenizer
-- Download original model
-- Compress model weights to 4 and 8 bits using NNCF
-- Convert model to OpenVINO Intermediate Representation (IR) format
-- Prepare OpenVINO-based inference pipeline
-- Run OpenVINO model
-
-## Video-LLaVA
-### Notebook contents
-The tutorial consists from following steps:
-
-- Install prerequisites
-- Download original model
-- Compress model weights to 4 and 8 bits using NNCF
-- Convert model to OpenVINO Intermediate Representation (IR) format
-- Prepare OpenVINO-based inference pipeline
-- Run OpenVINO model
+In this tutorial series we consider how to use the LLaVA model to build a multimodal chatbot with OpenVINO.
+
+## Visual-language assistant with LLaVA and OpenVINO Generative API
+This [notebook](./llava-multimodal-chatbot-genai.ipynb) demonstrates how to build a visual-language assistant using the [OpenVINO Generative API](https://github.com/openvinotoolkit/openvino.genai).
+
+## Visual-language assistant with LLaVA and Optimum Intel OpenVINO integration
+This [notebook](./llava-multimodal-chatbot-optimum.ipynb) demonstrates how to build a visual-language assistant using the [Optimum Intel](https://huggingface.co/docs/optimum/main/intel/index) OpenVINO integration.
+

 ## Installation instructions
 This is a self-contained example that relies solely on its own code.</br>
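
The two notebooks named in this README are not part of this excerpt, so, for orientation, here is a minimal sketch of each inference path. Everything below is illustrative: the model id, the ./llava_ov directory, and the sample image are assumptions, not contents of this commit.

A sketch of the Optimum Intel path (the approach llava-multimodal-chatbot-optimum.ipynb is built around), assuming optimum-intel with OpenVINO support and transformers are installed:

# Hedged sketch, not the notebook's code; model id and image are assumptions.
from optimum.intel import OVModelForVisualCausalLM
from transformers import AutoProcessor
from PIL import Image

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
model = OVModelForVisualCausalLM.from_pretrained(model_id, export=True, device="CPU")

image = Image.open("cat.png")  # illustrative input image
conversation = [{"role": "user", "content": [{"type": "text", "text": "What is on the image?"}, {"type": "image"}]}]
prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))

And a sketch of the GenAI path (the approach llava-multimodal-chatbot-genai.ipynb is built around), assuming a LLaVA model already converted to OpenVINO IR in ./llava_ov and the VLMPipeline API from openvino-genai:

# Hedged sketch, not the notebook's code; paths are assumptions.
import numpy as np
import openvino as ov
import openvino_genai
from PIL import Image

pipe = openvino_genai.VLMPipeline("./llava_ov", "CPU")
# the pipeline takes the image as an ov.Tensor of raw uint8 RGB data (1 x H x W x 3)
image = ov.Tensor(np.array(Image.open("cat.png").convert("RGB"))[None])
print(pipe.generate("What is on the image?", image=image, max_new_tokens=100))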

notebooks/llava-multimodal-chatbot/gradio_helper.py

+91 -65
@@ -1,10 +1,9 @@
 from pathlib import Path
-from typing import Callable
+import inspect
 import gradio as gr


 from PIL import Image
-from typing import Callable
 import numpy as np
 import requests
 from threading import Event, Thread
@@ -132,70 +130,98 @@ def generate_and_signal_complete():
     return demo


-def make_demo_videollava(fn: Callable):
-    examples_dir = Path("Video-LLaVA/videollava/serve/examples")
-    gr.close_all()
-    demo = gr.Interface(
-        fn=fn,
-        inputs=[
-            gr.Image(label="Input Image", type="filepath"),
-            gr.Video(label="Input Video"),
-            gr.Textbox(label="Question"),
-        ],
-        outputs=gr.Textbox(lines=10),
+def make_demo_llava_optimum(model, processor):
+    from transformers import TextIteratorStreamer
+
+    has_additional_buttons = "undo_button" in inspect.signature(gr.ChatInterface.__init__).parameters
+
+    def bot_streaming(message, history):
+        print(f"message is - {message}")
+        print(f"history is - {history}")
+        files = message["files"] if isinstance(message, dict) else message.files
+        message_text = message["text"] if isinstance(message, dict) else message.text
+        if files:
+            # message["files"][-1] is a Dict or just a string
+            if isinstance(files[-1], dict):
+                image = files[-1]["path"]
+            else:
+                if isinstance(files[-1], (str, Path)):
+                    image = files[-1]
+                else:
+                    image = files[-1] if isinstance(files[-1], (list, tuple)) else files[-1].path
+        else:
+            # if there's no image uploaded for this turn, look for images in the past turns
+            # kept inside tuples, take the last one
+            for hist in history:
+                if isinstance(hist[0], tuple):
+                    image = hist[0][0]
+        try:
+            if image is None:
+                # Handle the case where image is None
+                raise gr.Error("You need to upload an image for LLaVA to work. Close the error and try again with an image.")
+        except NameError:
+            # Handle the case where 'image' is not defined at all
+            raise gr.Error("You need to upload an image for LLaVA to work. Close the error and try again with an image.")
+
+        conversation = []
+        flag = False
+        for user, assistant in history:
+            if assistant is None:
+                # no assistant reply yet: open a user turn and fill in its text on the next pass
+                flag = True
+                conversation.extend([{"role": "user", "content": []}])
+                continue
+            if flag:
+                conversation[0]["content"] = [{"type": "text", "text": f"{user}"}]
+                conversation.append({"role": "assistant", "text": assistant})
+                flag = False
+                continue
+            conversation.extend([{"role": "user", "content": [{"type": "text", "text": user}]}, {"role": "assistant", "text": assistant}])
+
+        conversation.append({"role": "user", "content": [{"type": "text", "text": f"{message_text}"}, {"type": "image"}]})
+        prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
+        print(f"prompt is -\n{prompt}")
+        image = Image.open(image)
+        inputs = processor(text=prompt, images=image, return_tensors="pt")
+
+        streamer = TextIteratorStreamer(
+            processor,
+            **{
+                "skip_special_tokens": True,
+                "skip_prompt": True,
+                "clean_up_tokenization_spaces": False,
+            },
+        )
+        generation_kwargs = dict(
+            inputs,
+            streamer=streamer,
+            max_new_tokens=1024,
+            do_sample=False,
+            temperature=0.0,
+            eos_token_id=processor.tokenizer.eos_token_id,
+        )
+
+        thread = Thread(target=model.generate, kwargs=generation_kwargs)
+        thread.start()
+
+        buffer = ""
+        for new_text in streamer:
+            buffer += new_text
+            yield buffer
+
+    additional_buttons = {}
+    if has_additional_buttons:
+        additional_buttons = {"undo_button": None, "retry_button": None}
+
+    demo = gr.ChatInterface(
+        fn=bot_streaming,
+        title="LLaVA OpenVINO Chatbot",
         examples=[
-            [
-                f"{examples_dir}/extreme_ironing.jpg",
-                None,
-                "What is unusual about this image?",
-            ],
-            [
-                f"{examples_dir}/waterview.jpg",
-                None,
-                "What are the things I should be cautious about when I visit here?",
-            ],
-            [
-                f"{examples_dir}/desert.jpg",
-                None,
-                "If there are factual errors in the questions, point it out; if not, proceed answering the question. What’s happening in the desert?",
-            ],
-            [
-                None,
-                f"{examples_dir}/sample_demo_1.mp4",
-                "Why is this video funny?",
-            ],
-            [
-                None,
-                f"{examples_dir}/sample_demo_3.mp4",
-                "Can you identify any safety hazards in this video?",
-            ],
-            [
-                None,
-                f"{examples_dir}/sample_demo_9.mp4",
-                "Describe the video.",
-            ],
-            [
-                None,
-                f"{examples_dir}/sample_demo_22.mp4",
-                "Describe the activity in the video.",
-            ],
-            [
-                f"{examples_dir}/sample_img_22.png",
-                f"{examples_dir}/sample_demo_22.mp4",
-                "Are the instruments in the pictures used in the video?",
-            ],
-            [
-                f"{examples_dir}/sample_img_13.png",
-                f"{examples_dir}/sample_demo_13.mp4",
-                "Does the flag in the image appear in the video?",
-            ],
-            [
-                f"{examples_dir}/sample_img_8.png",
-                f"{examples_dir}/sample_demo_8.mp4",
-                "Are the image and the video depicting the same place?",
-            ],
+            {"text": "What is on the flower?", "files": ["./bee.jpg"]},
+            {"text": "How to make this pastry?", "files": ["./baklava.png"]},
         ],
-        title="Video-LLaVA🚀",
-        allow_flagging="never",
+        stop_btn=None,
+        multimodal=True,
+        **additional_buttons,
     )
     return demo
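
The new helper takes an already-loaded model and processor instead of a generation callback, so wiring it up from the notebook would look roughly like the following. This is a sketch under the assumption that the optimum notebook loads the model through OVModelForVisualCausalLM from a locally exported ./llava_ov directory; neither detail is shown in this excerpt.

# Hypothetical wiring of the new demo helper; paths and model id are assumptions.
from transformers import AutoProcessor
from optimum.intel import OVModelForVisualCausalLM
from gradio_helper import make_demo_llava_optimum

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")  # assumed checkpoint
model = OVModelForVisualCausalLM.from_pretrained("./llava_ov", device="CPU")  # previously exported IR

demo = make_demo_llava_optimum(model, processor)
demo.launch()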
