Commit b9ade6a

add llava notebook with optimum inference (openvinotoolkit#2461)
1 parent a5ce26b commit b9ade6a

10 files changed: +859 -1697 lines

.ci/ignore_convert_execution.txt

+1 -1
@@ -37,7 +37,7 @@ notebooks/llm-rag-langchain/llm-rag-langchain.ipynb
 notebooks/mms-massively-multilingual-speech/mms-massively-multilingual-speech.ipynb
 notebooks/bark-text-to-audio/bark-text-to-audio.ipynb
 notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-genai.ipynb
-notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb
+notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-optimum.ipynb
 notebooks/pix2struct-docvqa/pix2struct-docvqa.ipynb
 notebooks/softvc-voice-conversion/softvc-voice-conversion.ipynb
 notebooks/latent-consistency-models-image-generation/latent-consistency-models-image-generation.ipynb

.ci/ignore_pip_conflicts.txt

-4
@@ -6,9 +6,6 @@ notebooks/yolov8-optimization/yolov8-object-detection.ipynb # ultralytics==8.0.
 notebooks/yolov8-optimization/yolov8-obb.ipynb # ultralytics==8.1.24
 notebooks/llm-chatbot/llm-chatbot.ipynb # nncf@https://github.com/openvinotoolkit/nncf/tree/release_v280
 notebooks/llm-rag-langchain/llm-rag-langchain.ipynb # nncf@https://github.com/openvinotoolkit/nncf/tree/release_v280
-notebooks/bark-text-to-audio/bark-text-to-audio.ipynb # torch==1.13
-notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot.ipynb # transformers<4.35
-notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb # transformers<4.35
 notebooks/paint-by-example/paint-by-example.ipynb # gradio==3.44.1
 notebooks/mobilevlm-language-assistant/mobilevlm-language-assistant.ipynb # transformers<4.35
 notebooks/depth-anything/depth-anything.ipynb # install requirements.txt after clone repo
@@ -22,6 +19,5 @@ notebooks/stable-diffusion-torchdynamo-backend/stable-diffusion-torchdynamo-back
 notebooks/sketch-to-image-pix2pix-turbo/sketch-to-image-pix2pix-turbo.ipynb
 notebooks/yolov10-optimization/yolov10-optimization.ipynb # nncf from git
 notebooks/person-counting-webcam/person-counting.ipynb # numpy should be installed first
-notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb # torchvision < 0.17.0
 notebooks/parler-tts-text-to-speech/parler-tts-text-to-speech.ipynb # torch >= 2.2
 notebooks/stable-diffusion-v3/stable-diffusion-v3.ipynb # diffusers from git

.ci/ignore_treon_docker.txt

+2 -2
@@ -28,8 +28,8 @@ notebooks/tiny-sd-image-generation/tiny-sd-image-generation.ipynb
 notebooks/zeroscope-text2video/zeroscope-text2video.ipynb
 notebooks/mms-massively-multilingual-speech/mms-massively-multilingual-speech.ipynb
 notebooks/bark-text-to-audio/bark-text-to-audio.ipynb
-notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb
-notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot.ipynb
+notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-genai.ipynb
+notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-optimum.ipynb
 notebooks/decidiffusion-image-generation/decidiffusion-image-generation.ipynb
 notebooks/pix2struct-docvqa/pix2struct-docvqa.ipynb
 notebooks/fast-segment-anything/fast-segment-anything.ipynb

.ci/skipped_notebooks.yml

+1 -1
@@ -218,7 +218,7 @@
         - ubuntu-20.04
         - ubuntu-22.04
         - windows-2019
-- notebook: notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb
+- notebook: notebooks/llava-multimodal-chatbot/llava-multimodal-chatbot-optimum.ipynb
   skips:
     - os:
         - macos-12

notebooks/README.md

-2
@@ -64,7 +64,6 @@
 - [Create an Agentic RAG using OpenVINO and LlamaIndex](./llm-agent-react/llm-agent-rag-llamaindex.ipynb)
 - [Create Function-calling Agent using OpenVINO and Qwen-Agent](./llm-agent-functioncall/llm-agent-functioncall-qwen.ipynb)
 - [Visual-language assistant with LLaVA Next and OpenVINO](./llava-next-multimodal-chatbot/llava-next-multimodal-chatbot.ipynb)
-- [Visual-language assistant with Video-LLaVA and OpenVINO](./llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb)
 - [Visual-language assistant with LLaVA and OpenVINO Generative API](./llava-multimodal-chatbot/llava-multimodal-chatbot-genai.ipynb)
 - [Text-to-Image Generation with LCM LoRA and ControlNet Conditioning](./latent-consistency-models-image-generation/lcm-lora-controlnet.ipynb)
 - [Latent Consistency Model using Optimum-Intel OpenVINO](./latent-consistency-models-image-generation/latent-consistency-models-optimum-demo.ipynb)
@@ -244,7 +243,6 @@
 - [Create an Agentic RAG using OpenVINO and LlamaIndex](./llm-agent-react/llm-agent-rag-llamaindex.ipynb)
 - [Create Function-calling Agent using OpenVINO and Qwen-Agent](./llm-agent-functioncall/llm-agent-functioncall-qwen.ipynb)
 - [Visual-language assistant with LLaVA Next and OpenVINO](./llava-next-multimodal-chatbot/llava-next-multimodal-chatbot.ipynb)
-- [Visual-language assistant with Video-LLaVA and OpenVINO](./llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb)
 - [Visual-language assistant with LLaVA and OpenVINO Generative API](./llava-multimodal-chatbot/llava-multimodal-chatbot-genai.ipynb)
 - [Text-to-Image Generation with LCM LoRA and ControlNet Conditioning](./latent-consistency-models-image-generation/lcm-lora-controlnet.ipynb)
 - [Latent Consistency Model using Optimum-Intel OpenVINO](./latent-consistency-models-image-generation/latent-consistency-models-optimum-demo.ipynb)

notebooks/llava-multimodal-chatbot/README.md

+8 -24
@@ -10,30 +10,14 @@ While LLaVA excels at image-based tasks, Video-LLaVA expands this fluency to the

 In the field of artificial intelligence, the goal is to create a versatile assistant capable of understanding and executing tasks based on both visual and language inputs. Current approaches often rely on large vision models that solve tasks independently, with language only used to describe image content. While effective, these models have fixed interfaces with limited interactivity and adaptability to user instructions. On the other hand, large language models (LLMs) have shown promise as a universal interface for general-purpose assistants. By explicitly representing various task instructions in language, these models can be guided to switch and solve different tasks. To extend this capability to the multimodal domain, the [LLaVA paper](https://arxiv.org/abs/2304.08485) introduces `visual instruction-tuning`, a novel approach to building a general-purpose visual assistant.

-In this tutorial series we consider how to use LLaVA and Video-LLaVA model to build multimodal chatbot with OpenVINO help.
-
-## LLaVA
-### Notebook contents
-The tutorial consists from following steps:
-
-- Install prerequisites
-- Prepare input processor and tokenizer
-- Download original model
-- Compress model weights to 4 and 8 bits using NNCF
-- Convert model to OpenVINO Intermediate Representation (IR) format
-- Prepare OpenVINO-based inference pipeline
-- Run OpenVINO model
-
-## Video-LLaVA
-### Notebook contents
-The tutorial consists from following steps:
-
-- Install prerequisites
-- Download original model
-- Compress model weights to 4 and 8 bits using NNCF
-- Convert model to OpenVINO Intermediate Representation (IR) format
-- Prepare OpenVINO-based inference pipeline
-- Run OpenVINO model
+In this tutorial series we consider how to use the LLaVA model to build a multimodal chatbot with OpenVINO.
+
+## Visual-language assistant with LLaVA and OpenVINO Generative API
+This [notebook](./llava-multimodal-chatbot-genai.ipynb) demonstrates how to build a visual-language assistant using the [OpenVINO Generative API](https://github.com/openvinotoolkit/openvino.genai).
+
+## Visual-language assistant with LLaVA and Optimum Intel OpenVINO integration
+This [notebook](./llava-multimodal-chatbot-optimum.ipynb) demonstrates how to build a visual-language assistant using the [Optimum Intel](https://huggingface.co/docs/optimum/main/intel/index) OpenVINO integration.
+

 ## Installation instructions
 This is a self-contained example that relies solely on its own code.</br>
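
The two notebooks named in this README are not part of this excerpt, so, for orientation, here is a minimal sketch of each inference path. Everything below is illustrative: the model id, the ./llava_ov directory, and the sample image are assumptions, not contents of this commit.

A sketch of the Optimum Intel path (the approach llava-multimodal-chatbot-optimum.ipynb is built around), assuming optimum-intel with OpenVINO support and transformers are installed:

# Hedged sketch, not the notebook's code; model id and image are assumptions.
from optimum.intel import OVModelForVisualCausalLM
from transformers import AutoProcessor
from PIL import Image

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
model = OVModelForVisualCausalLM.from_pretrained(model_id, export=True, device="CPU")

image = Image.open("cat.png")  # illustrative input image
conversation = [{"role": "user", "content": [{"type": "text", "text": "What is on the image?"}, {"type": "image"}]}]
prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))

And a sketch of the GenAI path (the approach llava-multimodal-chatbot-genai.ipynb is built around), assuming a LLaVA model already converted to OpenVINO IR in ./llava_ov and the VLMPipeline API from openvino-genai:

# Hedged sketch, not the notebook's code; paths are assumptions.
import numpy as np
import openvino as ov
import openvino_genai
from PIL import Image

pipe = openvino_genai.VLMPipeline("./llava_ov", "CPU")
# the pipeline takes the image as an ov.Tensor of raw uint8 RGB data (1 x H x W x 3)
image = ov.Tensor(np.array(Image.open("cat.png").convert("RGB"))[None])
print(pipe.generate("What is on the image?", image=image, max_new_tokens=100))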

notebooks/llava-multimodal-chatbot/gradio_helper.py

+91 -65
@@ -1,10 +1,9 @@
 from pathlib import Path
-from typing import Callable
+import inspect
 import gradio as gr


 from PIL import Image
-from typing import Callable
 import numpy as np
 import requests
 from threading import Event, Thread
@@ -132,70 +130,98 @@ def generate_and_signal_complete():
     return demo


-def make_demo_videollava(fn: Callable):
-    examples_dir = Path("Video-LLaVA/videollava/serve/examples")
-    gr.close_all()
-    demo = gr.Interface(
-        fn=fn,
-        inputs=[
-            gr.Image(label="Input Image", type="filepath"),
-            gr.Video(label="Input Video"),
-            gr.Textbox(label="Question"),
-        ],
-        outputs=gr.Textbox(lines=10),
+def make_demo_llava_optimum(model, processor):
+    from transformers import TextIteratorStreamer
+
+    has_additional_buttons = "undo_button" in inspect.signature(gr.ChatInterface.__init__).parameters
+
+    def bot_streaming(message, history):
+        print(f"message is - {message}")
+        print(f"history is - {history}")
+        files = message["files"] if isinstance(message, dict) else message.files
+        message_text = message["text"] if isinstance(message, dict) else message.text
+        if files:
+            # message["files"][-1] is a Dict or just a string
+            if isinstance(files[-1], dict):
+                image = files[-1]["path"]
+            else:
+                if isinstance(files[-1], (str, Path)):
+                    image = files[-1]
+                else:
+                    image = files[-1] if isinstance(files[-1], (list, tuple)) else files[-1].path
+        else:
+            # if there's no image uploaded for this turn, look for images in the past turns
+            # kept inside tuples, take the last one
+            for hist in history:
+                if isinstance(hist[0], tuple):
+                    image = hist[0][0]
+        try:
+            if image is None:
+                # Handle the case where image is None
+                raise gr.Error("You need to upload an image for LLaVA to work. Close the error and try again with an image.")
+        except NameError:
+            # Handle the case where 'image' is not defined at all
+            raise gr.Error("You need to upload an image for LLaVA to work. Close the error and try again with an image.")
+
+        conversation = []
+        flag = False
+        for user, assistant in history:
+            if assistant is None:
+                # no assistant reply yet: open a user turn and fill in its text on the next pass
+                flag = True
+                conversation.extend([{"role": "user", "content": []}])
+                continue
+            if flag:
+                conversation[0]["content"] = [{"type": "text", "text": f"{user}"}]
+                conversation.append({"role": "assistant", "text": assistant})
+                flag = False
+                continue
+            conversation.extend([{"role": "user", "content": [{"type": "text", "text": user}]}, {"role": "assistant", "text": assistant}])
+
+        conversation.append({"role": "user", "content": [{"type": "text", "text": f"{message_text}"}, {"type": "image"}]})
+        prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
+        print(f"prompt is -\n{prompt}")
+        image = Image.open(image)
+        inputs = processor(text=prompt, images=image, return_tensors="pt")
+
+        streamer = TextIteratorStreamer(
+            processor,
+            **{
+                "skip_special_tokens": True,
+                "skip_prompt": True,
+                "clean_up_tokenization_spaces": False,
+            },
+        )
+        generation_kwargs = dict(
+            inputs,
+            streamer=streamer,
+            max_new_tokens=1024,
+            do_sample=False,
+            temperature=0.0,
+            eos_token_id=processor.tokenizer.eos_token_id,
+        )
+
+        thread = Thread(target=model.generate, kwargs=generation_kwargs)
+        thread.start()
+
+        buffer = ""
+        for new_text in streamer:
+            buffer += new_text
+            yield buffer
+
+    additional_buttons = {}
+    if has_additional_buttons:
+        additional_buttons = {"undo_button": None, "retry_button": None}
+
+    demo = gr.ChatInterface(
+        fn=bot_streaming,
+        title="LLaVA OpenVINO Chatbot",
         examples=[
-            [
-                f"{examples_dir}/extreme_ironing.jpg",
-                None,
-                "What is unusual about this image?",
-            ],
-            [
-                f"{examples_dir}/waterview.jpg",
-                None,
-                "What are the things I should be cautious about when I visit here?",
-            ],
-            [
-                f"{examples_dir}/desert.jpg",
-                None,
-                "If there are factual errors in the questions, point it out; if not, proceed answering the question. What’s happening in the desert?",
-            ],
-            [
-                None,
-                f"{examples_dir}/sample_demo_1.mp4",
-                "Why is this video funny?",
-            ],
-            [
-                None,
-                f"{examples_dir}/sample_demo_3.mp4",
-                "Can you identify any safety hazards in this video?",
-            ],
-            [
-                None,
-                f"{examples_dir}/sample_demo_9.mp4",
-                "Describe the video.",
-            ],
-            [
-                None,
-                f"{examples_dir}/sample_demo_22.mp4",
-                "Describe the activity in the video.",
-            ],
-            [
-                f"{examples_dir}/sample_img_22.png",
-                f"{examples_dir}/sample_demo_22.mp4",
-                "Are the instruments in the pictures used in the video?",
-            ],
-            [
-                f"{examples_dir}/sample_img_13.png",
-                f"{examples_dir}/sample_demo_13.mp4",
-                "Does the flag in the image appear in the video?",
-            ],
-            [
-                f"{examples_dir}/sample_img_8.png",
-                f"{examples_dir}/sample_demo_8.mp4",
-                "Are the image and the video depicting the same place?",
-            ],
+            {"text": "What is on the flower?", "files": ["./bee.jpg"]},
+            {"text": "How to make this pastry?", "files": ["./baklava.png"]},
         ],
-        title="Video-LLaVA🚀",
-        allow_flagging="never",
+        stop_btn=None,
+        multimodal=True,
+        **additional_buttons,
     )
     return demo
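
The new helper takes an already-loaded model and processor instead of a generation callback, so wiring it up from the notebook would look roughly like the following. This is a sketch under the assumption that the optimum notebook loads the model through OVModelForVisualCausalLM from a locally exported ./llava_ov directory; neither detail is shown in this excerpt.

# Hypothetical wiring of the new demo helper; paths and model id are assumptions.
from transformers import AutoProcessor
from optimum.intel import OVModelForVisualCausalLM
from gradio_helper import make_demo_llava_optimum

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")  # assumed checkpoint
model = OVModelForVisualCausalLM.from_pretrained("./llava_ov", device="CPU")  # previously exported IR

demo = make_demo_llava_optimum(model, processor)
demo.launch()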
