Add notebook for llava multimodal chatbot (openvinotoolkit#1353)

eaidova · web-flow · commit 7e950767c325 · 2023-10-05T14:44:34.000+04:00
* Add notebook for llava multimodal chatbot

* readme, text, requirements

* spell check

* gradio

* code style

* fix readme

* refactoring and better explanation

* Update README.md
diff --git a/.ci/ignore_convert_execution.txt b/.ci/ignore_convert_execution.txt
@@ -28,4 +28,5 @@ notebooks/248-stable-diffusion-xl/248-stable-diffusion-xl.ipynb
 notebooks/249-oneformer-segmentation/249-oneformer-segmentation.ipynb
 notebooks/251-tiny-sd-image-generation/251-tiny-sd-image-generation.ipynb
 notebooks/254-llm-chatbot/254-llm-chatbot.ipynb
+notebooks/257-llava-multimodal-chatbot/257-llava-multimodal-chatbot.ipynb
 notebooks/404-style-transfer-webcam/404-style-transfer.ipynb
diff --git a/.ci/ignore_treon_docker.txt b/.ci/ignore_treon_docker.txt
@@ -18,6 +18,7 @@
 254-llm-chatbot
 255-mms-massively-multilingual-speech
 256-bark-text-to-audio
+257-llava-multimodal-chatbot
 301-tensorflow-training-openvino
 305-tensorflow-quantization-aware-training
 404-style-transfer-webcam
diff --git a/.ci/ignore_treon_linux.txt b/.ci/ignore_treon_linux.txt
@@ -20,4 +20,5 @@
 254-llm-chatbot
 255-mms-massively-multilingual-speech
 256-bark-text-to-audio
+257-llava-multimodal-chatbot
 404-style-transfer-webcam
diff --git a/.ci/ignore_treon_mac.txt b/.ci/ignore_treon_mac.txt
@@ -20,4 +20,5 @@
 254-llm-chatbot
 255-mms-massively-multilingual-speech
 256-bark-text-to-audio
+257-llava-multimodal-chatbot
 404-style-transfer-webcam
diff --git a/.ci/ignore_treon_win.txt b/.ci/ignore_treon_win.txt
@@ -22,3 +22,4 @@
 254-llm-chatbot
 255-mms-massively-multilingual-speech
 256-bark-text-to-audio
+257-llava-multimodal-chatbot
diff --git a/.ci/spellcheck/.pyspelling.wordlist.txt b/.ci/spellcheck/.pyspelling.wordlist.txt
@@ -240,13 +240,15 @@ LaBSE
 LAION
 Lasinger
 latents
-LLaMa
 LeViT
 LibriSpeech
 librispeech
 Lim
 Liu
 LLama
+LLaMa
+LLaVA
+llava
 llm
 LLM
 LLMs
diff --git a/README.md b/README.md
@@ -32,6 +32,7 @@ Check out the latest notebooks that show how to optimize and deploy popular mode
 | [ZeroScope Text-to-video synthesis](notebooks/253-zeroscope-text2video)<br> | Text-to-video synthesis with ZeroScope and OpenVINO™ | A panda eating bamboo on a rock <img src="https://github.com/itrushkin/openvino_notebooks/assets/76161256/500956d5-4aac-4710-a77c-4df34bcda3be" width=300> |
 | [LLM chatbot](notebooks/254-llm-chatbot)<br> | Create LLM-powered Chatbot using OpenVINO™ |  <img src=https://user-images.githubusercontent.com/29454499/255799218-611e7189-8979-4ef5-8a80-5a75e0136b50.png width=600> |
 | [Bark Text-to-Speech](notebooks/256-bark-text-to-audio/)<br> | Text-to-Speech generation using Bark and OpenVINO™ | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/9a770279-0045-480e-95f2-1a2f2d0a5115 width=300>
+| [LLaVA Multimodal Chatbot](notebooks/257-llava-multimodal-chatbot/)<br> | Visual-language assistant with LLaVA and OpenVINO™ | <img src=https://raw.githubusercontent.com/haotian-liu/LLaVA/main/images/llava_logo.png width=300>
 
 ## Table of Contents
 
@@ -186,6 +187,7 @@ Demos that demonstrate inference on a particular model.
 | [254-llm-chatbot](notebooks/254-llm-chatbot)<br> | Create LLM-powered Chatbot using OpenVINO™ |  <img src=https://user-images.githubusercontent.com/29454499/255799218-611e7189-8979-4ef5-8a80-5a75e0136b50.png width=600> |
 | [255-mms-massively-multilingual-speech](notebooks/255-mms-massively-multilingual-speech/)<br> | MMS: Scaling Speech Technology to 1000+ languages with OpenVINO™ | |
 | [256-bark-text-to-audio](notebooks/256-bark-text-to-audio)<br> | Text-to-Speech generation using Bark and OpenVINO™ |  <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/9a770279-0045-480e-95f2-1a2f2d0a5115 width=225> |
+| [257-llava-multimodal-chatbot](notebooks/257-llava-multimodal-chatbot)<br> | Visual-language assistant with LLaVA and OpenVINO™ |  <img src=https://raw.githubusercontent.com/haotian-liu/LLaVA/main/images/llava_logo.png width=225> |
 
 <div id='-model-training'></div>
 
diff --git a/notebooks/257-llava-multimodal-chatbot/257-llava-multimodal-chatbot.ipynb b/notebooks/257-llava-multimodal-chatbot/257-llava-multimodal-chatbot.ipynb
diff --git a/notebooks/257-llava-multimodal-chatbot/README.md b/notebooks/257-llava-multimodal-chatbot/README.md
@@ -0,0 +1,27 @@
+# Visual-language assistant with LLaVA and OpenVINO
+
+![llava_logo.png](https://raw.githubusercontent.com/haotian-liu/LLaVA/main/images/llava_logo.png)
+
+*image source: [LLaVA repository](https://github.com/haotian-liu/LLaVA/blob/main/images/llava_logo.png)*
+
+[LLaVA](https://llava-vl.github.io) (Large Language and Vision Assistant) is a large multimodal model designed as a general-purpose visual assistant that can follow both language and image instructions to complete various real-world tasks. The idea is to combine the power of large language models (LLMs) with vision encoders like CLIP to create an end-to-end trained neural assistant that understands and acts upon multimodal instructions.
+
+A long-standing goal in artificial intelligence is to create a versatile assistant capable of understanding and executing tasks based on both visual and language inputs. Current approaches often rely on large vision models that solve each task independently, with language used only to describe image content. While effective, these models have fixed interfaces with limited interactivity and adaptability to user instructions. Large language models (LLMs), on the other hand, have shown promise as a universal interface for general-purpose assistants. Because task instructions can be expressed explicitly in language, these models can be guided to switch between tasks and solve them. To extend this capability to the multimodal domain, the [LLaVA paper](https://arxiv.org/abs/2304.08485) introduces `visual instruction-tuning`, a novel approach to building a general-purpose visual assistant.
+
+This tutorial shows how to use the LLaVA model to build a multimodal chatbot with OpenVINO.
+
+## Notebook contents
+The tutorial consists of the following steps:
+
+- Install prerequisites
+- Prepare input processor and tokenizer
+- Download original model
+- Compress model weights to INT8 using NNCF
+- Convert model to OpenVINO Intermediate Representation (IR) format
+- Prepare OpenVINO-based inference pipeline
+- Run OpenVINO model
+
+## Installation instructions
+This is a self-contained example that relies solely on its own code.<br>
+We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
+For details, please refer to the [Installation Guide](../../README.md).