Commit b0644bc

pixtral notebook (openvinotoolkit#2426)
1 parent bd09273 commit b0644bc

File tree

6 files changed: +592 −1 lines changed


.ci/ignore_treon_docker.txt

+2 −1

```diff
@@ -82,4 +82,5 @@ notebooks/qwen2-vl/qwen2-vl.ipynb
 notebooks/qwen2-audio/qwen2-audio.ipynb
 notebooks/stable-fast-3d/stable-fast-3d.ipynb
 notebooks/mllama-3.2/mllama-3.2.ipynb
-notebooks/segment-anything/segment-anything-2-image.ipynb
+notebooks/segment-anything/segment-anything-2-image.ipynb
+notebooks/pixtral/pixtral.ipynb
```
.ci/skipped_notebooks.yml

+7

```diff
@@ -583,3 +583,10 @@
     - '3.9'
   - os:
     - macos-12
+- notebook: notebooks/pixtral/pixtral.ipynb
+  skips:
+    - os:
+      - macos-12
+      - ubuntu-20.04
+      - ubuntu-22.04
+      - windows-2019
```

.ci/spellcheck/.pyspelling.wordlist.txt

+3

```diff
@@ -527,6 +527,7 @@ nar
 NAS
 natively
 NCE
+Nemo
 NEOX
 NER
 NETP
@@ -633,6 +634,8 @@ PixArt
 PIXART
 PixelShuffleUpsampleNetwork
 pixelwise
+Pixtral
+pixtral
 PIL
 PNDM
 png
```

notebooks/pixtral/README.md

+29

```markdown
# Visual-language assistant with Pixtral and OpenVINO

Pixtral-12B is a multimodal model consisting of a 12B-parameter multimodal decoder based on Mistral Nemo and a 400M-parameter vision encoder trained from scratch. It is trained to understand both natural images and documents. The model shows strong abilities in tasks such as chart and figure understanding, document question answering, multimodal reasoning, and instruction following. Pixtral is able to ingest images at their natural resolution and aspect ratio, giving the user flexibility over the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Unlike previous open-source models, Pixtral does not compromise on text benchmark performance to excel in multimodal tasks.

![](https://mistral.ai/images/news/pixtral-12b/pixtral-model-architecture.png)

More details about the model are available in the [blog post](https://mistral.ai/news/pixtral-12b/) and the [model card](https://huggingface.co/mistralai/Pixtral-12B-2409).

In this tutorial, we consider how to convert, optimize, and run this model using OpenVINO.

## Notebook contents

The tutorial consists of the following steps:

- Install requirements
- Convert and optimize the model
- Run OpenVINO model inference
- Launch the interactive demo

In this demonstration, you'll create an interactive chatbot that can answer questions about the content of a provided image.

The image below illustrates an example of an input prompt and the model's answer.

![example.png](https://github.com/user-attachments/assets/b61a9e8e-32c7-4b60-aa00-b4b867e823be)

## Installation instructions

This is a self-contained example that relies solely on its own code.<br/>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to the [Installation Guide](../../README.md).

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/pixtral/README.md" />
```
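Since Pixtral ingests images at their natural resolution, the image-token budget grows with the image size. The sketch below is a rough illustration only, assuming a ViT-style encoder with 16×16 patches and one token per patch; Pixtral's exact accounting (e.g. row-break tokens) is not reproduced here, and `approx_image_tokens` is a name introduced for this example:

```python
import math


def approx_image_tokens(width, height, patch_size=16):
    # Illustrative only: one token per patch_size x patch_size patch,
    # with partial patches at the edges rounded up.
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)


print(approx_image_tokens(512, 512))   # 1024 patch tokens
print(approx_image_tokens(1024, 768))  # 3072 patch tokens
```

This is why downscaling an image before inference (as the demo helper below does) directly reduces the number of tokens the model must process.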

notebooks/pixtral/gradio_helper.py

+122

```python
from pathlib import Path
from threading import Thread

import requests
import gradio as gr
from PIL import Image
from transformers import TextIteratorStreamer

# Fallback Jinja chat template, applied when the processor does not ship one.
chat_template = """
{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}\n {{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}\n {%- endif %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST]\" + system_message + \"\n\n\" }}\n {%- else %}\n {{- \"[INST]\" }}\n {%- endif %}\n {%- if message[\"content\"] is not string %}\n {%- for chunk in message[\"content\"] %}\n {%- if chunk[\"type\"] == \"text\" %}\n {{- chunk[\"content\"] }}\n {%- elif chunk[\"type\"] == \"image\" %}\n {{- \"[IMG]\" }}\n {%- else %}\n {{- raise_exception(\"Unrecognized content type!\") }}\n {%- endif %}\n {%- endfor %}\n {%- else %}\n {{- message[\"content\"] }}\n {%- endif %}\n {{- \"[/INST]\" }}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- message[\"content\"] + eos_token}}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}
"""


def resize_with_aspect_ratio(image: Image.Image, dst_height=512, dst_width=512):
    """Downscale the image to fit within dst_width x dst_height, preserving aspect ratio."""
    width, height = image.size
    if width > dst_width or height > dst_height:
        im_scale = min(dst_height / height, dst_width / width)
        resize_size = (int(width * im_scale), int(height * im_scale))
        return image.resize(resize_size)
    return image


def make_demo(model, processor):
    model_name = Path(model.config._name_or_path).parent.name

    example_image_urls = [
        ("https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/dd5105d6-6a64-4935-8a34-3058a82c8d5d", "small.png"),
        ("https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/1221e2a8-a6da-413a-9af6-f04d56af3754", "chart.png"),
    ]

    # Download the example images once.
    for url, file_name in example_image_urls:
        if not Path(file_name).exists():
            Image.open(requests.get(url, stream=True).raw).save(file_name)
    if processor.chat_template is None:
        processor.set_chat_template(chat_template)

    def bot_streaming(message, history):
        print(f"message is - {message}")
        print(f"history is - {history}")
        files = message["files"] if isinstance(message, dict) else message.files
        message_text = message["text"] if isinstance(message, dict) else message.text

        image = None
        if files:
            # files[-1] is a dict, a path string/tuple, or an object with a .path attribute
            if isinstance(files[-1], dict):
                image = files[-1]["path"]
            else:
                image = files[-1] if isinstance(files[-1], (list, tuple)) else files[-1].path
        else:
            # If there is no image uploaded for this turn, look for images in past
            # turns, kept inside tuples, and take the last one.
            for hist in history:
                if isinstance(hist[0], tuple):
                    image = hist[0][0]
        if image is None:
            raise gr.Error("You need to upload an image for the model to work. Close the error and try again with an image.")

        # Rebuild the conversation in the structured chat format expected by the
        # processor's chat template.
        conversation = []
        flag = False
        for user, assistant in history:
            if assistant is None:
                # First half of an unanswered turn: create a placeholder and fill
                # its content on the next iteration.
                flag = True
                conversation.append({"role": "user", "content": []})
                continue
            if flag:
                conversation[0]["content"] = [{"type": "text", "content": f"{user}"}]
                conversation.append({"role": "assistant", "content": assistant})
                flag = False
                continue
            conversation.extend([{"role": "user", "content": [{"type": "text", "content": user}]}, {"role": "assistant", "content": assistant}])

        conversation.append({"role": "user", "content": [{"type": "text", "content": f"{message_text}"}, {"type": "image"}]})
        prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
        print(f"prompt is -\n{prompt}")
        image = Image.open(image)
        image = resize_with_aspect_ratio(image)
        inputs = processor(prompt, image, return_tensors="pt")

        streamer = TextIteratorStreamer(
            processor,
            skip_special_tokens=True,
            skip_prompt=True,
            clean_up_tokenization_spaces=False,
        )
        generation_kwargs = dict(
            inputs,
            streamer=streamer,
            max_new_tokens=1024,
            do_sample=False,
            eos_token_id=processor.tokenizer.eos_token_id,
        )

        # Run generation in a background thread so tokens can be streamed to the UI.
        thread = Thread(target=model.generate, kwargs=generation_kwargs)
        thread.start()

        buffer = ""
        for new_text in streamer:
            buffer += new_text
            yield buffer

    demo = gr.ChatInterface(
        fn=bot_streaming,
        title=f"{model_name} with OpenVINO",
        examples=[
            {"text": "What is the text saying?", "files": ["./small.png"]},
            {"text": "What does the chart display?", "files": ["./chart.png"]},
        ],
        description=f"{model_name} with OpenVINO. Upload an image and start chatting about it, or simply try one of the examples below. If you don't upload an image, you will receive an error.",
        stop_btn=None,
        retry_btn=None,
        undo_btn=None,
        multimodal=True,
    )

    return demo
```
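The sizing arithmetic in `resize_with_aspect_ratio` can be checked without PIL. `fit_within` below is a hypothetical pure-Python mirror of that helper's logic, introduced here only for illustration:

```python
def fit_within(width, height, dst_height=512, dst_width=512):
    # Mirror of the helper's logic: scale down only if the image exceeds the
    # destination box, picking the smaller scale so both sides fit.
    if width > dst_width or height > dst_height:
        im_scale = min(dst_height / height, dst_width / width)
        return (int(width * im_scale), int(height * im_scale))
    return (width, height)


print(fit_within(1024, 768))  # (512, 384): scale 0.5 chosen by the wider side
print(fit_within(300, 200))   # (300, 200): already fits, returned unchanged
```

Note that images are never upscaled; anything already inside the 512×512 box passes through untouched.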

0 commit comments
