5 | 5 | "id": "aeb16663-be53-4260-b62d-44611b6771ec",
6 | 6 | "metadata": {},
7 | 7 | "source": [
8 | | - "# Chat and Code with Phi-2 with OpenVINO™ and 🤗 Optimum on Intel® Meteor Lake iGPU\n",
9 | | - "In this notebook we will show how to export and quantize Phi-2 to 4 bits.\n",
| 8 | + "# Chat and Code with Phi-2 with OpenVINO and 🤗 Optimum on Intel Meteor Lake iGPU\n",
| 9 | + "In this notebook we will show how to export and apply weight only quantization on Phi-2 to 4 bits.\n",
10 | 10 | "Then, using the quantized model, we will show how to generate code completions with the model running on the Intel Meteor Lake iGPU, demonstrating a good experience of running GenAI locally on an Intel PC and marking the start of the AI PC era!\n",
11 | | - "Then we will show how to talk with Phi-2 in a ChatBot demo running completley locally on your Laptop!"
| 11 | + "Then we will show how to talk with Phi-2 in a ChatBot demo running completely locally on your Laptop!\n",
| 12 | + "\n",
| 13 | + "[Phi-2](https://huggingface.co/microsoft/phi-2) is a 2.7 billion-parameter language model trained by Microsoft. Microsoft in the model's release [blog post](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/) states that Phi-2:\n",
| 14 | + "> demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation."
12 | 15 | ]
13 | 16 | },
14 | 17 | {
19 | 22 | "## Install dependencies\n",
20 | 23 | "Make sure you have the latest GPU drivers installed on your machine: https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html.\n",
21 | 24 | "\n",
22 | | - "We will start by installing the dependencies, you can either uncomment the following cell and run it."
| 25 | + "We will start by installing the dependencies, which can be done by uncommenting the following cell and running it."
23 | 26 | ]
24 | 27 | },
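For context, the install cell itself is unchanged and therefore collapsed in this diff. Below is a minimal sketch of the kind of setup cell this workflow typically needs; the exact package list is an assumption and is not taken from the notebook.

```python
# Hypothetical setup cell (assumption; the notebook's real install cell is collapsed in this diff).
# Uncomment the lines and run the cell once to install the dependencies.

# %pip install --upgrade "optimum[openvino]"   # Optimum Intel with the OpenVINO backend
# %pip install transformers torch gradio       # model/tokenizer loading and the chat UI
```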
25 | 28 | {
51 | 54 | "metadata": {},
52 | 55 | "source": [
53 | 56 | "## Configuration\n",
54 | | - "Here we will configure which model to load and other attributes. We will explain everything :)"
| 57 | + "Here we will configure which model to load and other attributes. We will explain everything 😄\n",
| 58 | + "* `model_name`: the name or path of the model we want to export and quantize, can be either on the 🤗 Hub or a local directory on your laptop.\n",
| 59 | + "* `save_name`: directory where the exported & quantized model will be saved.\n",
| 60 | + "* `precision`: the compute data type we will use for inference of the model, can be either `f32` or `f16`. We use FP32 precision due to Phi-2 overflow issues in FP16.\n",
| 61 | + "* `quantization_config`: here we set the attributes for the weight only quantization algorithm:\n",
| 62 | + "    * `bits`: number of bits to use for quantization, can be either `8` or `4`.\n",
| 63 | + "    * `sym`: whether to use symmetric quantization or not, can be either `True` or `False`.\n",
| 64 | + "    * `group_size`: number of weights to group together for quantization. We use groups of 128 to ensure no accuracy degradation.\n",
| 65 | + "    * `ratio`: the fraction of the weights to quantize to `bits` bits. The rest will be quantized to the default bit width, `8`.\n",
| 66 | + "* `device`: the device to use for inference, can be either `cpu` or `gpu`.\n",
| 67 | + "* `stateful`: optimize the model by making the KV cache part of the model's state instead of an input.\n",
| 68 | + "\n"
55 | 69 | ]
56 | 70 | },
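To make the `bits`/`ratio` trade-off concrete, a rough back-of-the-envelope estimate of the weight footprint for a 2.7 billion-parameter model under this configuration is sketched below. It ignores per-group scales, zero-points, and any layers kept at higher precision, so treat it as an approximation rather than a measured number.

```python
# Rough weight-size estimate for Phi-2 (2.7B parameters) under 4-bit weight-only quantization.
# Per-group scale/zero-point overhead is ignored, so the figures are approximate.
params = 2.7e9

fp16_gb = params * 2 / 1e9                                    # 16-bit baseline: 2 bytes per weight
woq_gb = (0.8 * params * 4 / 8 + 0.2 * params * 8 / 8) / 1e9  # 80% at 4 bits, 20% at 8 bits

print(f"FP16 weights          ~ {fp16_gb:.1f} GB")
print(f"4-bit WOQ, ratio=0.8  ~ {woq_gb:.1f} GB")
```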
57 | 71 | {
62 | 76 | "outputs": [],
63 | 77 | "source": [
64 | 78 | "model_name = 'microsoft/phi-2'\n",
65 | | - "save_name = './phi-2-woq4' # Directory where the exported & quantized model will be saved\n",
66 | | - "precision = 'f32' # We use FP32 precision due to Phi-2 overflow issues in FP16.\n",
| 79 | + "save_name = './phi-2-woq4'\n",
| 80 | + "precision = 'f32'\n",
67 | 81 | "quantization_config = OVWeightQuantizationConfig(\n",
68 | 82 | " bits=4,\n",
69 | | - "    sym=False, # Use asymmetric quantization\n",
70 | | - "    group_size=128, # Quantize weights in groups of 128 to ensure no accuracy degradation\n",
71 | | - "    ratio=0.8, # 80% of the model layers will be quantized to 4bit, the rest will be quantized to 8bit.\n",
| 83 | + "    sym=False,\n",
| 84 | + "    group_size=128,\n",
| 85 | + "    ratio=0.8,\n",
72 | 86 | ")\n",
73 | | - "device = 'gpu' # choose from ['cpu', 'gpu']\n",
74 | | - "stateful = True # Optimize model by setting the KV cache as part of the models state instead of as an input"
| 87 | + "device = 'gpu'\n",
| 88 | + "stateful = True"
75 | 89 | ]
76 | 90 | },
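The cells that actually run the export and quantization sit outside this hunk. As a hedged sketch of how the settings above would typically be consumed through the 🤗 Optimum Intel API (the call below is an assumption based on the public `OVModelForCausalLM` interface, not copied from the notebook):

```python
from transformers import AutoTokenizer
# OVWeightQuantizationConfig is the class instantiated in the configuration cell above.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Export the PyTorch checkpoint to OpenVINO IR and apply 4-bit weight-only quantization on the fly.
model = OVModelForCausalLM.from_pretrained(
    model_name,
    export=True,
    quantization_config=quantization_config,
    device=device,                                        # 'gpu' targets the Meteor Lake iGPU
    stateful=stateful,                                    # keep the KV cache inside the model state
    ov_config={"INFERENCE_PRECISION_HINT": precision},    # f32 to avoid Phi-2 FP16 overflow
)

# Persist the quantized model and tokenizer so later cells can reload them from `save_name`.
model.save_pretrained(save_name)
AutoTokenizer.from_pretrained(model_name).save_pretrained(save_name)
```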
77 | 91 | {
444 | 458 | "\n",
445 | 459 | "\n",
446 | 460 | "with gr.Blocks(theme=gr.themes.Soft()) as demo:\n",
447 | | - "    gr.Markdown('<h1 style=\"text-align: center;\">Talk with Phi-2 on Meteor Lake iGPU</h1>')\n",
| 461 | + "    gr.Markdown('<h1 style=\"text-align: center;\">Chat with Phi-2 on Meteor Lake iGPU</h1>')\n",
448 | 462 | " chatbot = gr.Chatbot()\n",
449 | 463 | " with gr.Row():\n",
450 | 464 | " msg = gr.Textbox(placeholder=\"Enter message here...\", show_label=False, autofocus=True, scale=75)\n",
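The excerpt ends mid-cell, so the notebook's actual submit handler is not shown here. For orientation, below is a hedged sketch of how the widgets above are commonly wired to a streaming generation loop; the handler names, prompt format, and generation parameters are assumptions rather than the notebook's own code, and `model`/`tokenizer` are the objects loaded in earlier cells.

```python
from threading import Thread
from transformers import TextIteratorStreamer

def user(message, history):
    # Append the new user turn to the chat history and clear the textbox.
    return "", history + [[message, None]]

def bot(history):
    # Build a plain-text prompt from the history (format is an assumption, not the notebook's template).
    prompt = "".join(f"User: {u}\nAssistant: {a}\n" for u, a in history[:-1])
    prompt += f"User: {history[-1][0]}\nAssistant:"

    inputs = tokenizer(prompt, return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    Thread(target=model.generate,
           kwargs=dict(**inputs, streamer=streamer, max_new_tokens=256)).start()

    # Stream generated text into the last chat turn as it is produced.
    history[-1][1] = ""
    for chunk in streamer:
        history[-1][1] += chunk
        yield history

# These listeners belong inside the `with gr.Blocks(...) as demo:` block shown above.
msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(bot, chatbot, chatbot)
demo.queue().launch()
```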