5 | 5 | "id": "aeb16663-be53-4260-b62d-44611b6771ec",
6 | 6 | "metadata": {},
7 | 7 | "source": [
8 | | - "# Chat and Code with Phi-2 with OpenVINO™ and 🤗 Optimum on Intel® Meteor Lake iGPU\n",
9 | | - "In this notebook we will show how to export and quantize Phi-2 to 4 bits.\n",
| 8 | + "# Chat and Code with Phi-2 with OpenVINO and 🤗 Optimum on Intel Meteor Lake iGPU\n",
| 9 | + "In this notebook we will show how to export and apply weight only quantization on Phi-2 to 4 bits.\n",
10 | 10 | "Then, using the quantized model, we will show how to generate code completions with the model running on the Intel Meteor Lake iGPU, demonstrating a good experience of running GenAI locally on an Intel PC and marking the start of the AI PC era!\n",
11 | | - "Then we will show how to talk with Phi-2 in a ChatBot demo running completley locally on your Laptop!"
| 11 | + "Then we will show how to talk with Phi-2 in a ChatBot demo running completely locally on your Laptop!\n",
| 12 | + "\n",
| 13 | + "[Phi-2](https://huggingface.co/microsoft/phi-2) is a 2.7 billion-parameter language model trained by Microsoft. Microsoft in the model's release [blog post](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/) states that Phi-2:\n",
| 14 | + "> demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation."
12 | 15 | ]
13 | 16 | },
14 | 17 | {
19 | 22 | "## Install dependencies\n",
20 | 23 | "Make sure you have the latest GPU drivers installed on your machine: https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html.\n",
21 | 24 | "\n",
22 | | - "We will start by installing the dependencies, you can either uncomment the following cell and run it."
| 25 | + "We will start by installing the dependencies, which can be done by uncommenting the following cell and running it."
23 | 26 | ]
24 | 27 | },
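For context, the install cell itself is unchanged and therefore collapsed in this diff. Below is a minimal sketch of the kind of setup cell this workflow typically needs; the exact package list is an assumption and is not taken from the notebook.

```python
# Hypothetical setup cell (assumption; the notebook's real install cell is collapsed in this diff).
# Uncomment the lines and run the cell once to install the dependencies.

# %pip install --upgrade "optimum[openvino]"   # Optimum Intel with the OpenVINO backend
# %pip install transformers torch gradio       # model/tokenizer loading and the chat UI
```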
25 | 28 | {
51 | 54 | "metadata": {},
52 | 55 | "source": [
53 | 56 | "## Configuration\n",
54 | | - "Here we will configure which model to load and other attributes. We will explain everything :)"
| 57 | + "Here we will configure which model to load and other attributes. We will explain everything 😄\n",
| 58 | + "* `model_name`: the name or path of the model we want to export and quantize, can be either on the 🤗 Hub or a local directory on your laptop.\n",
| 59 | + "* `save_name`: directory where the exported & quantized model will be saved.\n",
| 60 | + "* `precision`: the compute data type we will use for inference of the model, can be either `f32` or `f16`. We use FP32 precision due to Phi-2 overflow issues in FP16.\n",
| 61 | + "* `quantization_config`: here we set the attributes for the weight only quantization algorithm:\n",
| 62 | + "    * `bits`: number of bits to use for quantization, can be either `8` or `4`.\n",
| 63 | + "    * `sym`: whether to use symmetric quantization or not, can be either `True` or `False`.\n",
| 64 | + "    * `group_size`: number of weights to group together for quantization. We use groups of 128 to ensure no accuracy degradation.\n",
| 65 | + "    * `ratio`: the fraction of the weights to quantize to `bits` bits. The rest will be quantized to the default bit width, `8`.\n",
| 66 | + "* `device`: the device to use for inference, can be either `cpu` or `gpu`.\n",
| 67 | + "* `stateful`: optimize the model by making the KV cache part of the model's state instead of an input.\n",
| 68 | + "\n"
55 | 69 | ]
56 | 70 | },
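To make the `bits`/`ratio` trade-off concrete, a rough back-of-the-envelope estimate of the weight footprint for a 2.7 billion-parameter model under this configuration is sketched below. It ignores per-group scales, zero-points, and any layers kept at higher precision, so treat it as an approximation rather than a measured number.

```python
# Rough weight-size estimate for Phi-2 (2.7B parameters) under 4-bit weight-only quantization.
# Per-group scale/zero-point overhead is ignored, so the figures are approximate.
params = 2.7e9

fp16_gb = params * 2 / 1e9                                    # 16-bit baseline: 2 bytes per weight
woq_gb = (0.8 * params * 4 / 8 + 0.2 * params * 8 / 8) / 1e9  # 80% at 4 bits, 20% at 8 bits

print(f"FP16 weights          ~ {fp16_gb:.1f} GB")
print(f"4-bit WOQ, ratio=0.8  ~ {woq_gb:.1f} GB")
```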
57 | 71 | {
62 | 76 | "outputs": [],
63 | 77 | "source": [
64 | 78 | "model_name = 'microsoft/phi-2'\n",
65 | | - "save_name = './phi-2-woq4' # Directory where the exported & quantized model will be saved\n",
66 | | - "precision = 'f32' # We use FP32 precision due to Phi-2 overflow issues in FP16.\n",
| 79 | + "save_name = './phi-2-woq4'\n",
| 80 | + "precision = 'f32'\n",
67 | 81 | "quantization_config = OVWeightQuantizationConfig(\n",
68 | 82 | " bits=4,\n",
69 | | - "    sym=False, # Use asymmetric quantization\n",
70 | | - "    group_size=128, # Quantize weights in groups of 128 to ensure no accuracy degradation\n",
71 | | - "    ratio=0.8, # 80% of the model layers will be quantized to 4bit, the rest will be quantized to 8bit.\n",
| 83 | + "    sym=False,\n",
| 84 | + "    group_size=128,\n",
| 85 | + "    ratio=0.8,\n",
72 | 86 | ")\n",
73 | | - "device = 'gpu' # choose from ['cpu', 'gpu']\n",
74 | | - "stateful = True # Optimize model by setting the KV cache as part of the models state instead of as an input"
| 87 | + "device = 'gpu'\n",
| 88 | + "stateful = True"
75 | 89 | ]
76 | 90 | },
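The cells that actually run the export and quantization sit outside this hunk. As a hedged sketch of how the settings above would typically be consumed through the 🤗 Optimum Intel API (the call below is an assumption based on the public `OVModelForCausalLM` interface, not copied from the notebook):

```python
from transformers import AutoTokenizer
# OVWeightQuantizationConfig is the class instantiated in the configuration cell above.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Export the PyTorch checkpoint to OpenVINO IR and apply 4-bit weight-only quantization on the fly.
model = OVModelForCausalLM.from_pretrained(
    model_name,
    export=True,
    quantization_config=quantization_config,
    device=device,                                        # 'gpu' targets the Meteor Lake iGPU
    stateful=stateful,                                    # keep the KV cache inside the model state
    ov_config={"INFERENCE_PRECISION_HINT": precision},    # f32 to avoid Phi-2 FP16 overflow
)

# Persist the quantized model and tokenizer so later cells can reload them from `save_name`.
model.save_pretrained(save_name)
AutoTokenizer.from_pretrained(model_name).save_pretrained(save_name)
```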
77 | 91 | {
444 | 458 | "\n",
445 | 459 | "\n",
446 | 460 | "with gr.Blocks(theme=gr.themes.Soft()) as demo:\n",
447 | | - "    gr.Markdown('<h1 style=\"text-align: center;\">Talk with Phi-2 on Meteor Lake iGPU</h1>')\n",
| 461 | + "    gr.Markdown('<h1 style=\"text-align: center;\">Chat with Phi-2 on Meteor Lake iGPU</h1>')\n",
448 | 462 | " chatbot = gr.Chatbot()\n",
449 | 463 | " with gr.Row():\n",
450 | 464 | " msg = gr.Textbox(placeholder=\"Enter message here...\", show_label=False, autofocus=True, scale=75)\n",
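The excerpt ends mid-cell, so the notebook's actual submit handler is not shown here. For orientation, below is a hedged sketch of how the widgets above are commonly wired to a streaming generation loop; the handler names, prompt format, and generation parameters are assumptions rather than the notebook's own code, and `model`/`tokenizer` are the objects loaded in earlier cells.

```python
from threading import Thread
from transformers import TextIteratorStreamer

def user(message, history):
    # Append the new user turn to the chat history and clear the textbox.
    return "", history + [[message, None]]

def bot(history):
    # Build a plain-text prompt from the history (format is an assumption, not the notebook's template).
    prompt = "".join(f"User: {u}\nAssistant: {a}\n" for u, a in history[:-1])
    prompt += f"User: {history[-1][0]}\nAssistant:"

    inputs = tokenizer(prompt, return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    Thread(target=model.generate,
           kwargs=dict(**inputs, streamer=streamer, max_new_tokens=256)).start()

    # Stream generated text into the last chat turn as it is produced.
    history[-1][1] = ""
    for chunk in streamer:
        history[-1][1] += chunk
        yield history

# These listeners belong inside the `with gr.Blocks(...) as demo:` block shown above.
msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(bot, chatbot, chatbot)
demo.queue().launch()
```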