
Commit fd9b905

Philip/even more docs (#563)
* tutorial refactor
* CLI reference
* usage
* user guide
* fix up 3 tutorials
* 2 more tutorials
* VLLM tutorial
1 parent 5f0fc96 commit fd9b905

31 files changed, +1382 −383 lines changed

README.md

+1 −1
@@ -8,7 +8,7 @@
## Why Truss?

* **Write once, run anywhere:** Package and test model code, weights, and dependencies with a model server that behaves the same in development and production.
-* **Fast developer loop:** Implement your model with fast feedback from a live reload server, and skip Docker and Kubernetes configuration with Truss' done-for-you model serving environment.
+* **Fast developer loop:** Implement your model with fast feedback from a live reload server, and skip Docker and Kubernetes configuration with a batteries-included model serving environment.
* **Support for all Python frameworks**: From `transformers` and `diffusers` to `PyTorch` and `Tensorflow` to `XGBoost` and `sklearn`, Truss supports models created with any framework, even entirely custom models.

See Trusses for popular models including:

docs/examples/models/overview.mdx

+8 −8
@@ -1,20 +1,20 @@
---
-title: Example models
-description: "Description"
+title: Example foundation models
+description: "Step-by-step packaging instructions"
---

<CardGroup cols={3}>
<Card title="Llama-2" icon="horse" href="/examples/models/llama-2">
-Lorem
+A commercially-licensed LLM by Meta
</Card>
<Card title="Stable Diffusion XL" icon="palette" href="/examples/models/sdxl">
-Lorem
+A text to image model by Stability AI
</Card>
<Card title="Whisper" icon="ear-listen" href="/examples/models/whisper">
-Lorem
+An audio transcription model by OpenAI
</Card>
</CardGroup>

-<Card title="More" icon="ear-listen" href="#">
-Lorem
-</Card>
+<Card title="More examples on GitHub" icon="github" href="https://github.com/basetenlabs/truss-examples">
+See Trusses for dozens of models on GitHub.
+</Card>
+219 −1
@@ -1,4 +1,222 @@
---
title: Load cached model weights
-description: "Description"
+description: "Deploy a model with private Hugging Face weights"
---
In this example, we will cover how you can use the `hf_cache` key in your Truss's `config.yaml` to automatically bundle model weights from a private Hugging Face repo.

<Tip>
Bundling model weights can significantly reduce cold start times because your instance won't waste time downloading the model weights from Hugging Face's servers.
</Tip>

We use `Llama-2-7b`, a popular open-source large language model, as an example. To follow along, you need to request access to Llama 2:

1. First, [sign up for a Hugging Face account](https://huggingface.co/join) if you don't already have one.
2. Request access to Llama 2 from [Meta's website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
3. Next, request access to Llama 2 on [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) by clicking the "Request access" button on the model page.

<Tip>
If you want to deploy on Baseten, you also need to create a Hugging Face API token and add it to your organization's secrets.
1. [Create a Hugging Face API token](https://huggingface.co/settings/tokens) and copy it to your clipboard.
2. Add the token with the key `hf_access_token` to [your organization's secrets](https://app.baseten.co/settings/secrets) on Baseten.
</Tip>
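Depending on your Truss version, you may also need to declare the secret's name in `config.yaml` so it is mounted into the model server at runtime. A minimal sketch, assuming the `secrets` key accepts a placeholder value (the real token always lives in your Baseten workspace, never in the file):

```yaml config.yaml
# Assumed convention: declare the secret name with a null placeholder;
# the actual value is pulled from your organization's secrets at runtime.
secrets:
  hf_access_token: null
```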
### Step 0: Initialize Truss

Get started by creating a new Truss:

```sh
truss init llama-2-7b-chat
```

Select the `TrussServer` option, then hit `y` to confirm Truss creation. Then navigate to the newly created directory:

```sh
cd llama-2-7b-chat
```
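The scaffold gives you everything referenced in the rest of this example. The exact files vary by Truss version, but you should see at least a `config.yaml` and a `model/model.py`; the layout below is an assumed sketch, not an exhaustive listing:

```
llama-2-7b-chat/
├── config.yaml      # model server configuration (requirements, resources, hf_cache)
└── model/
    └── model.py     # the Model class implemented in Step 1
```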
### Step 1: Implement Llama 2 7B in Truss

Next, we'll fill out the `model.py` file to implement Llama 2 7B in Truss.

In `model/model.py`, we write the class `Model` with three member functions:

* `__init__`, which creates an instance of the object and stores the config, secrets, and placeholders for the model and tokenizer
* `load`, which runs once when the model server is spun up and loads the model and tokenizer
* `predict`, which runs each time the model is invoked and handles the inference. It can use any JSON-serializable type as input and output.

We will also create a helper function `format_prompt` outside of the `Model` class to appropriately format the incoming text according to the Llama 2 specification.

[Read the quickstart guide](/quickstart) for more details on `Model` class implementation.

```python model/model.py
from typing import Dict, List

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self.model = None
        self.tokenizer = None

    def load(self):
        # Runs once at server startup; downloads (or loads cached) weights.
        self.model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"]
        )

    def predict(self, request: Dict) -> Dict[str, List]:
        prompt = request.pop("prompt")
        prompt = format_prompt(prompt)

        # Move inputs to the same device as the model (GPU when available).
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        outputs = self.model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
        response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        return {"response": response}

def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"
```
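To see what the model actually receives, here is what `format_prompt` produces for a sample prompt with the default system prompt; the output shown in the comments follows directly from the constants defined above:

```python
# Example only: inspect the formatted prompt outside of the model server.
print(format_prompt("What is a large language model?"))
# [INST] <<SYS>>
#  You are a helpful, respectful and honest assistant. 
# <</SYS>>
#
#  What is a large language model? [/INST]
```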
### Step 2: Set Python dependencies

Now, we can turn our attention to configuring the model server in `config.yaml`.

In addition to `transformers`, Llama 2 has three other dependencies, listed below:

```yaml config.yaml
requirements:
- accelerate==0.21.0
- safetensors==0.3.2
- torch==2.0.1
- transformers==4.30.2
```

<Note>
Always pin exact versions for your Python dependencies. The ML/AI space moves fast, so you want to have an up-to-date version of each package while also being protected from breaking changes.
</Note>
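If you prototyped the model locally and aren't sure which versions to pin, one quick way to check (assuming you're in the virtual environment you developed in) is to ask pip directly:

```sh
# Print the installed versions of the packages you plan to pin
pip show accelerate safetensors torch transformers | grep -E "Name|Version"
```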
### Step 3: Configure Hugging Face caching

Finally, we can configure Hugging Face caching in `config.yaml` by adding the `hf_cache` key. When building the image for your Llama 2 deployment, the Llama 2 model weights will be downloaded and cached for future use.

```yaml config.yaml
hf_cache:
- repo_id: "meta-llama/Llama-2-7b-chat-hf"
  ignore_patterns:
  - "*.bin"
```

In this configuration:
- `meta-llama/Llama-2-7b-chat-hf` is the `repo_id`, pointing to the exact model to cache.
- We use a wildcard to ignore all `.bin` files in the model directory by providing a pattern under `ignore_patterns`. This is because the model weights are stored in both `.bin` and `.safetensors` formats, and we only want to cache the `.safetensors` files.
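Because `hf_cache` takes a list, you can cache more than one repository in the same image. A sketch, assuming you also wanted to bundle a second repo alongside the Llama weights (the second `repo_id` below is purely hypothetical):

```yaml config.yaml
hf_cache:
- repo_id: "meta-llama/Llama-2-7b-chat-hf"
  ignore_patterns:
  - "*.bin"
# Hypothetical second entry, shown only to illustrate the list structure
- repo_id: "your-org/your-other-model"
```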
### Step 4: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step. Make sure you added your Hugging Face access token to your organization's secrets under the key `hf_access_token`, as described above.
</Note>

We have successfully packaged Llama 2 as a Truss. Let's deploy!

```sh
truss push --trusted
```
### Step 5: Invoke the model

You can invoke the model with:

```sh
truss predict -d '{"prompt": "What is a large language model?"}'
```
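The response mirrors the dictionary returned from `predict`. Note that with this implementation the decoded output also includes the formatted prompt, since `generate` returns the input tokens along with the newly generated ones; the exact text will vary between runs:

```json
{
  "response": "[INST] <<SYS>> You are a helpful, respectful and honest assistant. <</SYS>> What is a large language model? [/INST] A large language model is a type of machine learning model trained on a very large corpus of text..."
}
```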
<RequestExample>

```yaml config.yaml
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: null
python_version: py39
requirements:
- accelerate==0.21.0
- safetensors==0.3.2
- torch==2.0.1
- transformers==4.30.2
hf_cache:
- repo_id: "NousResearch/Llama-2-7b-chat-hf"
  ignore_patterns:
  - "*.bin"
resources:
  cpu: "4"
  memory: 30Gi
  use_gpu: True
  accelerator: A10G
secrets: {}
```

```python model/model.py
from typing import Dict, List

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self.model = None
        self.tokenizer = None

    def load(self):
        self.model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"]
        )

    def predict(self, request: Dict) -> Dict[str, List]:
        prompt = request.pop("prompt")
        prompt = format_prompt(prompt)
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        outputs = self.model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
        response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        return {"response": response}

def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"
```

</RequestExample>
+95 −1
@@ -1,4 +1,98 @@
---
title: Serve models with vLLM
-description: "Description"
+description: "Deploy a language model using vLLM"
---
[vLLM](https://github.com/vllm-project/vllm) is a Python-based package that optimizes the Attention layer in Transformer models. By better allocating memory used during the attention computation, vLLM can reduce the memory footprint of a model and significantly improve inference speed. Truss supports vLLM out of the box, so you can deploy vLLM-optimized models with ease. We're going to walk through deploying a vLLM-optimized [OPT-125M model](https://huggingface.co/facebook/opt-125m).

<Tip>
You can see the config for the finished model on the right. Keep reading for step-by-step instructions on how to generate it.
</Tip>

This example will cover:

1. Generating the base Truss
2. Setting sufficient model resources for inference
3. Deploying the model
### Step 1: Generating the base Truss

Get started by creating a new Truss:

```sh
truss init opt125
```

You're going to see a couple of prompts. Follow along with the instructions below:

1. Type `facebook/opt-125M` when prompted for `model`.
2. Press the `tab` key when prompted for `endpoint`. Select the `Completions` endpoint.
3. Give your model a name like `OPT-125M`.

<Note>
The underlying server that we use is OpenAI compatible. If you plan on using the model as a chat model, then select the `ChatCompletion` endpoint. OPT-125M is not a chat model, so we selected `Completions`.
</Note>

Finally, navigate to the directory:

```sh
cd opt125
```
### Step 2: Setting resources and other arguments

You'll notice that there's a `config.yaml` in the new directory. This is where we'll set the resources and other arguments for the model. Open the file in your favorite editor.

OPT-125M will need a GPU, so let's set the correct resources. Update the `resources` key with the following:

```yaml config.yaml
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
```

Also notice the `build` key (shown below), which contains the `model_server` we're using as well as other arguments. These arguments are passed to the underlying vLLM server, whose entrypoint you can find [here](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py).
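For the selections made in Step 1, the `build` section of the generated `config.yaml` looks like this; it matches the full config shown at the end of this example:

```yaml config.yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
  model_server: VLLM
```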
### Step 3: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step.
</Note>

Let's deploy our OPT-125M vLLM model.

```sh
truss push
```

You can invoke the model with:

```sh
truss predict -d '{"prompt": "What is a large language model?"}'
```
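Once the deployment is live, you can also call it from your own code instead of the Truss CLI. A minimal sketch using Python and `requests`, assuming the conventional Baseten predict endpoint format and placeholders for your model ID and API key:

```python
import requests

# Assumptions: MODEL_ID and BASETEN_API_KEY are placeholders you fill in from
# your Baseten workspace; the URL format below is the conventional Baseten
# predict endpoint and may differ for your deployment.
resp = requests.post(
    "https://model-MODEL_ID.api.baseten.co/production/predict",
    headers={"Authorization": "Api-Key BASETEN_API_KEY"},
    json={"prompt": "What is a large language model?"},
)
print(resp.json())
```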
<RequestExample>

```yaml config.yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
  model_server: VLLM
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: OPT-125M
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
```

</RequestExample>
