---
title: Load cached model weights
description: "Deploy a model with private Hugging Face weights"
---

In this example, we cover how to use the `hf_cache` key in your Truss's `config.yaml` to automatically bundle model weights from a private Hugging Face repository.

<Tip>
Bundling model weights can significantly reduce cold start times because your instance won't waste time downloading the model weights from Hugging Face's servers.
</Tip>

We use `Llama-2-7b-chat`, a popular open-source large language model, as the example. To follow along, you need to request access to Llama 2:

1. [Sign up for a Hugging Face account](https://huggingface.co/join) if you don't already have one.
2. Request access to Llama 2 from [Meta's website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
3. Request access to Llama 2 on [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) by clicking the "Request access" button on the model page.

<Tip>
If you want to deploy on Baseten, you also need to create a Hugging Face API token and add it to your organization's secrets:
1. [Create a Hugging Face API token](https://huggingface.co/settings/tokens) and copy it to your clipboard.
2. Add the token with the key `hf_access_token` to [your organization's secrets](https://app.baseten.co/settings/secrets) on Baseten.
</Tip>
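
If you want to sanity-check the token before building anything, a minimal local snippet (assuming `huggingface_hub`, which ships as a dependency of `transformers`) looks like this; the token value is a placeholder:

```python
from huggingface_hub import whoami

HF_TOKEN = "hf_..."  # placeholder: paste your Hugging Face access token

# Prints your account info if the token is valid; raises an error otherwise.
print(whoami(token=HF_TOKEN))
```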

### Step 0: Initialize Truss

Get started by creating a new Truss:

```sh
truss init llama-2-7b-chat
```

Select the `TrussServer` option, then hit `y` to confirm Truss creation. Then navigate to the newly created directory:

```sh
cd llama-2-7b-chat
```

### Step 1: Implement Llama 2 7B in Truss

Next, we'll fill out the `model.py` file to implement Llama 2 7B in Truss.

In `model/model.py`, we write the class `Model` with three member functions:

* `__init__`, which creates an instance of the object and stores the `data_dir`, `config`, and `secrets` passed in by Truss
* `load`, which runs once when the model server is spun up and loads the model and tokenizer
* `predict`, which runs each time the model is invoked and handles the inference. It can use any JSON-serializable type as input and output.

We also create a helper function `format_prompt` outside of the `Model` class to format the incoming text according to the Llama 2 prompt specification.

[Read the quickstart guide](/quickstart) for more details on `Model` class implementation.

```python model/model.py
from typing import Dict, List

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self.model = None
        self.tokenizer = None

    def load(self):
        # Load the gated model and tokenizer, authenticating with the
        # Hugging Face access token provided via Truss secrets.
        self.model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"]
        )

    def predict(self, request: Dict) -> Dict[str, List]:
        prompt = request.pop("prompt")
        prompt = format_prompt(prompt)

        # Tokenize the formatted prompt and move it to the model's device.
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        outputs = self.model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
        response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        return {"response": response}

def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"
```
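
The prompt template is easy to verify locally without a GPU. This is a standalone sketch (not part of the Truss) that copies the constants above so it runs on its own:

```python
DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"

print(format_prompt("What is a large language model?"))
# Output:
# [INST] <<SYS>>
#  You are a helpful, respectful and honest assistant.
# <</SYS>>
#
#  What is a large language model? [/INST]
```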

### Step 2: Set Python dependencies

Now, we can turn our attention to configuring the model server in `config.yaml`.

In addition to `transformers`, Llama 2 has three other Python dependencies:

```yaml config.yaml
requirements:
- accelerate==0.21.0
- safetensors==0.3.2
- torch==2.0.1
- transformers==4.30.2
```

<Note>
Always pin exact versions for your Python dependencies. The ML/AI space moves fast, so pinning keeps you on a known-good version of each package while protecting you from breaking changes.
</Note>

### Step 3: Configure Hugging Face caching

Finally, we can configure Hugging Face caching in `config.yaml` by adding the `hf_cache` key. When the image for your Llama 2 deployment is built, the model weights will be downloaded and cached for future use.

```yaml config.yaml
hf_cache:
- repo_id: "meta-llama/Llama-2-7b-chat-hf"
  ignore_patterns:
  - "*.bin"
```

In this configuration:
- `meta-llama/Llama-2-7b-chat-hf` is the `repo_id`, pointing to the exact model to cache.
- The wildcard pattern under `ignore_patterns` skips all `.bin` files in the repo. The weights are published in both `.bin` and `.safetensors` formats, and we only want to cache the `.safetensors` files. A quick way to preview which files the pattern matches is shown below.
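
If you want to see which files the pattern excludes, you can list the repo contents locally. This is an optional check outside the Truss; it assumes `huggingface_hub` is installed, that your token has been granted access to the gated repo, and that ignore patterns behave like shell-style globs:

```python
from fnmatch import fnmatch
from huggingface_hub import list_repo_files

HF_TOKEN = "hf_..."  # placeholder: your Hugging Face access token

files = list_repo_files("meta-llama/Llama-2-7b-chat-hf", token=HF_TOKEN)
ignored = [f for f in files if fnmatch(f, "*.bin")]
cached = [f for f in files if not fnmatch(f, "*.bin")]

print("Ignored at build time:", ignored)
print("Cached in the image:", cached)
```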

### Step 4: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step. Also make sure you added your Hugging Face access token to your organization's secrets under the key `hf_access_token`, as described above.
</Note>

We have successfully packaged Llama 2 as a Truss. Let's deploy!

```sh
truss push --trusted
```

### Step 5: Invoke the model

You can invoke the model with:

```sh
truss predict -d '{"prompt": "What is a large language model?"}'
```
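
Once the deployment is live, you can also call it over HTTP from an application. The sketch below assumes the standard Baseten model inference endpoint and API key header; the model ID and key are placeholders you'd replace with your own values.

```python
import requests

MODEL_ID = "YOUR_MODEL_ID"        # placeholder: find this on your Baseten model page
BASETEN_API_KEY = "YOUR_API_KEY"  # placeholder: your Baseten API key

# Send a prompt to the deployed model and print the JSON response.
resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {BASETEN_API_KEY}"},
    json={"prompt": "What is a large language model?"},
)
print(resp.json())
```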

<RequestExample>

```yaml config.yaml
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: null
python_version: py39
requirements:
- accelerate==0.21.0
- safetensors==0.3.2
- torch==2.0.1
- transformers==4.30.2
hf_cache:
- repo_id: "meta-llama/Llama-2-7b-chat-hf"
  ignore_patterns:
  - "*.bin"
resources:
  cpu: "4"
  memory: 30Gi
  use_gpu: True
  accelerator: A10G
secrets:
  hf_access_token: null
```

```python model/model.py
from typing import Dict, List

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self.model = None
        self.tokenizer = None

    def load(self):
        # Load the gated model and tokenizer, authenticating with the
        # Hugging Face access token provided via Truss secrets.
        self.model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"]
        )

    def predict(self, request: Dict) -> Dict[str, List]:
        prompt = request.pop("prompt")
        prompt = format_prompt(prompt)

        # Tokenize the formatted prompt and move it to the model's device.
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        outputs = self.model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
        response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        return {"response": response}

def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"
```

</RequestExample>