
Commit 3db1b78

Merge pull request #575 from basetenlabs/main
0.6.1 release
2 parents 24cb02e + 96b37eb commit 3db1b78

34 files changed: +1,483 −388 lines changed

README.md (+1 −1)

@@ -8,7 +8,7 @@

  ## Why Truss?

  * **Write once, run anywhere:** Package and test model code, weights, and dependencies with a model server that behaves the same in development and production.
- * **Fast developer loop:** Implement your model with fast feedback from a live reload server, and skip Docker and Kubernetes configuration with Truss' done-for-you model serving environment.
+ * **Fast developer loop:** Implement your model with fast feedback from a live reload server, and skip Docker and Kubernetes configuration with a batteries-included model serving environment.
  * **Support for all Python frameworks**: From `transformers` and `diffusors` to `PyTorch` and `Tensorflow` to `XGBoost` and `sklearn`, Truss supports models created with any framework, even entirely custom models.

  See Trusses for popular models including:

docs/examples/models/overview.mdx (+8 −8)

@@ -1,20 +1,20 @@

  ---
- title: Example models
- description: "Description"
+ title: Example foundation models
+ description: "Step-by-step packaging instructions"
  ---

  <CardGroup cols={3}>
  <Card title="Llama-2" icon="horse" href="/examples/models/llama-2">
-   Lorem
+   A commercially-licensed LLM by Meta
  </Card>
  <Card title="Stable Diffusion XL" icon="palette" href="/examples/models/sdxl">
-   Lorem
+   A text to image model by Stability AI
  </Card>
  <Card title="Whisper" icon="ear-listen" href="/examples/models/whisper">
-   Lorem
+   An audio transcription model by OpenAI
  </Card>
  </CardGroup>

- <Card title="More" icon="ear-listen" href="#">
-   Lorem
- </Card>
+ <Card title="More examples on GitHub" icon="github" href="https://github.com/basetenlabs/truss-examples">
+   See Trusses for dozens of models on GitHub.
+ </Card>
+219 −1

---
title: Load cached model weights
description: "Deploy a model with private Hugging Face weights"
---
In this example, we cover how to use the `hf_cache` key in your Truss's `config.yaml` to automatically bundle model weights from a private Hugging Face repo.

<Tip>
Bundling model weights can significantly reduce cold start times because your instance won't spend time downloading the model weights from Hugging Face's servers.
</Tip>

We use `Llama-2-7b`, a popular open-source large language model, as an example. To follow along, you need to request access to Llama 2:

1. [Sign up for a Hugging Face account](https://huggingface.co/join) if you don't already have one.
2. Request access to Llama 2 from [Meta's website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
3. Request access to Llama 2 on [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) by clicking the "Request access" button on the model page.

<Tip>
If you want to deploy on Baseten, you also need to create a Hugging Face API token and add it to your organization's secrets.

1. [Create a Hugging Face API token](https://huggingface.co/settings/tokens) and copy it to your clipboard.
2. Add the token with the key `hf_access_token` to [your organization's secrets](https://app.baseten.co/settings/secrets) on Baseten.
</Tip>
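
Depending on your Truss version, you may also need to declare the secret name in `config.yaml` so it is available to the model at runtime. This is a hedged sketch, not part of this example's final config:

```yaml config.yaml
secrets:
  hf_access_token: null  # placeholder only; the real value lives in Baseten's secret store
```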

### Step 0: Initialize Truss

Get started by creating a new Truss:

```sh
truss init llama-2-7b-chat
```

Select the `TrussServer` option and hit `y` to confirm Truss creation, then navigate to the newly created directory:

```sh
cd llama-2-7b-chat
```

### Step 1: Implement Llama 2 7B in Truss

Next, we'll fill out the `model.py` file to implement Llama 2 7B in Truss.

In `model/model.py`, we write the class `Model` with three member functions:

* `__init__`, which creates an instance of the object and stores the data directory, config, and secrets that Truss passes in
* `load`, which runs once when the model server is spun up and loads the model and tokenizer
* `predict`, which runs each time the model is invoked and handles the inference. It can use any JSON-serializable type as input and output.

We also create a helper function `format_prompt` outside of the `Model` class to format the incoming text according to the Llama 2 prompt specification.

[Read the quickstart guide](/quickstart) for more details on `Model` class implementation.

```python model/model.py
from typing import Dict, List

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self.model = None
        self.tokenizer = None

    def load(self):
        self.model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"]
        )

    def predict(self, request: Dict) -> Dict[str, List]:
        prompt = request.pop("prompt")
        prompt = format_prompt(prompt)

        inputs = self.tokenizer(prompt, return_tensors="pt")

        outputs = self.model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
        response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        return {"response": response}


def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"
```

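To see how these methods fit together, here is a rough sketch of the lifecycle the model server follows: construct `Model` with its context, call `load` once at startup, then call `predict` per request. This is for illustration only; the token value and the local GPU requirement are assumptions, and Truss handles all of this for you in production.

```python
# Hypothetical local walkthrough of the Model lifecycle (requires a GPU,
# the pinned dependencies below, and a valid Hugging Face token).
from model.model import Model

model = Model(
    data_dir="data",                                      # files bundled with the Truss
    config={},                                            # parsed config.yaml
    secrets={"hf_access_token": "YOUR_HF_ACCESS_TOKEN"},  # placeholder token
)
model.load()  # runs once when the server spins up
print(model.predict({"prompt": "What is a large language model?"}))  # runs per request
```
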
### Step 2: Set Python dependencies

Now, we can turn our attention to configuring the model server in `config.yaml`.

In addition to `transformers`, Llama 2 needs three more Python packages. We pin all four in the `requirements` key:

```yaml config.yaml
requirements:
- accelerate==0.21.0
- safetensors==0.3.2
- torch==2.0.1
- transformers==4.30.2
```

<Note>
Always pin exact versions for your Python dependencies. The ML/AI space moves fast, so pinning keeps each package up to date while protecting you from breaking changes.
</Note>

### Step 3: Configure Hugging Face caching

Finally, we can configure Hugging Face caching in `config.yaml` by adding the `hf_cache` key. When the image for your Llama 2 deployment is built, the model weights will be downloaded and cached for future use.

```yaml config.yaml
hf_cache:
- repo_id: "meta-llama/Llama-2-7b-chat-hf"
  ignore_patterns:
  - "*.bin"
```

In this configuration:

- `repo_id` points to `meta-llama/Llama-2-7b-chat-hf`, the exact model to cache.
- `ignore_patterns` uses a wildcard to skip all `.bin` files in the repo. The weights are published in both `.bin` and `.safetensors` format, and we only want to cache the `.safetensors` files.

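If you're not sure which weight formats a repository ships, you can list its files before choosing `ignore_patterns`. This is a minimal sketch, assuming `huggingface_hub` is installed locally and you substitute your own access token:

```python
# Sketch: inspect a gated repo's files to decide what to cache vs. ignore.
from huggingface_hub import list_repo_files

files = list_repo_files(
    "meta-llama/Llama-2-7b-chat-hf",
    token="YOUR_HF_ACCESS_TOKEN",  # placeholder
)
print([f for f in files if f.endswith(".bin")])          # matched by "*.bin", not cached
print([f for f in files if f.endswith(".safetensors")])  # cached at build time
```
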
### Step 4: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step. Make sure you added your Hugging Face token to your organization's secrets as `hf_access_token`.
</Note>

We have successfully packaged Llama 2 as a Truss. Let's deploy!

```sh
truss push --trusted
```

The `--trusted` flag gives the deployed model access to secrets, including `hf_access_token`.

### Step 5: Invoke the model

You can invoke the model with:

```sh
truss predict -d '{"prompt": "What is a large language model?"}'
```
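
Before the prompt reaches the model, `predict` wraps it with the Llama 2 chat template via the `format_prompt` helper defined above. A quick local sketch of the string the model actually sees:

```python
# Reproduces the template applied by format_prompt in model/model.py above.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
system_prompt = "You are a helpful, respectful and honest assistant."
prompt = "What is a large language model?"

print(f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}")
# [INST] <<SYS>>
#  You are a helpful, respectful and honest assistant.
# <</SYS>>
#
#  What is a large language model? [/INST]
```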

<RequestExample>

```yaml config.yaml
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: null
python_version: py39
requirements:
- accelerate==0.21.0
- safetensors==0.3.2
- torch==2.0.1
- transformers==4.30.2
hf_cache:
- repo_id: "NousResearch/Llama-2-7b-chat-hf"
  ignore_patterns:
  - "*.bin"
resources:
  cpu: "4"
  memory: 30Gi
  use_gpu: True
  accelerator: A10G
secrets: {}
```

```python model/model.py
from typing import Dict, List

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self.model = None
        self.tokenizer = None

    def load(self):
        self.model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"]
        )

    def predict(self, request: Dict) -> Dict[str, List]:
        prompt = request.pop("prompt")
        prompt = format_prompt(prompt)

        inputs = self.tokenizer(prompt, return_tensors="pt")

        outputs = self.model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
        response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        return {"response": response}


def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"
```

</RequestExample>
+92 −2

---
title: Serve LLM models with TGI
description: "Deploy a language model using TGI"
---

[TGI](https://github.com/huggingface/text-generation-inference/tree/main) is a model server optimized for language models.

<Tip>
You can see the config for the finished model on the right. Keep reading for step-by-step instructions on how to generate it.
</Tip>

This example will cover:

1. Generating the base Truss
2. Setting sufficient model resources for inference
3. Deploying the model

### Step 1: Generating the base Truss

Get started by creating a new Truss:

```sh
truss init --backend TGI opt125
```

You're going to see a couple of prompts. Follow along with the instructions below:

1. Type `facebook/opt-125M` when prompted for `model`.
2. Press the `tab` key when prompted for `endpoint`. Select the `generate_stream` endpoint.
3. Give your model a name like `OPT-125M`.

Finally, navigate to the directory:

```sh
cd opt125
```

### Step 2: Setting resources and other arguments

You'll notice that there's a `config.yaml` in the new directory. This is where we'll set the resources and other arguments for the model. Open the file in your favorite editor.

OPT-125M will need a GPU, so let's set the correct resources. Update the `resources` key with the following:

```yaml config.yaml
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
```

Also notice the `build` key, which contains the `model_server` we're using as well as other arguments. These arguments are passed to the underlying [TGI server](https://github.com/huggingface/text-generation-inference), as shown in the snippet below.

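For reference, the `build` key that `truss init` generated for this example looks like this (it also appears in the full `config.yaml` on the right):

```yaml config.yaml
build:
  arguments:
    endpoint: generate_stream
    model: facebook/opt-125M
  model_server: TGI
```
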
### Step 3: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step.
</Note>

Let's deploy our OPT-125M TGI model.

```sh
truss push
```

You can invoke the model with:

```sh
truss predict -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128}}'
```

<RequestExample>

```yaml config.yaml
build:
  arguments:
    endpoint: generate_stream
    model: facebook/opt-125M
  model_server: TGI
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: OPT-125M
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
```

</RequestExample>
