
Commit fd9b905

Philip/even more docs (#563)
* tutorial refactor
* CLI reference
* usage
* user guide
* fix up 3 tutorials
* 2 more tutorials
* VLLM tutorial
1 parent 5f0fc96 commit fd9b905

31 files changed, +1382 −383 lines changed

README.md

+1 −1
@@ -8,7 +8,7 @@
## Why Truss?

* **Write once, run anywhere:** Package and test model code, weights, and dependencies with a model server that behaves the same in development and production.
-* **Fast developer loop:** Implement your model with fast feedback from a live reload server, and skip Docker and Kubernetes configuration with Truss' done-for-you model serving environment.
+* **Fast developer loop:** Implement your model with fast feedback from a live reload server, and skip Docker and Kubernetes configuration with a batteries-included model serving environment.
* **Support for all Python frameworks**: From `transformers` and `diffusers` to `PyTorch` and `Tensorflow` to `XGBoost` and `sklearn`, Truss supports models created with any framework, even entirely custom models.

See Trusses for popular models including:

docs/examples/models/overview.mdx

+8 −8
@@ -1,20 +1,20 @@
---
-title: Example models
-description: "Description"
+title: Example foundation models
+description: "Step-by-step packaging instructions"
---

<CardGroup cols={3}>
<Card title="Llama-2" icon="horse" href="/examples/models/llama-2">
-Lorem
+A commercially-licensed LLM by Meta
</Card>
<Card title="Stable Diffusion XL" icon="palette" href="/examples/models/sdxl">
-Lorem
+A text to image model by Stability AI
</Card>
<Card title="Whisper" icon="ear-listen" href="/examples/models/whisper">
-Lorem
+An audio transcription model by OpenAI
</Card>
</CardGroup>

-<Card title="More" icon="ear-listen" href="#">
-Lorem
-</Card>
+<Card title="More examples on GitHub" icon="github" href="https://github.com/basetenlabs/truss-examples">
+See Trusses for dozens of models on GitHub.
+</Card>
+219 −1
@@ -1,4 +1,222 @@
---
title: Load cached model weights
-description: "Description"
+description: "Deploy a model with private Hugging Face weights"
---
In this example, we will cover how you can use the `hf_cache` key in your Truss's `config.yaml` to automatically bundle model weights from a private Hugging Face repo.

<Tip>
Bundling model weights can significantly reduce cold start times because your instance won't waste time downloading the model weights from Hugging Face's servers.
</Tip>

We use `Llama-2-7b`, a popular open-source large language model, as an example. To follow along, you need to request access to Llama 2:

1. First, [sign up for a Hugging Face account](https://huggingface.co/join) if you don't already have one.
2. Request access to Llama 2 from [Meta's website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
3. Next, request access to Llama 2 on [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) by clicking the "Request access" button on the model page.

<Tip>
If you want to deploy on Baseten, you also need to create a Hugging Face API token and add it to your organization's secrets.
1. [Create a Hugging Face API token](https://huggingface.co/settings/tokens) and copy it to your clipboard.
2. Add the token with the key `hf_access_token` to [your organization's secrets](https://app.baseten.co/settings/secrets) on Baseten.
</Tip>
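Depending on your Truss version, you may also need to declare the secret's name in `config.yaml` so it is mounted into the model server at runtime. A minimal sketch, assuming the `secrets` key accepts a placeholder value (the real token always lives in your Baseten workspace, never in the file):

```yaml config.yaml
# Assumed convention: declare the secret name with a null placeholder;
# the actual value is pulled from your organization's secrets at runtime.
secrets:
  hf_access_token: null
```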
### Step 0: Initialize Truss

Get started by creating a new Truss:

```sh
truss init llama-2-7b-chat
```

Select the `TrussServer` option, then hit `y` to confirm Truss creation. Then navigate to the newly created directory:

```sh
cd llama-2-7b-chat
```
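The scaffold gives you everything referenced in the rest of this example. The exact files vary by Truss version, but you should see at least a `config.yaml` and a `model/model.py`; the layout below is an assumed sketch, not an exhaustive listing:

```
llama-2-7b-chat/
├── config.yaml      # model server configuration (requirements, resources, hf_cache)
└── model/
    └── model.py     # the Model class implemented in Step 1
```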
### Step 1: Implement Llama 2 7B in Truss

Next, we'll fill out the `model.py` file to implement Llama 2 7B in Truss.

In `model/model.py`, we write the class `Model` with three member functions:

* `__init__`, which creates an instance of the object and stores the config, secrets, and placeholders for the model and tokenizer
* `load`, which runs once when the model server is spun up and loads the model and tokenizer
* `predict`, which runs each time the model is invoked and handles the inference. It can use any JSON-serializable type as input and output.

We will also create a helper function `format_prompt` outside of the `Model` class to appropriately format the incoming text according to the Llama 2 specification.

[Read the quickstart guide](/quickstart) for more details on `Model` class implementation.

```python model/model.py
from typing import Dict, List

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self.model = None
        self.tokenizer = None

    def load(self):
        # Runs once at server startup; downloads (or loads cached) weights.
        self.model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"]
        )

    def predict(self, request: Dict) -> Dict[str, List]:
        prompt = request.pop("prompt")
        prompt = format_prompt(prompt)

        # Move inputs to the same device as the model (GPU when available).
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        outputs = self.model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
        response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        return {"response": response}

def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"
```
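To see what the model actually receives, here is what `format_prompt` produces for a sample prompt with the default system prompt; the output shown in the comments follows directly from the constants defined above:

```python
# Example only: inspect the formatted prompt outside of the model server.
print(format_prompt("What is a large language model?"))
# [INST] <<SYS>>
#  You are a helpful, respectful and honest assistant. 
# <</SYS>>
#
#  What is a large language model? [/INST]
```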
### Step 2: Set Python dependencies

Now, we can turn our attention to configuring the model server in `config.yaml`.

In addition to `transformers`, Llama 2 has three other dependencies, listed below:

```yaml config.yaml
requirements:
- accelerate==0.21.0
- safetensors==0.3.2
- torch==2.0.1
- transformers==4.30.2
```

<Note>
Always pin exact versions for your Python dependencies. The ML/AI space moves fast, so you want to have an up-to-date version of each package while also being protected from breaking changes.
</Note>
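If you prototyped the model locally and aren't sure which versions to pin, one quick way to check (assuming you're in the virtual environment you developed in) is to ask pip directly:

```sh
# Print the installed versions of the packages you plan to pin
pip show accelerate safetensors torch transformers | grep -E "Name|Version"
```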
### Step 3: Configure Hugging Face caching

Finally, we can configure Hugging Face caching in `config.yaml` by adding the `hf_cache` key. When building the image for your Llama 2 deployment, the Llama 2 model weights will be downloaded and cached for future use.

```yaml config.yaml
hf_cache:
- repo_id: "meta-llama/Llama-2-7b-chat-hf"
  ignore_patterns:
  - "*.bin"
```

In this configuration:
- `meta-llama/Llama-2-7b-chat-hf` is the `repo_id`, pointing to the exact model to cache.
- We use a wildcard to ignore all `.bin` files in the model directory by providing a pattern under `ignore_patterns`. This is because the model weights are stored in both `.bin` and `.safetensors` formats, and we only want to cache the `.safetensors` files.
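Because `hf_cache` takes a list, you can cache more than one repository in the same image. A sketch, assuming you also wanted to bundle a second repo alongside the Llama weights (the second `repo_id` below is purely hypothetical):

```yaml config.yaml
hf_cache:
- repo_id: "meta-llama/Llama-2-7b-chat-hf"
  ignore_patterns:
  - "*.bin"
# Hypothetical second entry, shown only to illustrate the list structure
- repo_id: "your-org/your-other-model"
```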
### Step 4: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step. Make sure you added your Hugging Face access token to your organization's secrets under the key `hf_access_token`, as described above.
</Note>

We have successfully packaged Llama 2 as a Truss. Let's deploy!

```sh
truss push --trusted
```
### Step 5: Invoke the model

You can invoke the model with:

```sh
truss predict -d '{"prompt": "What is a large language model?"}'
```
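The response mirrors the dictionary returned from `predict`. Note that with this implementation the decoded output also includes the formatted prompt, since `generate` returns the input tokens along with the newly generated ones; the exact text will vary between runs:

```json
{
  "response": "[INST] <<SYS>> You are a helpful, respectful and honest assistant. <</SYS>> What is a large language model? [/INST] A large language model is a type of machine learning model trained on a very large corpus of text..."
}
```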
<RequestExample>

```yaml config.yaml
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: null
python_version: py39
requirements:
- accelerate==0.21.0
- safetensors==0.3.2
- torch==2.0.1
- transformers==4.30.2
hf_cache:
- repo_id: "NousResearch/Llama-2-7b-chat-hf"
  ignore_patterns:
  - "*.bin"
resources:
  cpu: "4"
  memory: 30Gi
  use_gpu: True
  accelerator: A10G
secrets: {}
```

```python model/model.py
from typing import Dict, List

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self.model = None
        self.tokenizer = None

    def load(self):
        self.model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"]
        )

    def predict(self, request: Dict) -> Dict[str, List]:
        prompt = request.pop("prompt")
        prompt = format_prompt(prompt)
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        outputs = self.model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
        response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        return {"response": response}

def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"
```

</RequestExample>
+95 −1
@@ -1,4 +1,98 @@
---
title: Serve models with vLLM
-description: "Description"
+description: "Deploy a language model using vLLM"
---
[vLLM](https://github.com/vllm-project/vllm) is a Python-based package that optimizes the Attention layer in Transformer models. By better allocating memory used during the attention computation, vLLM can reduce the memory footprint of a model and significantly improve inference speed. Truss supports vLLM out of the box, so you can deploy vLLM-optimized models with ease. We're going to walk through deploying a vLLM-optimized [OPT-125M model](https://huggingface.co/facebook/opt-125m).

<Tip>
You can see the config for the finished model on the right. Keep reading for step-by-step instructions on how to generate it.
</Tip>

This example will cover:

1. Generating the base Truss
2. Setting sufficient model resources for inference
3. Deploying the model
### Step 1: Generating the base Truss

Get started by creating a new Truss:

```sh
truss init opt125
```

You're going to see a couple of prompts. Follow along with the instructions below:

1. Type `facebook/opt-125M` when prompted for `model`.
2. Press the `tab` key when prompted for `endpoint`. Select the `Completions` endpoint.
3. Give your model a name like `OPT-125M`.

<Note>
The underlying server that we use is OpenAI compatible. If you plan on using the model as a chat model, then select the `ChatCompletion` endpoint. OPT-125M is not a chat model, so we selected `Completions`.
</Note>

Finally, navigate to the directory:

```sh
cd opt125
```
### Step 2: Setting resources and other arguments

You'll notice that there's a `config.yaml` in the new directory. This is where we'll set the resources and other arguments for the model. Open the file in your favorite editor.

OPT-125M will need a GPU, so let's set the correct resources. Update the `resources` key with the following:

```yaml config.yaml
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
```

Also notice the `build` key (shown below), which contains the `model_server` we're using as well as other arguments. These arguments are passed to the underlying vLLM server, whose entrypoint you can find [here](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py).
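For the selections made in Step 1, the `build` section of the generated `config.yaml` looks like this; it matches the full config shown at the end of this example:

```yaml config.yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
  model_server: VLLM
```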
### Step 3: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step.
</Note>

Let's deploy our OPT-125M vLLM model.

```sh
truss push
```

You can invoke the model with:

```sh
truss predict -d '{"prompt": "What is a large language model?"}'
```
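Once the deployment is live, you can also call it from your own code instead of the Truss CLI. A minimal sketch using Python and `requests`, assuming the conventional Baseten predict endpoint format and placeholders for your model ID and API key:

```python
import requests

# Assumptions: MODEL_ID and BASETEN_API_KEY are placeholders you fill in from
# your Baseten workspace; the URL format below is the conventional Baseten
# predict endpoint and may differ for your deployment.
resp = requests.post(
    "https://model-MODEL_ID.api.baseten.co/production/predict",
    headers={"Authorization": "Api-Key BASETEN_API_KEY"},
    json={"prompt": "What is a large language model?"},
)
print(resp.json())
```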
<RequestExample>

```yaml config.yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
  model_server: VLLM
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: OPT-125M
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
```

</RequestExample>
