
Commit 6a65458

Merge pull request #696 from basetenlabs/bump-version-0.7.11
Release 0.7.11
2 parents b906180 + 3aa8c77 commit 6a65458

File tree

12 files changed: +355 -40 lines

docs/examples/performance/cached-weights.mdx (+2 -2)

```diff
@@ -1,6 +1,6 @@
 ---
-title: Load cached model weights
-description: "Deploy a model with private Hugging Face weights"
+title: Deploy Llama 2 with Caching
+description: "Enable fast cold starts for a model with private Hugging Face weights"
 ---
 
 In this example, we will cover how you can use the `hf_cache` key in your Truss's `config.yml` to automatically bundle model weights from a private Hugging Face repo.
```

docs/guides/model-cache.mdx (+119, new file)

---
title: Caching model weights
description: "Accelerate cold starts by caching your weights"
---

Truss natively supports automatic caching of model weights. This is a simple yet effective strategy for improving deployment speed and operational efficiency when it comes to cold starts and scaling beyond a single replica.

<Tip>
### What is a "cold start"?

"Cold start" refers to the time it takes a model to boot up after being idle. This can be a critical factor in serverless environments, as it significantly influences model response time, customer satisfaction, and cost.

Without caching, we would need to download the model's weights every time we scale up. Caching model weights circumvents this download: when a new instance boots up, the server automatically finds the cached weights and can proceed straight to starting the endpoint.

In practice, this reduces the cold start for large models to just a few seconds. For example, Stable Diffusion XL can take a few minutes to boot up without caching. With caching, it takes just under 10 seconds.
</Tip>

## Enabling Caching for a Model

To enable caching, add `hf_cache` to your `config.yml` with a valid `repo_id`. The `hf_cache` key has a few configurations:
- `repo_id` (required): The endpoint for your cloud bucket. Currently, we support Hugging Face and Google Cloud Storage.
- `revision`: The revision to pull, such as a branch name or commit hash. By default, it refers to `main`.
- `allow_patterns`: Only cache files that match the specified patterns. Use Unix shell-style wildcards to denote these patterns.
- `ignore_patterns`: Conversely, you can also denote file patterns to ignore, streamlining the caching process.

Here is an example of a well-written `hf_cache` for Stable Diffusion XL. Note how it only pulls the model weights it needs using `allow_patterns`.

```yaml config.yml
...
hf_cache:
  - repo_id: madebyollin/sdxl-vae-fp16-fix
    allow_patterns:
      - config.json
      - diffusion_pytorch_model.safetensors
  - repo_id: stabilityai/stable-diffusion-xl-base-1.0
    allow_patterns:
      - "*.json"
      - "*.fp16.safetensors"
      - sd_xl_base_1.0.safetensors
  - repo_id: stabilityai/stable-diffusion-xl-refiner-1.0
    allow_patterns:
      - "*.json"
      - "*.fp16.safetensors"
      - sd_xl_refiner_1.0.safetensors
...
```

Many Hugging Face repos have model weights in several formats (`.bin`, `.safetensors`, `.h5`, `.msgpack`, etc.). You usually only need one of these. To minimize cold starts, ensure that you only cache the weights you need.
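
For instance, if a repo ships the same weights in multiple formats, a minimal sketch like the following (the repo name is hypothetical) uses `ignore_patterns` to skip everything except the `safetensors` copy:

```yaml config.yml
...
hf_cache:
  # Hypothetical repo that publishes .bin, .h5, and .msgpack copies alongside .safetensors
  - repo_id: your-org/your-model
    ignore_patterns:
      - "*.bin"
      - "*.h5"
      - "*.msgpack"
...
```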

There are also some additional steps depending on the cloud bucket you want to query.

### Hugging Face 🤗
For any public Hugging Face repo, you don't need to do anything else. Adding the `hf_cache` key with an appropriate `repo_id` should be enough.

However, if you want to deploy a model from a gated repo like [Llama 2](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) to Baseten, there are a few steps you need to take:
<Steps>
<Step title="Get Hugging Face API Key">
[Grab an API key](https://huggingface.co/settings/tokens) from Hugging Face with `read` access. Make sure you have access to the model you want to serve.
</Step>
<Step title="Add it to Baseten Secrets Manager">
Paste your API key in your [secrets manager in Baseten](https://app.baseten.co/settings/secrets) under the key `hf_access_token`. You can read more about secrets [here](https://truss.baseten.co/guides/secrets).
</Step>
<Step title="Update Config">
In your Truss's `config.yml`, add the following code:

```yaml config.yml
...
secrets:
  hf_access_token: null
...
```

Make sure that the key `secrets` only shows up once in your `config.yml`.
</Step>
</Steps>

If you run into any issues, run through all the steps above again and make sure you did not misspell the name of the repo or paste an incorrect API key.

<Tip>
Weights will be cached in the default Hugging Face cache directory, `~/.cache/huggingface/hub/models--{your_model_name}/`. You can change this directory by setting the `HF_HOME` or `HUGGINGFACE_HUB_CACHE` environment variable in your `config.yml`.

[Read more here](https://huggingface.co/docs/huggingface_hub/guides/manage-cache).
</Tip>
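
As a minimal sketch of the redirect described in the tip above (assuming your Truss config supports an `environment_variables` mapping, which is not shown elsewhere in this guide), you could point the Hugging Face cache at a different directory like this:

```yaml config.yml
...
# Assumption: `environment_variables` is available in your Truss config.
environment_variables:
  HF_HOME: /app/hf_home  # hypothetical alternative cache location
...
```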

### Google Cloud Storage
Google Cloud Storage is a great alternative to Hugging Face when you have a custom model or fine-tune you want to gate, especially if you are already using GCP and care about security and compliance.

Your `hf_cache` should look something like this:

```yaml config.yml
...
hf_cache:
  - repo_id: gcs://path-to-my-bucket
...
```

For a private GCS bucket, first export your service account key. Rename it to `service_account.json` and add it to the `data` directory of your Truss.
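
If you haven't exported a key before, one way to do it is with the `gcloud` CLI; this is a sketch, and the service account email below is a placeholder you'd replace with your own:

```sh
# Creates a new JSON key for the given service account and writes it to service_account.json
gcloud iam service-accounts keys create service_account.json \
  --iam-account=my-model-reader@my-project.iam.gserviceaccount.com
```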

Your file structure should look something like this:

```
your-truss
|--model
|  └── model.py
|--data
|  └── service_account.json
```

<Warning>
If you are using version control, like git, for your Truss, make sure to add `service_account.json` to your `.gitignore` file. You don't want to accidentally expose your service account key.
</Warning>

Weights will be cached at `/app/hf_cache/{your_bucket_name}`.
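
Since the weights now live at a local path inside the container, your `model/model.py` can load them directly from that directory. A minimal sketch, assuming the bucket from the example above contains weights in standard Hugging Face format (the `transformers` usage and input shape here are illustrative, not part of this guide):

```python model/model.py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path follows the caching convention described above: /app/hf_cache/{your_bucket_name}
WEIGHTS_DIR = "/app/hf_cache/path-to-my-bucket"


class Model:
    def __init__(self, **kwargs):
        self._model = None
        self._tokenizer = None

    def load(self):
        # Read weights from the directory Truss populated at build time,
        # so nothing is downloaded on cold start.
        self._tokenizer = AutoTokenizer.from_pretrained(WEIGHTS_DIR)
        self._model = AutoModelForCausalLM.from_pretrained(WEIGHTS_DIR)

    def predict(self, model_input):
        inputs = self._tokenizer(model_input["prompt"], return_tensors="pt")
        output_ids = self._model.generate(**inputs, max_new_tokens=64)
        return {"output": self._tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```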

### Other Buckets

We're currently working on adding support for more bucket types, including AWS S3. If you have any suggestions, please [leave an issue](https://github.com/basetenlabs/truss/issues) on our GitHub repo.

docs/guides/tgi.mdx (+139, new file)

---
title: "High performance model serving with TGI"
description: "A guide to using TGI for your model"
---

[TGI](https://huggingface.co/text-generation-inference), or Text Generation Inference, is a high-performance model server for language models built by Hugging Face. In this doc, we'll cover when to use TGI, how to tune TGI for performance, and how to deploy it to Baseten.

# What is TGI?

TGI consists of 2 parts:

1. A high-performance, Rust-based server
2. A set of optimized model implementations that outperform generic implementations

To optimize GPU utilization and improve throughput & latency, it's common to implement optimizations at both the server level and the model level.

It's worth noting that TGI can only be used for a certain subset of models. As of September 2023, this includes:

- Mistral 7B
- Llama V2
- Llama
- MPT
- Code Llama
- Falcon 40B
- Falcon 7B
- FLAN-T5
- BLOOM
- Galactica
- GPT-Neox
- OPT
- SantaCoder
- Starcoder

You should use TGI if you care about model latency / throughput and want to be able to stream results. It's worth noting that tuning TGI for maximum performance requires a dedicated effort and differs for every model and for the metric you'd like to optimize.

# How to use TGI

To define a TGI Truss, we'll use the truss CLI to generate the scaffold. Run the following in your terminal:

```
truss init ./my-tgi-truss --backend TGI
```

Let's walk through what this command is doing:

- `truss init`: This CLI command initializes an empty scaffold for any of the model backends that truss supports (TGI, VLLM, Triton, and TrussServer)
- `./my-tgi-truss`: The `init` command requires a `target_directory`, which is where this scaffold will be generated
- `--backend TGI`: This tells truss that we are specifically interested in a TGI truss, which instantiates a different scaffold than the default (TrussServer)

After running that command, you'll be prompted for a `model_id`.

```
$ truss init ./my-tgi-truss --backend TGI
? model_id
```

One thing that makes TGI support in Truss a bit different from the default is that there isn't any model instantiation or inference code to write! Because we leverage TGI under the hood, TGI handles loading the model and running inference. All you need to do is pass in the model ID as it appears on Hugging Face.

After typing in your model ID, you'll be prompted for the `endpoint` you'd like to use.

```
$ truss init ./my-tgi-truss --backend TGI
? model_id facebook/opt-125M
? endpoint
```

TGI supports 2 endpoints:
- `generate`, which returns the entire generated response upon completion
- `generate_stream`, which streams the response as it's being generated

You can press the `tab` key at any of these prompts to see the available options.

Finally, you'll be asked for the name of your model.

# Deploying your TGI model

Now that we have a TGI model, let's deploy it to Baseten and see how it performs.

You'll need an API key to deploy your model. You can get one by navigating to your Baseten settings [page](https://app.baseten.co/settings/account/api_keys). To push the model to Baseten, run the following command:

```
$ truss push --publish
```

Let's walk through what this command is doing:

- `truss push`: This CLI command zips and uploads your Truss to Baseten
- `--publish`: This tells Baseten that you want a production deployment (vs. a development deployment)

After running this command, you'll be prompted to pass in your API key. Once you've done that, you can view your deployment status from the Baseten dashboard. Once the model has been deployed, you can invoke it by running the following command:

```
$ truss predict -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128, "do_sample": true}}' --published
```

Let's walk through what this command is doing:

- `truss predict`: This CLI command invokes your model
- `-d`: This tells the CLI that the next argument is the data you want to pass to your model
- `'{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128, "do_sample": true}}'`: This is the data you want to pass to your model. TGI expects a JSON object with 2 keys: `inputs` and, optionally, `parameters`. `inputs` is the prompt you want to pass to your model and `parameters` is a JSON object of generation parameters.
- `--published`: This tells the CLI that you want to invoke the production deployment (vs. a development deployment)

# Tuning your TGI server

After deploying your model, you may notice that you're not getting the performance you'd like out of TGI. If you navigate to the `target_directory` from above, you'll find a `config.yaml` file that contains a key `build`. The following is a set of arguments you can pass to the `build` key to tune TGI for maximum performance.

1. `max_input_length` (default: 1024)

This parameter represents the maximum allowed input length, expressed in number of tokens.

If you expect longer sequences as input, you might need to adjust this parameter accordingly, keeping in mind that longer sequences increase the memory required to handle the load. If you know that you'll send less than the maximum length allowed by the model, it's advantageous to set this parameter. When TGI starts up, it calculates how much memory to reserve per request, and setting this value allows TGI to allocate the correct amount of memory. In short, higher values take up more GPU memory but afford you the ability to send longer prompts.

2. `max_total_tokens` (default: 2048)

This is a crucial parameter defining the "memory budget" of a request: the total of input tokens plus `max_new_tokens` that a request can use. For example, with a value of 2048, a user can send either a prompt of 1000 tokens and ask for 1048 new tokens, or send a prompt of 1 token and ask for 2047 new tokens.

If you encounter memory limitations or need to optimize memory usage based on client requirements, adjusting this parameter can help manage memory efficiently. Higher values take up more GPU memory but allow longer prompts plus generated text. The tradeoff is that higher values will increase throughput but also increase individual request latency.

3. `max_batch_prefill_tokens` (default: 4096)

TGI splits the generation process into 2 phases: prefill and generation. The prefill phase reads in the input tokens and generates the first output token. This parameter controls how many tokens can be processed at once during the prefill phase across a batch of requests. This value should be __at least__ the value of `max_input_length`.

Similar to max input length, if your input tokens are constrained, this is worth setting as a function of (constrained input length) * (max batch size your hardware can handle). This setting is also worth defining when you want to impose stricter controls on resource usage during prefill, especially when dealing with models with a large footprint or under constrained hardware environments.

4. `max_batch_total_tokens`

Similar to `max_batch_prefill_tokens`, this represents the entire token count across a batch: the total input tokens plus the total generated tokens. In short, this value should be the top end of the number of tokens that can fit on the GPU after the model has been loaded.

This value is particularly important for maximizing GPU utilization. The tradeoff is that higher values will increase throughput but also increase individual request latency.

5. `max_waiting_tokens`

This setting defines how many tokens can be processed before waiting queries are forced into the running batch. Adjusting this value helps optimize overall latency for end users by managing how quickly waiting queries are allowed a slot in the running batch.

Tune this parameter when optimizing for end-user latency and to manage the compute allocation between prefill and decode operations efficiently. The tradeoff is that higher values will increase throughput but also increase individual request latency because of the additional time it takes to fill up the batch.

6. `sharded`

This setting defines whether to shard the model across multiple GPUs. Use this when deploying on a multi-GPU setup and you want to improve resource utilization and potentially the throughput of the server. You'd want to disable this if you're loading other models on another GPU.
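
Putting these together, a tuned `build` section might look something like the following. This is a sketch, not a drop-in config: the exact layout of the generated `config.yaml` (for example the `model_server` and `arguments` keys shown here) and the numbers themselves are assumptions you should check against the scaffold that `truss init` generated for you.

```yaml config.yaml
build:
  model_server: TGI          # assumption: set by `truss init --backend TGI`
  arguments:
    model_id: facebook/opt-125M
    endpoint: generate_stream
    # Tuning arguments discussed above; values are illustrative only.
    max_input_length: 1024
    max_total_tokens: 2048
    max_batch_prefill_tokens: 4096
    max_batch_total_tokens: 16000
    max_waiting_tokens: 20
    sharded: false
```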

docs/mint.json (+10 -8)

```diff
@@ -52,6 +52,16 @@
       "usage"
     ]
   },
+  {
+    "group": "Guides",
+    "pages": [
+      "guides/secrets",
+      "guides/base-images",
+      "guides/model-cache",
+      "guides/concurrency",
+      "guides/tgi"
+    ]
+  },
   {
     "group": "Examples",
     "pages": [
@@ -64,14 +74,6 @@
       "examples/performance/vllm-server"
     ]
   },
-  {
-    "group": "Guides",
-    "pages": [
-      "guides/secrets",
-      "guides/base-images",
-      "guides/concurrency"
-    ]
-  },
   {
     "group": "Remotes",
     "pages": [
```

pyproject.toml (+1 -1)

```diff
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "truss"
-version = "0.7.9"
+version = "0.7.11"
 description = "A seamless bridge from model development to model delivery"
 license = "MIT"
 readme = "README.md"
```

truss/cli/cli.py (+9 -19)

```diff
@@ -3,15 +3,13 @@
 import logging
 import os
 import sys
-import webbrowser
 from functools import wraps
 from pathlib import Path
 from typing import Callable, Optional
 
 import rich
 import rich_click as click
 import truss
-from InquirerPy import inquirer
 from truss.cli.create import ask_name, select_server_backend
 from truss.remote.baseten.core import (
     ModelId,
@@ -228,15 +226,10 @@ def watch(
         )
         sys.exit(1)
 
-    logs_url = remote_provider.get_remote_logs_url(model_name)  # type: ignore[attr-defined]
+    service = remote_provider.get_service(model_identifier=ModelName(model_name))
+    logs_url = remote_provider.get_remote_logs_url(service)
     rich.print(f"🪵 View logs for your deployment at {logs_url}")
-    if not logs:
-        logs = inquirer.confirm(
-            message="🗂 Open logs in a new tab?", default=True
-        ).execute()
-    if logs:
-        webbrowser.open_new_tab(logs_url)
-    remote_provider.sync_truss_to_dev_version_by_name(model_name, target_directory)  # type: ignore
+    remote_provider.sync_truss_to_dev_version_by_name(model_name, target_directory)
 
 
 def _extract_and_validate_model_identifier(
@@ -349,7 +342,9 @@ def predict(
 
     request_data = _extract_request_data(data=data, file=file)
 
-    service = remote_provider.get_baseten_service(model_identifier, published)  # type: ignore
+    service = remote_provider.get_service(
+        model_identifier=model_identifier, published=published
+    )
     result = service.predict(request_data)
     if inspect.isgenerator(result):
         for chunk in result:
@@ -414,11 +409,11 @@ def push(
     tr.spec.config.write_to_yaml_file(tr.spec.config_path, verbose=False)
 
     # TODO(Abu): This needs to be refactored to be more generic
-    _ = remote_provider.push(tr, model_name, publish=publish, trusted=trusted)  # type: ignore
+    service = remote_provider.push(tr, model_name, publish=publish, trusted=trusted)  # type: ignore
 
     click.echo(f"✨ Model {model_name} was successfully pushed ✨")
 
-    if not publish:
+    if service.is_draft:
         draft_model_text = """
 |---------------------------------------------------------------------------------------|
 | Your model has been deployed as a draft. Draft models allow you to                     |
@@ -433,13 +428,8 @@ def push(
 
     click.echo(draft_model_text)
 
-    logs_url = remote_provider.get_remote_logs_url(model_name, publish)  # type: ignore[attr-defined]
+    logs_url = remote_provider.get_remote_logs_url(service)  # type: ignore[attr-defined]
     rich.print(f"🪵 View logs for your deployment at {logs_url}")
-    should_open_logs = inquirer.confirm(
-        message="🗂 Open logs in a new tab?", default=True
-    ).execute()
-    if should_open_logs:
-        webbrowser.open_new_tab(logs_url)
 
 
 @truss_cli.command()
```
