
Commit 1dc9be5

Author: Varun Shenoy
Cache Refactor and Improvements (#710)
* init s3 caching
* update toml to test on dev
* fix gcs tests + add s3 tests
* cleanup
* add boto to deps
* update pyproject to include boto
* bump dev
* update poetry lock
* public s3 buckets are working
* update dev
* bump rc
* refactored cache warmer
* public s3 + code refactors
* can pull from public GCS now
* clean up
* filter by path for buckets
* add model_cache aliasing
* new version
* convert hf_cache to model_cache
* add info callout on model_cache
* add warning on hf_cache key
* add warning on hf_cache key
* add warning on hf_cache key
* added some changes, still need to add typing
* renamed static file
* clean up constants
* abstract out cache files
* clean up strings
* refactor serving image builder
* added fixes
* refactor credentials
* add typing
* add typing
* bump dev
* fixed credentials_to_cache, time to retest
* bump toml
1 parent 97e816a commit 1dc9be5

25 files changed: +626 -314 lines

docs/_snippets/config-params.mdx

+7 -7
@@ -231,36 +231,36 @@ Either `VLLM` for vLLM, or `TGI` for TGI.
 The arguments for the model server. This includes information such as which model you intend to load, and
 which endpoint from the server you'd like to use.
 
-### `hf_cache`
+### `model_cache`
 
-The `hf_cache` section is used for caching model weights at build-time. This is one of the biggest levers
+The `model_cache` section is used for caching model weights at build-time. This is one of the biggest levers
 for decreasing cold start times, as downloading weights can be one of the lengthiest parts of starting a new
 model instance. Using this section ensures that model weights are cached at _build_ time.
 
 See the [model cache guide](guides/model-cache) for the full details on how to use this field.
 
 <Note>
-Despite the fact that this field is called the `hf_cache`, there are multiple backends supported, not just Hugging Face. You can
+Despite the fact that this field is called the `model_cache`, there are multiple backends supported, not just Hugging Face. You can
 also cache weights stored on GCS, for instance.
 </Note>
 
-#### `hf_cache.<list_item>.repo_id`
+#### `model_cache.<list_item>.repo_id`
 
 The endpoint for your cloud bucket. Currently, we support Hugging Face and Google Cloud Storage.
 
 Example: `madebyollin/sdxl-vae-fp16-fix` for a Hugging Face repo, or `gcs://path-to-my-bucket` for
 a GCS bucket.
 
-#### `hf_cache.<list_item>.revision`
+#### `model_cache.<list_item>.revision`
 
 Points to your revision. This is only relevant if you are pulling from a Hugging Face repo. By default, it refers to `main`.
 
-#### `hf_cache.<list_item>.allow_patterns`
+#### `model_cache.<list_item>.allow_patterns`
 
 Only cache files that match specified patterns. Utilize Unix shell-style wildcards to denote these patterns.
 By default, all paths are included.
 
-#### `hf_cache.<list_item>.ignore_patterns`
+#### `model_cache.<list_item>.ignore_patterns`
 
 Conversely, you can also denote file patterns to ignore, hence streamlining the caching process.
 By default, nothing is ignored.
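Taken together, a single `model_cache` entry that uses all four documented fields might look like the following sketch (the `repo_id` is taken from the examples in this commit; the revision and patterns are illustrative):

```yaml
model_cache:
- repo_id: madebyollin/sdxl-vae-fp16-fix  # Hugging Face repo (or gcs://... bucket) to cache at build time
  revision: main                          # optional; defaults to main
  allow_patterns:                         # only cache files matching these Unix-style wildcards
  - "*.json"
  - "*.safetensors"
  ignore_patterns:                        # skip files matching these wildcards
  - "*.bin"
```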

docs/examples/04-image-generation.mdx

+1 -1
@@ -215,7 +215,7 @@ subsequently.
 
 To enable caching, add the following to the config:
 ```yaml
-hf_cache:
+model_cache:
 - repo_id: madebyollin/sdxl-vae-fp16-fix
   allow_patterns:
   - config.json

docs/examples/06-high-performance-cached-weights.mdx

+4 -4
@@ -89,9 +89,9 @@ requirements:
 - sentencepiece==0.1.99
 - protobuf==4.24.4
 ```
-# Configuring the hf_cache
+# Configuring the model_cache
 
-To cache model weights, set the `hf_cache` key.
+To cache model weights, set the `model_cache` key.
 The `repo_id` field allows you to specify a Huggingface
 repo to pull down and cache at build-time, and the `ignore_patterns`
 field allows you to specify files to ignore. If this is specified, then
@@ -100,7 +100,7 @@ this repo won't have to be pulled during runtime.
 Check out the [guide](https://truss.baseten.co/guides/model-cache) for more info.
 
 ```yaml config.yaml
-hf_cache:
+model_cache:
 - repo_id: "NousResearch/Llama-2-7b-chat-hf"
   ignore_patterns:
   - "*.bin"
@@ -197,7 +197,7 @@ requirements:
 - transformers==4.34.0
 - sentencepiece==0.1.99
 - protobuf==4.24.4
-hf_cache:
+model_cache:
 - repo_id: "NousResearch/Llama-2-7b-chat-hf"
   ignore_patterns:
   - "*.bin"

docs/examples/performance/cached-weights.mdx

+4 -4
@@ -3,7 +3,7 @@ title: Deploy Llama 2 with Caching
 description: "Enable fast cold starts for a model with private Hugging Face weights"
 ---
 
-In this example, we will cover how you can use the `hf_cache` key in your Truss's `config.yml` to automatically bundle model weights from a private Hugging Face repo.
+In this example, we will cover how you can use the `model_cache` key in your Truss's `config.yml` to automatically bundle model weights from a private Hugging Face repo.
 
 <Tip>
 Bundling model weights can significantly reduce cold start times because your instance won't waste time downloading the model weights from Hugging Face's servers.
@@ -116,10 +116,10 @@ Always pin exact versions for your Python dependencies. The ML/AI space moves fast!
 
 ### Step 3: Configure Hugging Face caching
 
-Finally, we can configure Hugging Face caching in `config.yaml` by adding the `hf_cache` key. When building the image for your Llama 2 deployment, the Llama 2 model weights will be downloaded and cached for future use.
+Finally, we can configure Hugging Face caching in `config.yaml` by adding the `model_cache` key. When building the image for your Llama 2 deployment, the Llama 2 model weights will be downloaded and cached for future use.
 
 ```yaml config.yaml
-hf_cache:
+model_cache:
 - repo_id: "meta-llama/Llama-2-7b-chat-hf"
   ignore_patterns:
   - "*.bin"
@@ -163,7 +163,7 @@ requirements:
 - safetensors==0.3.2
 - torch==2.0.1
 - transformers==4.30.2
-hf_cache:
+model_cache:
 - repo_id: "NousResearch/Llama-2-7b-chat-hf"
   ignore_patterns:
   - "*.bin"

docs/guides/model-cache.mdx

+57 -8
@@ -18,17 +18,20 @@ In practice, this reduces the cold start for large models to just a few seconds.
 
 ## Enabling Caching for a Model
 
-To enable caching, simply add `hf_cache` to your `config.yml` with a valid `repo_id`. The `hf_cache` has a few key configurations:
+To enable caching, simply add `model_cache` to your `config.yml` with a valid `repo_id`. The `model_cache` has a few key configurations:
 - `repo_id` (required): The endpoint for your cloud bucket. Currently, we support Hugging Face and Google Cloud Storage.
 - `revision`: Points to your revision. This is only relevant if you are pulling from a Hugging Face repo. By default, it refers to `main`.
 - `allow_patterns`: Only cache files that match specified patterns. Utilize Unix shell-style wildcards to denote these patterns.
 - `ignore_patterns`: Conversely, you can also denote file patterns to ignore, hence streamlining the caching process.
 
-Here is an example of a well written `hf_cache` for Stable Diffusion XL. Note how it only pulls the model weights that it needs using `allow_patterns`.
+<Info>We recently renamed `hf_cache` to `model_cache`, but don't worry! If you're using `hf_cache` in any of your projects, it will automatically be aliased to `model_cache`.</Info>
+
+
+Here is an example of a well-written `model_cache` for Stable Diffusion XL. Note how it only pulls the model weights it needs using `allow_patterns`.
 
 ```yaml config.yml
 ...
-hf_cache:
+model_cache:
 - repo_id: madebyollin/sdxl-vae-fp16-fix
   allow_patterns:
   - config.json
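To illustrate the aliasing described in the new `<Info>` callout above: an existing config that still uses the legacy key should keep working unchanged. A sketch, assuming the alias behaves as the callout describes:

```yaml config.yml
...
hf_cache:  # legacy key, automatically treated as model_cache
- repo_id: madebyollin/sdxl-vae-fp16-fix
  allow_patterns:
  - config.json
...
```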
@@ -51,7 +54,7 @@ Many Hugging Face repos have model weights in different formats (`.bin`, `.safetensors`, …)
 There are also some additional steps depending on the cloud bucket you want to query.
 
 ### Hugging Face 🤗
-For any public Hugging Face repo, you don't need to do anything else. Adding the `hf_cache` key with an appropriate `repo_id` should be enough.
+For any public Hugging Face repo, you don't need to do anything else. Adding the `model_cache` key with an appropriate `repo_id` should be enough.
 
 However, if you want to deploy a model from a gated repo like [Llama 2](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) to Baseten, there are a few steps you need to take:
 <Steps>
@@ -86,15 +89,17 @@ Weights will be cached in the default Hugging Face cache directory, `~/.cache/huggingface`.
 ### Google Cloud Storage
 Google Cloud Storage is a great alternative to Hugging Face when you have a custom model or fine-tune you want to gate, especially if you are already using GCP and care about security and compliance.
 
-Your `hf_cache` should look something like this:
+Your `model_cache` should look something like this:
 
 ```yaml config.yml
 ...
-hf_cache:
+model_cache:
   repo_id: gcs://path-to-my-bucket
 ...
 ```
 
+If you are accessing a public GCS bucket, you can ignore the following steps, but make sure you set appropriate permissions on your bucket. Users should be able to list and view all files. Otherwise, the model build will fail.
+
 For a private GCS bucket, first export your service account key. Rename it to `service_account.json` and add it to the `data` directory of your Truss.
 
 Your file structure should look something like this:
@@ -111,9 +116,53 @@ your-truss
 If you are using version control, like git, for your Truss, make sure to add `service_account.json` to your `.gitignore` file. You don't want to accidentally expose your service account key.
 </Warning>
 
-Weights will be cached at `/app/hf_cache/{your_bucket_name}`.
+Weights will be cached at `/app/model_cache/{your_bucket_name}`.
+
+
+### Amazon Web Services S3
+
+Another popular cloud storage option for hosting model weights is AWS S3, especially if you're already using AWS services.
+
+Your `model_cache` should look something like this:
+
+```yaml config.yml
+...
+model_cache:
+  repo_id: s3://path-to-my-bucket
+...
+```
+
+If you are accessing a public S3 bucket, you can ignore the subsequent steps, but make sure you set an appropriate policy on your bucket. Users should be able to list and view all files. Otherwise, the model build will fail.
+
+However, for a private S3 bucket, you first need to find your `aws_access_key_id`, `aws_secret_access_key`, and `aws_region` in your AWS dashboard. Create a file named `s3_credentials.json`. Inside this file, add the credentials that you identified earlier, as shown below. Place this file in the `data` directory of your Truss.
+
+Here is an example of how your `s3_credentials.json` file should look:
+
+```json
+{
+    "aws_access_key_id": "YOUR-ACCESS-KEY",
+    "aws_secret_access_key": "YOUR-SECRET-ACCESS-KEY",
+    "aws_region": "YOUR-REGION"
+}
+```
+
+Your overall file structure should now look something like this:
+
+```
+your-truss
+|--model
+|  └── model.py
+|--data
+|  └── s3_credentials.json
+```
+
+<Warning>
+If you are using version control, like git, for your Truss, make sure to add `s3_credentials.json` to your `.gitignore` file. You don't want to accidentally expose your AWS credentials.
+</Warning>
+
+Weights will be cached at `/app/model_cache/{your_bucket_name}`.
 
 
 ### Other Buckets
 
-We're currently workign on adding support for more bucket types, including AWS S3. If you have any suggestions, please [leave an issue](https://github.com/basetenlabs/truss/issues) on our GitHub repo.
+We can work with you to support additional bucket types if needed. If you have any suggestions, please [leave an issue](https://github.com/basetenlabs/truss/issues) on our GitHub repo.
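For the public-bucket path described above ("Users should be able to list and view all files"), a minimal sketch of an S3 bucket policy granting anonymous list and read access; the bucket name is a placeholder, and you should scope any real policy to your own security requirements:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicListAndRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::path-to-my-bucket",
                "arn:aws:s3:::path-to-my-bucket/*"
            ]
        }
    ]
}
```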

pyproject.toml

+1 -1
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "truss"
-version = "0.7.15rc0"
+version = "0.7.15rc1"
 description = "A seamless bridge from model development to model delivery"
 license = "MIT"
 readme = "README.md"
