
Commit b906180

Merge pull request #689 from basetenlabs/bump-version-0.7.9
Release 0.7.9
2 parents ff2e1ae + 38d4aad commit b906180

26 files changed (+285, -69 lines)

docs/examples/performance/tgi-server.mdx

+12-12
@@ -20,43 +20,43 @@ This example will cover:
 
 Get started by creating a new Truss:
 
 ```sh
-truss init --backend TGI opt125
+truss init --backend TGI falcon-7b
 ```
 
 You're going to see a couple of prompts. Follow along with the instructions below:
-1. Type `facebook/opt-125M` when prompted for `model`.
+1. Type `tiiuae/falcon-7b` when prompted for `model`.
 2. Press the `tab` key when prompted for `endpoint`. Select the `generate_stream` endpoint.
-3. Give your model a name like `OPT-125M`.
+3. Give your model a name like `Falcon 7B`.
 
 Finally, navigate to the directory:
 
 ```sh
-cd opt125
+cd falcon-7b
 ```
 
 ### Step 2: Setting resources and other arguments
 
 You'll notice that there's a `config.yaml` in the new directory. This is where we'll set the resources and other arguments for the model. Open the file in your favorite editor.
 
-OPT-125M will need a GPU so let's set the correct resources. Update the `resources` key with the following:
+Falcon 7B will need a GPU so let's set the correct resources. Update the `resources` key with the following:
 
 ```yaml config.yaml
 resources:
-  accelerator: T4
+  accelerator: A10G
   cpu: "4"
   memory: 16Gi
   use_gpu: true
 ```
-Also notice the `build` key which contains the `model_server` we're using as well as other arguments. These arguments are passed to the underlying vLLM server which you can find [here](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py).
+Also notice the `build` key which contains the `model_server` we're using as well as other arguments. These arguments are passed to the underlying TGI server.
 
 ### Step 3: Deploy the model
 
 <Note>
 You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step.
 </Note>
 
-Let's deploy our OPT-125M vLLM model.
+Let's deploy our Falcon 7B TGI model.
 
 ```sh
 truss push
@@ -65,7 +65,7 @@ truss push
 You can invoke the model with:
 
 ```sh
-truss predict -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128, "sample": true}} --published'
+truss predict -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128, "sample": true}}' --published
 ```
 
 <RequestExample>
@@ -74,16 +74,16 @@ truss predict -d '{"inputs": "What is a large language model?", "parameters": {"
 build:
   arguments:
     endpoint: generate_stream
-    model: facebook/opt-125M
+    model: tiiuae/falcon-7b
   model_server: TGI
 environment_variables: {}
 external_package_dirs: []
 model_metadata: {}
-model_name: OPT-125M
+model_name: Falcon 7B
 python_version: py39
 requirements: []
 resources:
-  accelerator: T4
+  accelerator: A10G
   cpu: "4"
   memory: 16Gi
   use_gpu: true
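
For reference, the request body shown above can also be sent to the deployed model over plain HTTP. The sketch below is illustrative only: the model ID placeholder and the endpoint/header format are assumptions about Baseten's model-invocation API, not something specified in this diff.

```python
# Illustrative only: MODEL_ID is a placeholder, and the URL/header format is
# an assumption about Baseten's model-invocation API, not taken from this diff.
import os
import requests

resp = requests.post(
    "https://model-MODEL_ID.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={
        "inputs": "What is a large language model?",
        "parameters": {"max_new_tokens": 128, "sample": True},
    },
    timeout=60,
)
print(resp.text)
```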

docs/guides/concurrency.mdx

+117
@@ -0,0 +1,117 @@
+---
+title: "How to configure concurrency"
+description: "A guide to setting concurrency for your model"
+---
+
+Configuring concurrency is one of the major knobs available for getting the most performance
+out of your model. In this doc, we'll cover the options that are available to you.
+
+# What is concurrency, and why configure it?
+
+At a very high level, "concurrency" in this context refers to how many requests a single replica can
+process at the same time. There are no right answers to what this number ought to be -- the specifics
+of your model and the metrics you are optimizing for (throughput? latency?) matter a lot for determining this.
+
+In Baseten & Truss, there are two notions of concurrency:
+* **Concurrency Target** -- the number of requests that will be sent to a model at the same time
+* **Predict Concurrency** -- once requests have made it onto the model container, the "predict concurrency" governs how many
+requests can go through the `predict` function on your Truss at once.
+
+# Concurrency Target
+
+The concurrency target is set in the Baseten UI and, to reiterate, governs the maximum number of requests that will be sent
+to a single model replica.
+
+<Frame>
+<img src="/images/concurrency-target-picture.png" />
+</Frame>
+
+An important note about this setting is that it is also used as part of the autoscaling parameters. If all replicas have
+hit their Concurrency Target, this triggers Baseten's autoscaling.
+
+Let's dive into a concrete example:
+
+<Frame>
+<img src="/images/concurrency-flow-chart-high-level.png" />
+</Frame>
+
+Let's say that there is a single replica of a model, and the concurrency target is 2. If 5 requests come in, the first 2 will
+be sent to the replica, and the other 3 get queued up. As the requests on the container complete, the queued-up
+requests will make it to the model container.
+
+<Note>
+Remember that if all replicas have hit their concurrency target, this will trigger autoscaling. So in this specific example,
+the queuing of requests 3-5 will trigger another replica to come up, if the model has not hit its max replicas yet.
+</Note>
+
+
+# Predict Concurrency
+
+Alright, so we've talked about the **Concurrency Target** feature that governs how many requests will be sent to a model at once.
+Predict concurrency is a bit different -- it operates at the level of the model container and governs how many requests will go
+through the `predict` function concurrently.
+
+To get a sense for why this matters, let's recap the structure of a Truss:
+
+```python model.py
+class Model:
+
+    def __init__(self):
+        ...
+
+    def preprocess(self, request):
+        ...
+
+    def predict(self, request):
+        ...
+
+    def postprocess(self, response):
+        ...
+```
+
+In this Truss model, there are three functions that are called in order to serve a request:
+* **preprocess** -- this function is used to perform any prework / modifications on the request before the `predict` function
+runs. For instance, if you are running an image classification model and need to download images from S3, this is a good place
+to do it.
+* **predict** -- this function is where the actual inference happens. It is typically where the logic that runs on the GPU lives.
+* **postprocess** -- this function is used to perform any postwork / modifications on the response before it is returned to the
+user. For instance, if you are running a text-to-image model, this is a good place to implement the logic for uploading an image
+to S3.
+
+Given these three functions and the work each one does, you might want different
+levels of concurrency for the `predict` function. The most common need here is to limit access to the GPU, since multiple
+requests running on the GPU at the same time could cause serious degradation in performance.
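
To make the division of labor described in the new guide concrete, here is an illustrative sketch of a Truss model that keeps network I/O in `preprocess`/`postprocess` and GPU-bound work in `predict`. The dummy classifier and the `image_url` field are placeholders, not part of Truss.

```python
# Illustrative sketch only -- the request schema, image download, and dummy
# classifier are placeholders; they are not part of Truss itself.
import requests

class Model:
    def __init__(self):
        # Placeholder for a real model loaded onto the GPU.
        self._classify = lambda image_bytes: {"label": "cat", "score": 0.98}

    def preprocess(self, request):
        # Network-bound prework: fetch the image referenced in the request.
        image_bytes = requests.get(request["image_url"], timeout=30).content
        return {"image": image_bytes}

    def predict(self, inputs):
        # GPU-bound work only: run inference on the prepared input.
        return self._classify(inputs["image"])

    def postprocess(self, response):
        # Postwork: shape the response (uploading results could also happen here).
        return {"prediction": response}
```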
+
+Unlike **Concurrency Target**, which is configured in the Baseten UI, **Predict Concurrency** is configured as part
+of the Truss config (in the `config.yaml` file).
+
+```yaml config.yaml
+model_name: "My model with concurrency limits"
+...
+runtime:
+  predict_concurrency: 2 # the default is 1
+...
+```
+
+To better understand this, let's use a specific example:
+
+<Frame>
+<img src="/images/concurrency-flow-model-pod.png" />
+</Frame>
+
+Let's say predict concurrency is 1.
+1. Two requests come in to the pod.
+2. Both requests will begin preprocessing immediately (let's say,
+downloading images from S3).
+3. Once the first request finishes preprocessing, it will begin running on the GPU. The second request
+will then remain queued until the first request finishes running on the GPU in `predict`.
+4. After the first request finishes, the second request will begin being processed on the GPU.
+5. Once the second request finishes, it will begin postprocessing, even if the first request is not done postprocessing.
+
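
The queueing behavior described in steps 1-5 above is essentially what you get by guarding only the `predict` stage with a semaphore. The sketch below is purely illustrative (it is not Truss's implementation); it simulates the three stages with sleeps so the ordering can be observed.

```python
# Illustrative model of the behavior above -- not Truss internals.
import asyncio

PREDICT_CONCURRENCY = 1
predict_slot = asyncio.Semaphore(PREDICT_CONCURRENCY)

async def handle(request_id: int) -> None:
    # Preprocessing starts immediately for every request (e.g. S3 downloads).
    await asyncio.sleep(0.1)
    print(f"request {request_id}: preprocessed")

    # Only PREDICT_CONCURRENCY requests may run predict (the GPU stage) at once.
    async with predict_slot:
        await asyncio.sleep(0.5)
        print(f"request {request_id}: predicted")

    # Postprocessing is unrestricted again.
    await asyncio.sleep(0.1)
    print(f"request {request_id}: postprocessed")

async def main() -> None:
    await asyncio.gather(handle(1), handle(2))

asyncio.run(main())
```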
+To reiterate, predict concurrency is really great to use if you want to protect your GPU resource on your model pod,
+while still allowing high concurrency for the pre- and post-processing steps.
+
+<Note>
+Remember that to actually achieve the predict concurrency you desire, the Concurrency Target must be at least that amount,
+so that the requests make it to the model container.
+</Note>
New image files added for the concurrency guide (127 KB, 180 KB).

docs/mint.json

+2-1
@@ -68,7 +68,8 @@
       "group": "Guides",
       "pages": [
         "guides/secrets",
-        "guides/base-images"
+        "guides/base-images",
+        "guides/concurrency"
       ]
     },
     {

docs/reference/cli/init.mdx

+4-9
@@ -7,28 +7,23 @@ description: "Create a new Truss."
 truss init [OPTIONS] TARGET_DIRECTORY
 ```
 
-### Options
-
-<ParamField body="-t, --trainable">
-Create a trainable truss. Deprecated.
-</ParamField>
+## Options
 
 <ParamField body="-b, --backend" type="TrussServer|TGI|VLLM">
 What type of server to create. Default: `TrussServer`.
 </ParamField>
-
 <ParamField body="--help">
 Show help message and exit.
 </ParamField>
 
-### Arguments
+## Arguments
 
 <ParamField body="TARGET_DIRECTORY" type="str">
-A Truss is created in this directory
+A Truss is created in this directory.
 </ParamField>
 
 
-### Example
+## Example
 
 ```
 truss init whisper-truss

docs/reference/cli/predict.mdx

+12-7
@@ -7,11 +7,8 @@ description: "Invokes the packaged model."
 truss predict [OPTIONS]
 ```
 
-### Options
+## Options
 
-<ParamField body="--target_directory" type="TEXT">
-A Truss directory. If none, use current directory.
-</ParamField>
 <ParamField body="--remote" type="TEXT">
 Name of the remote in .trussrc to patch changes to.
 </ParamField>
@@ -21,15 +18,23 @@ String formatted as json that represents request.
 <ParamField body="-f, --file" type="PATH">
 Path to json file containing the request.
 </ParamField>
-<ParamField body="--published">
-Invoked the published model version.
+<ParamField body="--model_version" type="TEXT">
+ID of model version to invoke.
+</ParamField>
+<ParamField body="--model" type="TEXT">
+ID of model to invoke.
 </ParamField>
 <ParamField body="--help">
 Show help message and exit.
 </ParamField>
 
+## Arguments
+
+<ParamField body="TARGET_DIRECTORY" type="Optional">
+A Truss directory. If none, use current directory.
+</ParamField>
 
-### Examples
+## Examples
 
 ```
 truss predict -d '{"prompt": "What is the meaning of life?"}'

docs/reference/cli/push.mdx

+6-6
@@ -7,29 +7,29 @@ description: "Pushes a truss to a TrussRemote."
 truss push [OPTIONS] [TARGET_DIRECTORY]
 ```
 
-### Options
+## Options
 
 <ParamField body="--remote" type="TEXT">
-Name of the remote in .trussrc to patch changes to
+Name of the remote in .trussrc to patch changes to.
 </ParamField>
-<ParamField body="--publish">
+<ParamField body="--publish" type="BOOL">
 Publish truss as production deployment.
 </ParamField>
-<ParamField body="--trusted">
+<ParamField body="--trusted" type="BOOL">
 Give Truss access to secrets on remote host.
 </ParamField>
 <ParamField body="--help">
 Show help message and exit.
 </ParamField>
 
-### Arguments
+## Arguments
 
 <ParamField body="TARGET_DIRECTORY" type="Optional">
 A Truss directory. If none, use current directory.
 </ParamField>
 
 
-### Examples
+## Examples
 
 ```
 truss push

docs/reference/cli/watch.mdx

+4-2
@@ -10,9 +10,11 @@ truss watch [OPTIONS] [TARGET_DIRECTORY]
 ### Options
 
 <ParamField body="--remote" type="TEXT">
-Name of the remote in .trussrc to patch changes to
+Name of the remote in .trussrc to patch changes to.
+</ParamField>
+<ParamField body="--logs" type="BOOL">
+Automatically open remote logs tab.
 </ParamField>
-
 <ParamField body="--help">
 Show help message and exit.
 </ParamField>
poetry.lock

+26-1
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

pyproject.toml

+2-1
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
[tool.poetry]
22
name = "truss"
3-
version = "0.7.8"
3+
version = "0.7.9"
44
description = "A seamless bridge from model development to model delivery"
55
license = "MIT"
66
readme = "README.md"
@@ -82,6 +82,7 @@ pytest-split = "^0.8.1"
8282
httpx = {extras = ["cli"], version = "^0.24.1"}
8383
requests-mock = "^1.11.0"
8484
flask = "^2.3.3"
85+
types-requests = "2.31.0.2"
8586

8687
[build-system]
8788
requires = ["poetry-core>=1.2.1"]
