
Commit bb97a05

Add TGI documentation (#571)
* Add TGI documentation
* Update mint.json

Co-authored-by: Philip Kiely - Baseten <98474633+philipkiely-baseten@users.noreply.github.com>
1 parent fd9b905 commit bb97a05

2 files changed: +93 -2 lines changed

+92 -2
@@ -1,4 +1,94 @@
---
title: Serve LLM models with TGI
description: "Deploy a language model using TGI"
---

[TGI](https://github.com/huggingface/text-generation-inference/tree/main) is a model server optimized for language models.

<Tip>
You can see the config for the finished model on the right. Keep reading for step-by-step instructions on how to generate it.
</Tip>

This example will cover:

1. Generating the base Truss
2. Setting sufficient model resources for inference
3. Deploying the model

### Step 1: Generating the base Truss

Get started by creating a new Truss:

```sh
truss init --backend TGI opt125
```

You're going to see a couple of prompts. Follow along with the instructions below:

1. Type `facebook/opt-125M` when prompted for `model`.
2. Press the `tab` key when prompted for `endpoint`. Select the `generate_stream` endpoint.
3. Give your model a name like `OPT-125M`.

Finally, navigate to the directory:

```sh
cd opt125
```

### Step 2: Setting resources and other arguments

You'll notice that there's a `config.yaml` in the new directory. This is where we'll set the resources and other arguments for the model. Open the file in your favorite editor.

OPT-125M will need a GPU, so let's set the correct resources. Update the `resources` key with the following:

```yaml config.yaml
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
```

Also notice the `build` key, which contains the `model_server` we're using as well as other arguments. These arguments are passed to the underlying TGI server, which you can find [here](https://github.com/huggingface/text-generation-inference/tree/main).
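
For reference, here is the `build` section from the finished config shown at the end of this page; the nesting reflects how the values from the `truss init` prompts are stored:

```yaml config.yaml
build:
  arguments:
    endpoint: generate_stream
    model: facebook/opt-125M
  model_server: TGI
```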

### Step 3: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step.
</Note>

Let's deploy our OPT-125M TGI model.

```sh
truss push
```

You can invoke the model with:

```sh
truss predict -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128}}'
```
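
If you prefer to call the deployed model over HTTP instead of through the Truss CLI, a minimal `curl` sketch might look like the following. The model ID is hypothetical, and the `https://model-{model_id}.api.baseten.co/production/predict` URL format and `BASETEN_API_KEY` variable are assumptions for illustration; use the endpoint and API key shown for your own deployment.

```sh
# Hypothetical model ID and endpoint format; substitute the values for your deployment.
curl -X POST "https://model-abcd1234.api.baseten.co/production/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128}}'
```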

<RequestExample>

```yaml config.yaml
build:
  arguments:
    endpoint: generate_stream
    model: facebook/opt-125M
  model_server: TGI
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: OPT-125M
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
```

</RequestExample>

docs/mint.json

+1
@@ -59,6 +59,7 @@
       "examples/system-packages",
       "examples/streaming",
       "examples/performance/cached-weights",
+      "examples/performance/tgi-server",
       "examples/performance/vllm-server"
     ]
   },
