---
title: Serve LLM models with TGI
description: "Deploy a language model using TGI"
---

[TGI](https://github.com/huggingface/text-generation-inference/tree/main) is a model server optimized for language models.

<Tip>
You can see the config for the finished model on the right. Keep reading for step-by-step instructions on how to generate it.
</Tip>

This example will cover:

1. Generating the base Truss
2. Setting sufficient model resources for inference
3. Deploying the model

### Step 1: Generating the base Truss

Get started by creating a new Truss:

```sh
truss init --backend TGI opt125
```

You're going to see a few prompts. Follow along with the instructions below (the `build` section this produces is sketched after the list):
1. Type `facebook/opt-125M` when prompted for `model`.
2. Press the `tab` key when prompted for `endpoint`. Select the `generate_stream` endpoint.
3. Give your model a name like `OPT-125M`.
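
Answering these prompts fills in the `build` section of the generated `config.yaml`. It should look roughly like the sketch below (the full finished file is shown on the right):

```yaml config.yaml
build:
  arguments:
    endpoint: generate_stream
    model: facebook/opt-125M
  model_server: TGI
model_name: OPT-125M
```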

Finally, navigate to the directory:

```sh
cd opt125
```

### Step 2: Setting resources and other arguments

You'll notice that there's a `config.yaml` in the new directory. This is where we'll set the resources and other arguments for the model. Open the file in your favorite editor.

OPT-125M will need a GPU, so let's set the correct resources. Update the `resources` key with the following:

```yaml config.yaml
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
```

Also notice the `build` key, which contains the `model_server` we're using as well as other arguments. These arguments are passed to the underlying [TGI](https://github.com/huggingface/text-generation-inference/tree/main) server.
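
If you want to tune the server itself, additional entries can go under `build.arguments` alongside `model` and `endpoint`. The sketch below is illustrative only: the `max_input_length` and `max_total_tokens` keys assume TGI launcher options map to underscore-named arguments, so check the launcher options for your TGI version before relying on them.

```yaml config.yaml
build:
  arguments:
    endpoint: generate_stream
    model: facebook/opt-125M
    # Assumed TGI launcher options -- verify the names against your TGI version
    max_input_length: 1024
    max_total_tokens: 2048
  model_server: TGI
```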

### Step 3: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step.
</Note>

Let's deploy our OPT-125M TGI model.

```sh
truss push
```

You can invoke the model with:

```sh
truss predict -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128}}'
```
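
Once the deployment finishes, you can also call the model over HTTP without the Truss CLI. This is a minimal sketch assuming Baseten's standard REST invocation format and a production deployment; `{model_id}` is a placeholder for the id shown for your model in the Baseten dashboard, and `$BASETEN_API_KEY` is your API key.

```sh
# {model_id} is a placeholder -- copy your model's id from the Baseten dashboard
curl -X POST "https://model-{model_id}.api.baseten.co/production/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128}}'
```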

<RequestExample>

```yaml config.yaml
build:
  arguments:
    endpoint: generate_stream
    model: facebook/opt-125M
  model_server: TGI
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: OPT-125M
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
```

</RequestExample>