Docs: Add model and inline loading documentation
Sorely required due to the number of questions about how inline loading works.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
kingbri1 committed Feb 25, 2025
1 parent 35fe372 commit 7368867
108 changes: 100 additions & 8 deletions docs/03.-Usage.md
## Usage

TabbyAPI's main use case is to be an API server for running ExllamaV2 models.

### API Server

Currently TabbyAPI supports clients that use the [OpenAI](https://platform.openai.com/docs/api-reference) standard and [KoboldAI](https://lite.koboldai.net/koboldcpp_api)'s API.

In addition, the generation endpoints accept expanded parameters, and there are administrative endpoints for loading, unloading, loras, sampling overrides, and more.
> If you are a developer and want to add full TabbyAPI support to your app, it's recommended to use the [autogenerated documentation](https://theroyallab.github.io/tabbyAPI).

Below is an example CURL request using the OpenAI completions endpoint:

```bash
curl http://localhost:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Meta-Llama-3-8B-exl2",
    "prompt": "Once upon a time,",
    "max_tokens": 400,
    "stream": false
  }'
```

### Authentication

Every call to a TabbyAPI endpoint requires some form of authentication. Keys have two types of permissions:
- API: Accesses non-invasive endpoints (e.g. generation, model list fetching)
- Admin: Allowed to access protected endpoints that deal with resources (e.g. loading, unloading)

In addition, when calling list endpoints, API keys will only fetch the currently loaded object while admin keys will list the entire directory. For example, calling `/v1/models` will return a list of the user-configured models directory only if an admin key is passed.
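
For reference, below is a minimal sketch of passing a key with a request. It assumes TabbyAPI reads keys from the `x-api-key` and `x-admin-key` headers; the real values are generated into `api_tokens.yml` on first startup:

```bash
# Placeholder value; substitute the admin key from your api_tokens.yml
curl http://localhost:5000/v1/models \
  -H "x-admin-key: <your admin key>"
```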

Therefore, it's recommended to keep the admin key for yourself and only share the API key with users.

If these keys get compromised, shut down your server, delete the `api_tokens.yml` file, and restart. This will generate new keys which you can share with users.

To bypass authentication checks, set `disable_auth` to `true` in `config.yml`. However, turning off authentication without a third-party solution will make your instance open to the world.
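
As a rough sketch (assuming `disable_auth` sits under the `network` block; check your own `config.yml` for the exact location):

```yml
# Sketch only; enabling this exposes the instance to anyone who can reach it
network:
  disable_auth: true
```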

### Model loading

> [!IMPORTANT]
> All loading methods require an admin key.

TabbyAPI has a set of endpoints used for model loading and unloading:
- `/v1/model/load`: Loads a model from the configured `model_dir` with the provided parameters. If the provided model name is different from the currently loaded model, the existing model will be unloaded beforehand.
- `/v1/model/unload`: Unloads all model tensors and terminates all pending generation requests. Only use this endpoint in an emergency or if you want to free VRAM but leave the server running. A minimal sketch of this call follows the list below.
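
For completeness, here is a minimal sketch of an unload call, assuming the endpoint accepts a bodyless POST and an admin key in the `x-admin-key` header:

```bash
# Placeholder value; substitute the admin key from your api_tokens.yml
curl -X POST http://localhost:5000/v1/model/unload \
  -H "x-admin-key: <your admin key>"
```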

Please note that load requests are ephemeral and `config.yml` options will not apply. If you want to apply options from your config as fallback defaults, add them in the `use_as_default` key under model options.
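
For example, a rough sketch of what that might look like in `config.yml` (the exact layout of the `model` block may differ; `use_as_default` is assumed to take a list of load parameter names):

```yml
# Sketch of the model block in config.yml.
# use_as_default is assumed to take a list of load parameters whose
# configured values should persist as fallback defaults for API loads.
model:
  model_dir: models
  max_seq_len: 8192
  cache_mode: Q8
  use_as_default: ["max_seq_len", "cache_mode"]
```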

Below is an example CURL request using the model load endpoint:

```bash
curl http://localhost:5000/v1/model/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "Meta-Llama-3-8B-exl2",
    "max_seq_len": 8192,
    "tensor_parallel": true,
    "gpu_split_auto": false,
    "gpu_split": [20, 25],
    "cache_mode": "Q8"
  }'
```

A model load request can also include draft model parameters:

```bash
curl http://localhost:5000/v1/model/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "Meta-Llama-3-8B-exl2",
    ... Other parameters
    "draft_model": {
      "draft_model_name": "TinyLlama-1B-32k-exl2",
      "draft_rope_scale": 1.0
    }
  }'
```

For more information on what parameters can be passed, please look at the [autogenerated documentation](https://theroyallab.github.io/tabbyAPI/#operation/load_model_v1_model_load_post).

### Inline loading

> [!NOTE]
> If you want more fine-grained customization when loading, it's highly recommended to use the traditional endpoints instead. Inline loading is an alternative method that doesn't allow parameters to be passed on request.

An alternative way of switching models is called "inline loading", which hooks into the model switching logic used by frontends such as OpenWebUI. As previously stated, load requests are ephemeral, so there needs to be a way to provide per-model defaults while remaining explicit to the admin user. This is where `tabby_config.yml` comes into play.

To get started, set `inline_model_loading` to `true` under the `model` block of `config.yml`.
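
A sketch of that setting (assuming it sits directly under the `model` block):

```yml
# config.yml; sketch of enabling inline loading
model:
  inline_model_loading: true
```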

Now, to create a tabby config, let's say we have a model in our models directory called `Meta-Llama-3-8B-exl2`. Navigate into that model folder and create a file called `tabby_config.yml`.
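
For example (assuming the default `models` directory):

```bash
# Create an empty tabby_config.yml next to the model's weights
cd models/Meta-Llama-3-8B-exl2
touch tabby_config.yml
```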

> [!NOTE]
> The formatting for tabby_config.yml may change in the future for consistency with config.yml. Please keep an eye out for breaking changes.

Now, you can place any model load parameter from `/v1/model/load` into that file. Here's a simple example that changes the default `max_seq_len` to 8192 and sets a Q6 quantized cache:

```yml
max_seq_len: 8192
cache_mode: Q6
```

If you'd like to provide draft model options, you can add them under the `draft_model` key:

```yml
max_seq_len: 8192
cache_mode: Q6
draft_model:
  draft_model_name: TinyLlama-1B-32k-exl2
  draft_rope_scale: 1.0
```

To switch the currently loaded model, send a request to `/v1/completions` or `/v1/chat/completions` with the `model` parameter set to the desired folder.

Below is an example CURL request for inline loading:

```bash
curl http://localhost:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Meta-Llama-3-8B-exl2",
    ... Other parameters
  }'
```

This will unload the existing model and load the new one with the defaults specified in `tabby_config.yml`.

### Difficult to get started?

Is the API difficult? Don't want to load models with `config.yml`? That's okay! Not everyone is a power user of AI products when starting out.

For newer users, it's recommended to use a UI that allows for managing TabbyAPI via API endpoints.

To find UI projects, take a look at the [Community Projects](https://github.com/theroyallab/tabbyAPI/wiki/09.-Community-Projects) page.

The [Discord](https://discord.gg/sYQxnuD7Fj) is also a great place to ask for help. Please be nice when asking questions as all the developers are volunteers who have lives outside of TabbyAPI.
