Docs: Add model and inline loading documentation
Sorely required due to the number of questions about how inline loading works.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
kingbri1 committed Feb 25, 2025
1 parent 35fe372 commit 7368867
108 changes: 100 additions & 8 deletions docs/03.-Usage.md
## Usage

TabbyAPI's main use case is to be an API server for running ExllamaV2 models.

### API Server

Currently TabbyAPI supports clients that use the [OpenAI](https://platform.openai.com/docs/api-reference) standard and [KoboldAI](https://lite.koboldai.net/koboldcpp_api)'s API.

In addition, the generation endpoints accept expanded parameters, and there are administrative endpoints for loading, unloading, loras, sampling overrides, and more.
> If you are a developer and want to add full TabbyAPI support to your app, it's recommended to use the [autogenerated documentation](https://theroyallab.github.io/tabbyAPI).

Below is an example CURL request using the OpenAI completions endpoint:

```bash
curl http://localhost:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Meta-Llama-3-8B-exl2",
    "prompt": "Once upon a time,",
    "max_tokens": 400,
    "stream": false
  }'
```

### Authentication

Every call to a TabbyAPI endpoint requires some form of authentication. Keys have two types of permissions:
- API: Accesses non-invasive endpoints (e.g. generation, model list fetching)
- Admin: Allowed to access protected endpoints that deal with resources (e.g. loading, unloading)

In addition, when calling list endpoints, API keys will only fetch the currently loaded object while admin keys will list the entire directory. For example, calling `/v1/models` will return a list of the user-configured models directory only if an admin key is passed.
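
For reference, below is a minimal sketch of passing a key with a request. It assumes TabbyAPI reads keys from the `x-api-key` and `x-admin-key` headers; the real values are generated into `api_tokens.yml` on first startup:

```bash
# Placeholder value; substitute the admin key from your api_tokens.yml
curl http://localhost:5000/v1/models \
  -H "x-admin-key: <your admin key>"
```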

Therefore, it's recommended to keep the admin key for yourself and only share the API key with users.

If these keys get compromised, shut down your server, delete the `api_tokens.yml` file, and restart. This will generate new keys which you can share with users.

To bypass authentication checks, set `disable_auth` to `true` in `config.yml`. However, turning off authentication without a third-party solution will make your instance open to the world.
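
As a rough sketch (assuming `disable_auth` sits under the `network` block; check your own `config.yml` for the exact location):

```yml
# Sketch only; enabling this exposes the instance to anyone who can reach it
network:
  disable_auth: true
```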

### Model loading

> [!IMPORTANT]
> All loading methods require an admin key.

TabbyAPI has a set of endpoints used for model loading and unloading:
- `/v1/model/load`: Loads a model from the configured `model_dir` with the provided parameters. If the provided model name is different from the currently loaded model, the existing model will be unloaded beforehand.
- `/v1/model/unload`: Unloads all model tensors and terminates all pending generation requests. Only use this endpoint in an emergency or if you want to free VRAM but leave the server running. A minimal sketch of this call follows the list below.
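
For completeness, here is a minimal sketch of an unload call, assuming the endpoint accepts a bodyless POST and an admin key in the `x-admin-key` header:

```bash
# Placeholder value; substitute the admin key from your api_tokens.yml
curl -X POST http://localhost:5000/v1/model/unload \
  -H "x-admin-key: <your admin key>"
```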

Please note that load requests are ephemeral and `config.yml` options will not apply. If you want to apply options from your config as fallback defaults, add them in the `use_as_default` key under model options.
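
For example, a rough sketch of what that might look like in `config.yml` (the exact layout of the `model` block may differ; `use_as_default` is assumed to take a list of load parameter names):

```yml
# Sketch of the model block in config.yml.
# use_as_default is assumed to take a list of load parameters whose
# configured values should persist as fallback defaults for API loads.
model:
  model_dir: models
  max_seq_len: 8192
  cache_mode: Q8
  use_as_default: ["max_seq_len", "cache_mode"]
```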

Below is an example CURL request using the model load endpoint:

```bash
curl http://localhost:5000/v1/model/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "Meta-Llama-3-8B-exl2",
    "max_seq_len": 8192,
    "tensor_parallel": true,
    "gpu_split_auto": false,
    "gpu_split": [20, 25],
    "cache_mode": "Q8"
  }'
```

A model load request can also include draft model parameters:

```bash
curl http://localhost:5000/v1/model/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "Meta-Llama-3-8B-exl2",
    ... Other parameters
    "draft_model": {
      "draft_model_name": "TinyLlama-1B-32k-exl2",
      "draft_rope_scale": 1.0
    }
  }'
```

For more information on what parameters can be passed, please look at the [autogenerated documentation](https://theroyallab.github.io/tabbyAPI/#operation/load_model_v1_model_load_post).

### Inline loading

> [!NOTE]
> If you want more fine-grained customization when loading, it's highly recommended to use the traditional endpoints instead. Inline loading is an alternative method that doesn't allow parameters to be passed on request.

An alternative way of switching models is called "inline loading", which hooks into the model switching logic used by frontends such as OpenWebUI. As previously stated, load requests are ephemeral, so there needs to be a way to provide per-model defaults while remaining explicit to the admin user. This is where `tabby_config.yml` comes into play.

To get started, set `inline_model_loading` to `true` under the `model` block of `config.yml`.
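
A sketch of that setting (assuming it sits directly under the `model` block):

```yml
# config.yml; sketch of enabling inline loading
model:
  inline_model_loading: true
```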

Now, to create a tabby config, let's say we have a model in our models directory called `Meta-Llama-3-8B-exl2`. Navigate into that model folder and create a file called `tabby_config.yml`.
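
For example (assuming the default `models` directory):

```bash
# Create an empty tabby_config.yml next to the model's weights
cd models/Meta-Llama-3-8B-exl2
touch tabby_config.yml
```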

> [!NOTE]
> The formatting for tabby_config.yml may change in the future for consistency with config.yml. Please keep an eye out for breaking changes.

Now, you can place any model load parameter from `/v1/model/load` into that file. Here's a simple example that changes the default `max_seq_len` to 8192 and sets a Q6 quantized cache:

```yml
max_seq_len: 8192
cache_mode: Q6
```

If you'd like to provide draft model options, you can add them under the `draft_model` key:

```yml
max_seq_len: 8192
cache_mode: Q6
draft_model:
  draft_model_name: TinyLlama-1B-32k-exl2
  draft_rope_scale: 1.0
```

To switch the currently loaded model, send a request to `/v1/completions` or `/v1/chat/completions` with the `model` parameter set to the desired folder.

Below is an example CURL request for inline loading:

```bash
curl http://localhost:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Meta-Llama-3-8B-exl2",
    ... Other parameters
  }'
```

This will unload the existing model and load the new one with the defaults specified in `tabby_config.yml`.

### Difficult to get started?

Is the API difficult? Don't want to load models with `config.yml`? That's okay! Not everyone is a power user of AI products when starting out.

For newer users, it's recommended to use a UI that allows for managing TabbyAPI via API endpoints.

To find UI projects, take a look at the [Community Projects](https://github.com/theroyallab/tabbyAPI/wiki/09.-Community-Projects) page.

The [Discord](https://discord.gg/sYQxnuD7Fj) is also a great place to ask for help. Please be nice when asking questions as all the developers are volunteers who have lives outside of TabbyAPI.
