
Commit 6a65458

Merge pull request #696 from basetenlabs/bump-version-0.7.11
Release 0.7.11
2 parents b906180 + 3aa8c77 commit 6a65458

File tree

12 files changed: +355 -40 lines

docs/examples/performance/cached-weights.mdx (+2 -2)

```diff
@@ -1,6 +1,6 @@
 ---
-title: Load cached model weights
-description: "Deploy a model with private Hugging Face weights"
+title: Deploy Llama 2 with Caching
+description: "Enable fast cold starts for a model with private Hugging Face weights"
 ---
 
 In this example, we will cover how you can use the `hf_cache` key in your Truss's `config.yml` to automatically bundle model weights from a private Hugging Face repo.
```

docs/guides/model-cache.mdx (+119, new file)

---
title: Caching model weights
description: "Accelerate cold starts by caching your weights"
---

Truss natively supports automatic caching of model weights. This is a simple yet effective strategy for improving deployment speed and operational efficiency when it comes to cold starts and scaling beyond a single replica.

<Tip>
### What is a "cold start"?

"Cold start" refers to the time it takes a model to boot up after being idle. This can be a critical factor in serverless environments, as it significantly influences model response time, customer satisfaction, and cost.

Without caching, we would need to download the model's weights every time we scale up. Caching model weights circumvents this download: when a new instance boots up, the server automatically finds the cached weights and can proceed straight to starting the endpoint.

In practice, this reduces the cold start for large models to just a few seconds. For example, Stable Diffusion XL can take a few minutes to boot up without caching. With caching, it takes just under 10 seconds.
</Tip>

## Enabling Caching for a Model

To enable caching, add `hf_cache` to your `config.yml` with a valid `repo_id`. The `hf_cache` key has a few configurations:
- `repo_id` (required): The endpoint for your cloud bucket. Currently, we support Hugging Face and Google Cloud Storage.
- `revision`: The revision to pull, such as a branch name or commit hash. By default, it refers to `main`.
- `allow_patterns`: Only cache files that match the specified patterns. Use Unix shell-style wildcards to denote these patterns.
- `ignore_patterns`: Conversely, you can also denote file patterns to ignore, streamlining the caching process.

Here is an example of a well-written `hf_cache` for Stable Diffusion XL. Note how it only pulls the model weights it needs using `allow_patterns`.

```yaml config.yml
...
hf_cache:
  - repo_id: madebyollin/sdxl-vae-fp16-fix
    allow_patterns:
      - config.json
      - diffusion_pytorch_model.safetensors
  - repo_id: stabilityai/stable-diffusion-xl-base-1.0
    allow_patterns:
      - "*.json"
      - "*.fp16.safetensors"
      - sd_xl_base_1.0.safetensors
  - repo_id: stabilityai/stable-diffusion-xl-refiner-1.0
    allow_patterns:
      - "*.json"
      - "*.fp16.safetensors"
      - sd_xl_refiner_1.0.safetensors
...
```

Many Hugging Face repos have model weights in several formats (`.bin`, `.safetensors`, `.h5`, `.msgpack`, etc.). You usually only need one of these. To minimize cold starts, ensure that you only cache the weights you need.
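
For instance, if a repo ships the same weights in multiple formats, a minimal sketch like the following (the repo name is hypothetical) uses `ignore_patterns` to skip everything except the `safetensors` copy:

```yaml config.yml
...
hf_cache:
  # Hypothetical repo that publishes .bin, .h5, and .msgpack copies alongside .safetensors
  - repo_id: your-org/your-model
    ignore_patterns:
      - "*.bin"
      - "*.h5"
      - "*.msgpack"
...
```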

There are also some additional steps depending on the cloud bucket you want to query.

### Hugging Face 🤗
For any public Hugging Face repo, you don't need to do anything else. Adding the `hf_cache` key with an appropriate `repo_id` should be enough.

However, if you want to deploy a model from a gated repo like [Llama 2](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) to Baseten, there are a few steps you need to take:
<Steps>
<Step title="Get Hugging Face API Key">
[Grab an API key](https://huggingface.co/settings/tokens) from Hugging Face with `read` access. Make sure you have access to the model you want to serve.
</Step>
<Step title="Add it to Baseten Secrets Manager">
Paste your API key in your [secrets manager in Baseten](https://app.baseten.co/settings/secrets) under the key `hf_access_token`. You can read more about secrets [here](https://truss.baseten.co/guides/secrets).
</Step>
<Step title="Update Config">
In your Truss's `config.yml`, add the following code:

```yaml config.yml
...
secrets:
  hf_access_token: null
...
```

Make sure that the key `secrets` only shows up once in your `config.yml`.
</Step>
</Steps>

If you run into any issues, run through all the steps above again and make sure you did not misspell the name of the repo or paste an incorrect API key.

<Tip>
Weights will be cached in the default Hugging Face cache directory, `~/.cache/huggingface/hub/models--{your_model_name}/`. You can change this directory by setting the `HF_HOME` or `HUGGINGFACE_HUB_CACHE` environment variable in your `config.yml`.

[Read more here](https://huggingface.co/docs/huggingface_hub/guides/manage-cache).
</Tip>
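
As a minimal sketch of the redirect described in the tip above (assuming your Truss config supports an `environment_variables` mapping, which is not shown elsewhere in this guide), you could point the Hugging Face cache at a different directory like this:

```yaml config.yml
...
# Assumption: `environment_variables` is available in your Truss config.
environment_variables:
  HF_HOME: /app/hf_home  # hypothetical alternative cache location
...
```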

### Google Cloud Storage
Google Cloud Storage is a great alternative to Hugging Face when you have a custom model or fine-tune you want to gate, especially if you are already using GCP and care about security and compliance.

Your `hf_cache` should look something like this:

```yaml config.yml
...
hf_cache:
  - repo_id: gcs://path-to-my-bucket
...
```

For a private GCS bucket, first export your service account key. Rename it to `service_account.json` and add it to the `data` directory of your Truss.
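
If you haven't exported a key before, one way to do it is with the `gcloud` CLI; this is a sketch, and the service account email below is a placeholder you'd replace with your own:

```sh
# Creates a new JSON key for the given service account and writes it to service_account.json
gcloud iam service-accounts keys create service_account.json \
  --iam-account=my-model-reader@my-project.iam.gserviceaccount.com
```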

Your file structure should look something like this:

```
your-truss
|--model
|  └── model.py
|--data
|  └── service_account.json
```

<Warning>
If you are using version control, like git, for your Truss, make sure to add `service_account.json` to your `.gitignore` file. You don't want to accidentally expose your service account key.
</Warning>

Weights will be cached at `/app/hf_cache/{your_bucket_name}`.
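
Since the weights now live at a local path inside the container, your `model/model.py` can load them directly from that directory. A minimal sketch, assuming the bucket from the example above contains weights in standard Hugging Face format (the `transformers` usage and input shape here are illustrative, not part of this guide):

```python model/model.py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path follows the caching convention described above: /app/hf_cache/{your_bucket_name}
WEIGHTS_DIR = "/app/hf_cache/path-to-my-bucket"


class Model:
    def __init__(self, **kwargs):
        self._model = None
        self._tokenizer = None

    def load(self):
        # Read weights from the directory Truss populated at build time,
        # so nothing is downloaded on cold start.
        self._tokenizer = AutoTokenizer.from_pretrained(WEIGHTS_DIR)
        self._model = AutoModelForCausalLM.from_pretrained(WEIGHTS_DIR)

    def predict(self, model_input):
        inputs = self._tokenizer(model_input["prompt"], return_tensors="pt")
        output_ids = self._model.generate(**inputs, max_new_tokens=64)
        return {"output": self._tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```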

### Other Buckets

We're currently working on adding support for more bucket types, including AWS S3. If you have any suggestions, please [leave an issue](https://github.com/basetenlabs/truss/issues) on our GitHub repo.

docs/guides/tgi.mdx (+139, new file)

---
title: "High performance model serving with TGI"
description: "A guide to using TGI for your model"
---

[TGI](https://huggingface.co/text-generation-inference), or Text Generation Inference, is a high-performance model server for language models built by Hugging Face. In this doc, we'll cover when to use TGI, how to tune TGI for performance, and how to deploy it to Baseten.

# What is TGI?

TGI consists of 2 parts:

1. A high-performance, Rust-based server
2. A set of optimized model implementations that outperform generic implementations

To optimize GPU utilization and improve throughput & latency, it's common to implement optimizations at both the server level and the model level.

It's worth noting that TGI can only be used for a certain subset of models. As of September 2023, this includes:

- Mistral 7B
- Llama V2
- Llama
- MPT
- Code Llama
- Falcon 40B
- Falcon 7B
- FLAN-T5
- BLOOM
- Galactica
- GPT-Neox
- OPT
- SantaCoder
- Starcoder

You should use TGI if you care about model latency / throughput and want to be able to stream results. It's worth noting that tuning TGI for maximum performance requires a dedicated effort and differs for every model and for the metric you'd like to optimize.

# How to use TGI

To define a TGI Truss, we'll use the truss CLI to generate the scaffold. Run the following in your terminal:

```
truss init ./my-tgi-truss --backend TGI
```

Let's walk through what this command is doing:

- `truss init`: This CLI command initializes an empty scaffold for any of the model backends that truss supports (TGI, VLLM, Triton, and TrussServer)
- `./my-tgi-truss`: The `init` command requires a `target_directory`, which is where this scaffold will be generated
- `--backend TGI`: This tells truss that we are specifically interested in a TGI truss, which instantiates a different scaffold than the default (TrussServer)

After running that command, you'll be prompted for a `model_id`.

```
$ truss init ./my-tgi-truss --backend TGI
? model_id
```

One thing that makes TGI support in Truss a bit different from the default is that there isn't any model instantiation or inference code to write! Because we leverage TGI under the hood, TGI handles loading the model and running inference. All you need to do is pass in the model ID as it appears on Hugging Face.

After typing in your model ID, you'll be prompted for the `endpoint` you'd like to use.

```
$ truss init ./my-tgi-truss --backend TGI
? model_id facebook/opt-125M
? endpoint
```

TGI supports 2 endpoints:
- `generate`, which returns the entire generated response upon completion
- `generate_stream`, which streams the response as it's being generated

You can press the `tab` key at any of these prompts to see the available options.

Finally, you'll be asked for the name of your model.

# Deploying your TGI model

Now that we have a TGI model, let's deploy it to Baseten and see how it performs.

You'll need an API key to deploy your model. You can get one by navigating to your Baseten settings [page](https://app.baseten.co/settings/account/api_keys). To push the model to Baseten, run the following command:

```
$ truss push --publish
```

Let's walk through what this command is doing:

- `truss push`: This CLI command zips and uploads your Truss to Baseten
- `--publish`: This tells Baseten that you want a production deployment (vs. a development deployment)

After running this command, you'll be prompted to pass in your API key. Once you've done that, you can view your deployment status from the Baseten dashboard. Once the model has been deployed, you can invoke it by running the following command:

```
$ truss predict -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128, "do_sample": true}}' --published
```

Let's walk through what this command is doing:

- `truss predict`: This CLI command invokes your model
- `-d`: This tells the CLI that the next argument is the data you want to pass to your model
- `'{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128, "do_sample": true}}'`: This is the data you want to pass to your model. TGI expects a JSON object with 2 keys: `inputs` and, optionally, `parameters`. `inputs` is the prompt you want to pass to your model and `parameters` is a JSON object of generation parameters.
- `--published`: This tells the CLI that you want to invoke the production deployment (vs. a development deployment)

# Tuning your TGI server

After deploying your model, you may notice that you're not getting the performance you'd like out of TGI. If you navigate to the `target_directory` from above, you'll find a `config.yaml` file that contains a key `build`. The following is a set of arguments you can pass to the `build` key to tune TGI for maximum performance.

1. `max_input_length` (default: 1024)

This parameter represents the maximum allowed input length, expressed in number of tokens.

If you expect longer sequences as input, you might need to adjust this parameter accordingly, keeping in mind that longer sequences increase the memory required to handle the load. If you know that you'll send less than the maximum length allowed by the model, it's advantageous to set this parameter. When TGI starts up, it calculates how much memory to reserve per request, and setting this value allows TGI to allocate the correct amount of memory. In short, higher values take up more GPU memory but afford you the ability to send longer prompts.

2. `max_total_tokens` (default: 2048)

This is a crucial parameter defining the "memory budget" of a request: the total of input tokens plus `max_new_tokens` that a request can use. For example, with a value of 2048, a user can send either a prompt of 1000 tokens and ask for 1048 new tokens, or send a prompt of 1 token and ask for 2047 new tokens.

If you encounter memory limitations or need to optimize memory usage based on client requirements, adjusting this parameter can help manage memory efficiently. Higher values take up more GPU memory but allow longer prompts plus generated text. The tradeoff is that higher values will increase throughput but also increase individual request latency.

3. `max_batch_prefill_tokens` (default: 4096)

TGI splits the generation process into 2 phases: prefill and generation. The prefill phase reads in the input tokens and generates the first output token. This parameter controls how many tokens can be processed at once during the prefill phase across a batch of requests. This value should be __at least__ the value of `max_input_length`.

Similar to max input length, if your input tokens are constrained, this is worth setting as a function of (constrained input length) * (max batch size your hardware can handle). This setting is also worth defining when you want to impose stricter controls on resource usage during prefill, especially when dealing with models with a large footprint or under constrained hardware environments.

4. `max_batch_total_tokens`

Similar to `max_batch_prefill_tokens`, this represents the entire token count across a batch: the total input tokens plus the total generated tokens. In short, this value should be the top end of the number of tokens that can fit on the GPU after the model has been loaded.

This value is particularly important for maximizing GPU utilization. The tradeoff is that higher values will increase throughput but also increase individual request latency.

5. `max_waiting_tokens`

This setting defines how many tokens can be processed before waiting queries are forced into the running batch. Adjusting this value helps optimize overall latency for end users by managing how quickly waiting queries are allowed a slot in the running batch.

Tune this parameter when optimizing for end-user latency and to manage the compute allocation between prefill and decode operations efficiently. The tradeoff is that higher values will increase throughput but also increase individual request latency because of the additional time it takes to fill up the batch.

6. `sharded`

This setting defines whether to shard the model across multiple GPUs. Use this when deploying on a multi-GPU setup and you want to improve resource utilization and potentially the throughput of the server. You'd want to disable this if you're loading other models on another GPU.
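
Putting these together, a tuned `build` section might look something like the following. This is a sketch, not a drop-in config: the exact layout of the generated `config.yaml` (for example the `model_server` and `arguments` keys shown here) and the numbers themselves are assumptions you should check against the scaffold that `truss init` generated for you.

```yaml config.yaml
build:
  model_server: TGI          # assumption: set by `truss init --backend TGI`
  arguments:
    model_id: facebook/opt-125M
    endpoint: generate_stream
    # Tuning arguments discussed above; values are illustrative only.
    max_input_length: 1024
    max_total_tokens: 2048
    max_batch_prefill_tokens: 4096
    max_batch_total_tokens: 16000
    max_waiting_tokens: 20
    sharded: false
```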

docs/mint.json (+10 -8)

```diff
@@ -52,6 +52,16 @@
       "usage"
     ]
   },
+  {
+    "group": "Guides",
+    "pages": [
+      "guides/secrets",
+      "guides/base-images",
+      "guides/model-cache",
+      "guides/concurrency",
+      "guides/tgi"
+    ]
+  },
   {
     "group": "Examples",
     "pages": [
@@ -64,14 +74,6 @@
       "examples/performance/vllm-server"
     ]
   },
-  {
-    "group": "Guides",
-    "pages": [
-      "guides/secrets",
-      "guides/base-images",
-      "guides/concurrency"
-    ]
-  },
   {
     "group": "Remotes",
     "pages": [
```

pyproject.toml (+1 -1)

```diff
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "truss"
-version = "0.7.9"
+version = "0.7.11"
 description = "A seamless bridge from model development to model delivery"
 license = "MIT"
 readme = "README.md"
```

truss/cli/cli.py (+9 -19)

```diff
@@ -3,15 +3,13 @@
 import logging
 import os
 import sys
-import webbrowser
 from functools import wraps
 from pathlib import Path
 from typing import Callable, Optional
 
 import rich
 import rich_click as click
 import truss
-from InquirerPy import inquirer
 from truss.cli.create import ask_name, select_server_backend
 from truss.remote.baseten.core import (
     ModelId,
@@ -228,15 +226,10 @@ def watch(
         )
         sys.exit(1)
 
-    logs_url = remote_provider.get_remote_logs_url(model_name)  # type: ignore[attr-defined]
+    service = remote_provider.get_service(model_identifier=ModelName(model_name))
+    logs_url = remote_provider.get_remote_logs_url(service)
     rich.print(f"🪵 View logs for your deployment at {logs_url}")
-    if not logs:
-        logs = inquirer.confirm(
-            message="🗂 Open logs in a new tab?", default=True
-        ).execute()
-    if logs:
-        webbrowser.open_new_tab(logs_url)
-    remote_provider.sync_truss_to_dev_version_by_name(model_name, target_directory)  # type: ignore
+    remote_provider.sync_truss_to_dev_version_by_name(model_name, target_directory)
 
 
 def _extract_and_validate_model_identifier(
@@ -349,7 +342,9 @@ def predict(
 
     request_data = _extract_request_data(data=data, file=file)
 
-    service = remote_provider.get_baseten_service(model_identifier, published)  # type: ignore
+    service = remote_provider.get_service(
+        model_identifier=model_identifier, published=published
+    )
     result = service.predict(request_data)
     if inspect.isgenerator(result):
         for chunk in result:
@@ -414,11 +409,11 @@ def push(
     tr.spec.config.write_to_yaml_file(tr.spec.config_path, verbose=False)
 
     # TODO(Abu): This needs to be refactored to be more generic
-    _ = remote_provider.push(tr, model_name, publish=publish, trusted=trusted)  # type: ignore
+    service = remote_provider.push(tr, model_name, publish=publish, trusted=trusted)  # type: ignore
 
     click.echo(f"✨ Model {model_name} was successfully pushed ✨")
 
-    if not publish:
+    if service.is_draft:
         draft_model_text = """
 |---------------------------------------------------------------------------------------|
 | Your model has been deployed as a draft. Draft models allow you to                     |
@@ -433,13 +428,8 @@ def push(
 
     click.echo(draft_model_text)
 
-    logs_url = remote_provider.get_remote_logs_url(model_name, publish)  # type: ignore[attr-defined]
+    logs_url = remote_provider.get_remote_logs_url(service)  # type: ignore[attr-defined]
     rich.print(f"🪵 View logs for your deployment at {logs_url}")
-    should_open_logs = inquirer.confirm(
-        message="🗂 Open logs in a new tab?", default=True
-    ).execute()
-    if should_open_logs:
-        webbrowser.open_new_tab(logs_url)
 
 
 @truss_cli.command()
```
