📝 overhaul of the documentation, now 4.5x bigger (better?) #144
Merged
Commits (9):
- 30da685 feat(docs): overhaul of the documentation (baptistecolle)
- c16495a wip(ci): fix ci for the auto-generated docs (baptistecolle)
- 3aebf6c docs: address PR feedback of tengomucho (baptistecolle)
- c58933c fix(docs): add closing </Tip> prevent build error (baptistecolle)
- bf1c3cc docs: add curl cmd to docs and remove useless docker section (baptistecolle)
- 43516b8 docs: fix _toctree consistency (baptistecolle)
- d77901a docs: adress the new PR review feedbacks (baptistecolle)
- a338ce7 fix(docs): fix all the broken links (baptistecolle)
- c0ca01b fix(docs): fix broken index page (baptistecolle)
@@ -135,4 +135,7 @@ dmypy.json
.vscode
.idea/

jetstream-pt-deps

# Optimum TPU artifacts
tpu-doc-build/
@@ -0,0 +1,48 @@
import os
import sys
import yaml

# Check that both files exist
examples_file = 'docs/scripts/examples_list.yml'
toctree_file = 'docs/source/_toctree.yml'

if not os.path.exists(examples_file):
    print(f"Error: {examples_file} does not exist")
    sys.exit(1)

if not os.path.exists(toctree_file):
    print(f"Error: {toctree_file} does not exist")
    sys.exit(1)

# Read the examples list
with open(examples_file, 'r') as f:
    examples = yaml.safe_load(f)

# Read the main toctree
with open(toctree_file, 'r') as f:
    toc = yaml.safe_load(f)

# Find the howto section and insert before more_examples
# Iterate through the list to find the sections with howto
for item in toc:
    if isinstance(item, dict) and 'sections' in item:
        for section in item['sections']:
            if isinstance(section, dict) and 'sections' in section:
                howto_items = section['sections']
                for i, subitem in enumerate(howto_items):
                    if subitem.get('local') == 'howto/more_examples':
                        # Insert the new examples before this position
                        for example in reversed(examples):
                            howto_items.insert(i, example)
                        break

# Write back the modified toctree
with open(toctree_file, 'w') as f:
    yaml.dump(toc, f, sort_keys=False, allow_unicode=True, default_flow_style=False)

print("Added examples to the howto section of the toctree")

# Print the updated toctree contents
with open(toctree_file, 'r') as f:
    print("\nUpdated _toctree.yml contents:")
    print(f.read())
@@ -0,0 +1,4 @@
- local: howto/gemma_tuning
  title: Gemma Fine-Tuning Example
- local: howto/llama_tuning
  title: Llama Fine-Tuning Example
docs/source/conceptual_guides/difference_between_jetstream_and_xla.mdx (23 additions, 0 deletions)
@@ -0,0 +1,23 @@
# Differences between Jetstream Pytorch and PyTorch XLA

This guide explains the difference between Jetstream Pytorch and PyTorch XLA for optimum-tpu users, as these are the two backends available in TGI.

JetStream PyTorch is a high-performance inference engine built on top of PyTorch XLA. It is optimized for throughput and memory efficiency when running Large Language Models (LLMs) on TPUs.

| Feature | Jetstream Pytorch | PyTorch XLA |
|---------|-------------------|-------------|
| Training | ❌ | ✅ |
| Serving | ✅ | ✅ |
| Performance | Higher serving performance | Standard performance |
| Flexibility | Limited to serving | Full PyTorch ecosystem |
| Use Case | Production inference | Development and training |
| Integration | Optimized for deployment | Standard PyTorch workflow |
**Notes:**
By default, optimum-tpu uses PyTorch XLA for training and Jetstream Pytorch for serving.

You can configure optimum-tpu to use either backend for serving with TGI. To use the Pytorch XLA backend in TGI, set `-e JETSTREAM_PT_DISABLE=1` in your docker run arguments.
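As a minimal sketch (the image name, model id and token below are placeholders, not values from this guide), the backend switch could look like this in a TGI launch command:

```bash
# Hypothetical TGI launch on TPU with the Jetstream Pytorch backend disabled,
# i.e. falling back to the Pytorch XLA backend.
# <optimum-tpu-tgi-image>, <model_id> and <your_hf_token> are placeholders.
docker run --privileged --shm-size 16GB --net host \
  -v ~/hf_data:/data \
  -e HF_TOKEN=<your_hf_token> \
  -e JETSTREAM_PT_DISABLE=1 \
  <optimum-tpu-tgi-image> \
  --model-id <model_id>
```

Dropping `-e JETSTREAM_PT_DISABLE=1` from the same command switches back to the default Jetstream Pytorch backend.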
You can find more information about:
- PyTorch XLA: https://pytorch.org/xla/ and https://github.com/pytorch/xla
- Jetstream Pytorch: https://github.com/AI-Hypercomputer/jetstream-pytorch
@@ -0,0 +1,36 @@
# TPU hardware support
Optimum-TPU supports and is optimized for v5e and v6e TPUs.

## TPU naming convention
The TPU naming follows this format: `<tpu_version>-<number_of_tpus>`

TPU versions:
- v5litepod (v5e)
- v6e

For example, a v5litepod-8 is a v5e TPU with 8 TPU chips.

## Memory on TPU
The HBM (High Bandwidth Memory) capacity per chip is 16GB for v5e and v5p, and 32GB for v6e. So a v5e-8 (v5litepod-8) has 16GB * 8 = 128GB of HBM memory.
## Recommended Runtime for TPU

When creating the TPU VM, use one of the following TPU VM base images for optimum-tpu:
- v2-alpha-tpuv6e (TPU v6e) (recommended)
- v2-alpha-tpuv5 (TPU v5p) (recommended)
- v2-alpha-tpuv5-lite (TPU v5e) (recommended)
- tpu-ubuntu2204-base (default)

For installation instructions, refer to our [TPU setup tutorial](../tutorials/tpu_setup). We recommend you use the *alpha* images with optimum-tpu, as optimum-tpu is tested and optimized for those.

More information at https://cloud.google.com/tpu/docs/runtimes#pytorch_and_jax
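As a quick illustration (the TPU name, project and zone below are placeholder values; pick an accelerator type and zone that are actually available to you), creating a v5e TPU VM with one of the recommended runtimes could look like:

```bash
# Hypothetical example: create an 8-chip v5e TPU VM with the recommended alpha runtime.
# Replace my-tpu, my-project and us-west4-a with your own values.
gcloud compute tpus tpu-vm create my-tpu \
  --project=my-project \
  --zone=us-west4-a \
  --accelerator-type=v5litepod-8 \
  --version=v2-alpha-tpuv5-lite
```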
# Next steps
For more information on the different TPU hardware, you can look at:
- https://cloud.google.com/tpu/docs/v6e
- https://cloud.google.com/tpu/docs/v5p
- https://cloud.google.com/tpu/docs/v5e

Pricing information can be found at https://cloud.google.com/tpu/pricing

TPU availability can be found at https://cloud.google.com/tpu/docs/regions-zones
@@ -0,0 +1,86 @@
# Contributing to Optimum TPU

We're excited that you're interested in contributing to Optimum TPU! Whether you're fixing bugs, adding new features, improving documentation, or sharing your experiences, your contributions are highly valued 😄

## Getting Started

1. [Fork](https://github.com/huggingface/optimum-tpu/fork) and clone the repository:
```bash
git clone https://github.com/YOUR_USERNAME/optimum-tpu.git
cd optimum-tpu
```

2. Install the package locally:
```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install . -f https://storage.googleapis.com/libtpu-releases/index.html
```

## Development Tools

The project includes a comprehensive Makefile with commands for various development tasks:

### Testing
```bash
make tests               # Run all the non-TGI-related tests
make tgi_test            # Run TGI tests with PyTorch/XLA
make tgi_test_jetstream  # Run TGI tests with Jetstream backend
make tgi_docker_test     # Run TGI integration tests in Docker
```

### Code Quality
```bash
make style        # Auto-fix code style issues
make style_check  # Check code style without fixing
```

### Documentation
```bash
make preview_doc  # Preview documentation locally
```

### Docker Images
```bash
make tpu-tgi      # Build TGI Docker image
make tpu-tgi-ie   # Build TGI inference endpoint image
make tpu-tgi-gcp  # Build TGI Google Cloud image
```
### TGI Development
When working on Text Generation Inference (`/text-generation-inference` folder), you might also want to build a TGI image from scratch. To do this, refer to the manual image building section of the [serving how-to guide](./howto/serving).

1. Build the standalone server:
```bash
make tgi_server
```
## Pull Request Process

1. Create a new branch:
```bash
git checkout -b your-feature-name
```

2. Make your changes

3. Run tests:
```bash
make tests
# Run more specialized tests if needed, such as make tgi_test, make tgi_test_jetstream or make tgi_docker_test
make style_check
```

4. Submit your PR with:
- Clear description of changes
- Test results
- Documentation updates if needed

## Need Help?

- Check the [documentation](https://huggingface.co/docs/optimum/tpu/overview)
- Open an issue for bugs or feature requests

## License

By contributing to Optimum TPU, you agree that your contributions will be licensed under the Apache License, Version 2.0.
@@ -0,0 +1,60 @@
# Advanced TGI Server Configuration

## Jetstream Pytorch and Pytorch XLA backends

[Jetstream Pytorch](https://github.com/AI-Hypercomputer/jetstream-pytorch) is a highly optimized Pytorch engine for serving LLMs on Cloud TPU. This engine is selected by default if the dependency is available.

We recommend using Jetstream with TGI for the best performance. If for some reason you want to use the Pytorch/XLA backend instead, you can set the `JETSTREAM_PT_DISABLE=1` environment variable.

For more information, see our discussion on the [difference between Jetstream and PyTorch XLA](../conceptual_guides/difference_between_jetstream_and_xla).
## Quantization
When using the Jetstream Pytorch engine, it is possible to enable quantization to reduce the memory footprint and increase the throughput. To enable quantization, set the `QUANTIZATION=1` environment variable. For instance, on a 2x4 TPU v5e (16GB per chip * 8 = 128GB per pod), you can serve models up to 70B parameters, such as Llama 3.3-70B. The quantization is done in `int8` on the fly as the weights are loaded. As with any quantization option, you can expect a small drop in model accuracy. Without the quantization option enabled, the model is served in bf16.
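As a minimal sketch (the image name and token are placeholders, not values from this guide; the model id is only an example of a ~70B model), enabling quantization could look like:

```bash
# Hypothetical TGI launch with on-the-fly int8 quantization enabled.
# <optimum-tpu-tgi-image> and <your_hf_token> are placeholders.
docker run --privileged --shm-size 16GB --net host \
  -v ~/hf_data:/data \
  -e HF_TOKEN=<your_hf_token> \
  -e QUANTIZATION=1 \
  <optimum-tpu-tgi-image> \
  --model-id meta-llama/Llama-3.3-70B-Instruct
```

Without `-e QUANTIZATION=1`, the same command serves the model in bf16.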
## How to solve memory requirements

If you encounter `Backend(NotEnoughMemory(2048))`, here are some solutions that could help with reducing memory usage in TGI:

**Optimum-TPU specific arguments:**
- `-e QUANTIZATION=1`: Enables quantization. This should reduce memory requirements by almost half.
- `-e MAX_BATCH_SIZE=n`: Manually reduces the batch size.

**TGI specific arguments:**
- `--max-input-length`: Maximum input sequence length
- `--max-total-tokens`: Maximum combined input and output tokens
- `--max-batch-prefill-tokens`: Maximum tokens for batch processing
- `--max-batch-total-tokens`: Maximum total tokens in a batch

To reduce memory usage, you can try smaller values for `--max-input-length`, `--max-total-tokens`, `--max-batch-prefill-tokens`, and `--max-batch-total-tokens`.

<Tip warning={true}>
Keep `max-batch-prefill-tokens ≤ max-input-length * max_batch_size`. Otherwise you will get an error, because the configuration does not make sense: if `max-batch-prefill-tokens` were bigger, a batch could never contain enough input tokens to reach it, so no request could be processed.
</Tip>
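To make the constraint above concrete, here is a sketch with hypothetical values (the image name and model id are placeholders): with `MAX_BATCH_SIZE=4` and `--max-input-length 1024`, `--max-batch-prefill-tokens` should stay at or below 4 * 1024 = 4096.

```bash
# Hypothetical, reduced-memory configuration.
# 4 * 1024 = 4096, so --max-batch-prefill-tokens 4096 respects the constraint.
docker run --privileged --shm-size 16GB --net host \
  -v ~/hf_data:/data \
  -e MAX_BATCH_SIZE=4 \
  <optimum-tpu-tgi-image> \
  --model-id <model_id> \
  --max-input-length 1024 \
  --max-total-tokens 2048 \
  --max-batch-prefill-tokens 4096 \
  --max-batch-total-tokens 8192
```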
## Sharding
Sharding is done automatically by the TGI server, so your model uses all the TPUs that are available. We do tensor parallelism, so the layers are automatically split across all available TPUs. However, the TGI router will only see one shard.

More information on tensor parallelism can be found at https://huggingface.co/docs/text-generation-inference/conceptual/tensor_parallelism.
## Understanding the configuration

Key parameters explained:

**Required parameters**
- `--shm-size 16GB`: Increases the default shared memory allocation.
- `--privileged`: Required for TPU access.
- `--net host`: Uses host network mode.

These are needed to run a TPU container, so that the container can properly access the TPU hardware.

**Optional parameters**
- `-v ~/hf_data:/data`: Volume mount for model storage; this allows you to avoid re-downloading the model weights on each startup. You can use any folder you would like, as long as it maps back to /data.
- `-e SKIP_WARMUP=1`: Disables warmup for quick testing (not recommended for production).

These are parameters used by TGI and optimum-TPU to configure the server behavior.
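Putting the flags above together, a launch command might look like the following sketch (the image name and model id are placeholders, not values from this guide):

```bash
# Hypothetical example combining the required and optional parameters above.
# --privileged, --shm-size 16GB and --net host are required for TPU access;
# -v ~/hf_data:/data caches model weights across restarts;
# -e SKIP_WARMUP=1 is for quick testing only.
docker run --privileged --shm-size 16GB --net host \
  -v ~/hf_data:/data \
  -e SKIP_WARMUP=1 \
  <optimum-tpu-tgi-image> \
  --model-id <model_id>
```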
<Tip warning={true}>
`--privileged --shm-size 16GB --net host` is required, as specified in https://github.com/pytorch/xla
</Tip>

## Next steps
Please check the [TGI docs](https://huggingface.co/docs/text-generation-inference) for more TGI server configuration options.