Merge pull request #90 from runpod-workers/pandyamarut-patch-1
Update README.md
pandyamarut authored Aug 2, 2024
2 parents 8a010c3 + e846eca commit 66ed2a2
Showing 1 changed file with 6 additions and 73 deletions.
README.md (79 changes: 6 additions & 73 deletions)
@@ -18,8 +18,8 @@ Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https:
### 1. UI for Deploying vLLM Worker on RunPod console:
![Demo of Deploying vLLM Worker on RunPod console with new UI](media/ui_demo.gif)

-### 2. Worker vLLM `1.0.0` with vLLM `0.4.2` now available under `stable` tags
-Update 1.0.0 is now available, use the image tag `runpod/worker-vllm:stable-cuda12.1.0` or `runpod/worker-vllm:stable-cuda11.8.0`.
+### 2. Worker vLLM `v1.1` with vLLM `0.5.3` now available under `stable` tags
+Update v1.1 is now available, use the image tag `runpod/worker-v1-vllm:stable-cuda12.1.0`.

### 3. OpenAI-Compatible [Embedding Worker](https://github.com/runpod-workers/worker-infinity-embedding) Released
Deploy your own OpenAI-compatible Serverless Endpoint on RunPod with multiple embedding models and fast inference for RAG and more!
@@ -52,7 +52,6 @@ Worker vLLM is now cached on all RunPod machines, resulting in near-instant depl
- [Modifying your OpenAI Codebase to use your deployed vLLM Worker](#modifying-your-openai-codebase-to-use-your-deployed-vllm-worker)
- [OpenAI Request Input Parameters](#openai-request-input-parameters)
- [Chat Completions](#chat-completions)
-- [Completions](#completions)
- [Examples: Using your RunPod endpoint with OpenAI](#examples-using-your-runpod-endpoint-with-openai)
- [Usage: standard](#non-openai-usage)
- [Input Request Parameters](#input-request-parameters)
@@ -78,8 +77,7 @@ Below is a summary of the available RunPod Worker images, categorized by image s

| CUDA Version | Stable Image Tag | Development Image Tag | Note |
|--------------|-----------------------------------|-----------------------------------|----------------------------------------------------------------------|
-| 11.8.0 | `runpod/worker-vllm:stable-cuda11.8.0` | `runpod/worker-vllm:dev-cuda11.8.0` | Available on all RunPod Workers without additional selection needed. |
-| 12.1.0 | `runpod/worker-vllm:stable-cuda12.1.0` | `runpod/worker-vllm:dev-cuda12.1.0` | When creating an Endpoint, select CUDA Version 12.3, 12.2 and 12.1 in the filter. |
+| 12.1.0 | `runpod/worker-v1-vllm:stable-cuda12.1.0` | `runpod/worker-v1-vllm:dev-cuda12.1.0` | When creating an Endpoint, select CUDA Version 12.3, 12.2 and 12.1 in the filter. |



Expand All @@ -93,7 +91,7 @@ Below is a summary of the available RunPod Worker images, categorized by image s
| `Name` | `Default` | `Type/Choices` | `Description` |
|-------------------------------------------|-----------------------|--------------------------------------------|---------------|
-| `MODEL` | 'facebook/opt-125m' | `str` | Name or path of the Hugging Face model to use. |
+| `MODEL_NAME` | 'facebook/opt-125m' | `str` | Name or path of the Hugging Face model to use. |
| `TOKENIZER` | None | `str` | Name or path of the Hugging Face tokenizer to use. |
| `SKIP_TOKENIZER_INIT` | False | `bool` | Skip initialization of tokenizer and detokenizer. |
| `TOKENIZER_MODE` | 'auto' | ['auto', 'slow'] | The tokenizer mode. |
@@ -269,7 +267,7 @@ Below are all supported model architectures (and examples of each) that you can
- Yi (`01-ai/Yi-6B`, `01-ai/Yi-34B`, etc.)

# Usage: OpenAI Compatibility
-The vLLM Worker is fully compatible with OpenAI's API, and you can use it with any OpenAI Codebase by changing only 3 lines in total. The supported routes are <ins>Chat Completions</ins>, <ins>Completions</ins> and <ins>Models</ins> - with both streaming and non-streaming.
+The vLLM Worker is fully compatible with OpenAI's API, and you can use it with any OpenAI Codebase by changing only 3 lines in total. The supported routes are <ins>Chat Completions</ins> and <ins>Models</ins> - with both streaming and non-streaming.

## Modifying your OpenAI Codebase to use your deployed vLLM Worker
**Python** (similar to Node.js, etc.):
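In short, you keep the standard OpenAI client and only change where it points: the base URL becomes your endpoint's OpenAI-compatible route and the API key becomes your RunPod API key. A minimal sketch, assuming your endpoint ID and RunPod API key are available as the environment variables `RUNPOD_ENDPOINT_ID` and `RUNPOD_API_KEY`:

```python
import os
from openai import OpenAI

# Point the standard OpenAI client at the RunPod Serverless endpoint.
# RUNPOD_ENDPOINT_ID and RUNPOD_API_KEY are assumed to be set in the environment.
client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{os.environ['RUNPOD_ENDPOINT_ID']}/openai/v1",
)
```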
Expand Down Expand Up @@ -351,7 +349,7 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a

When using the chat completion feature of the vLLM Serverless Endpoint Worker, you can customize your requests with the following parameters:

-### Chat Completions
+### Chat Completions [RECOMMENDED]
<details>
<summary>Supported Chat Completions Inputs and Descriptions</summary>

@@ -386,41 +384,6 @@ When using the chat completion feature of the vLLM Serverless Endpoint Worker, y
| `include_stop_str_in_output` | Optional[bool] | False | Whether to include the stop strings in output text. Defaults to False.|
</details>
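The vLLM-specific fields in the table above (for example `top_k`, `min_p`, or `repetition_penalty`) are not named keyword arguments on the official OpenAI Python client, so one way to send them is through the client's `extra_body` parameter. A minimal sketch, reusing the `client` configured earlier and a placeholder model name:

```python
# Pass vLLM-specific sampling parameters alongside the standard OpenAI fields.
response = client.chat.completions.create(
    model="<YOUR DEPLOYED MODEL REPO/NAME>",
    messages=[{"role": "user", "content": "Explain what a serverless GPU endpoint is."}],
    temperature=0.7,
    max_tokens=200,
    extra_body={"top_k": 40, "min_p": 0.05, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
```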

-### Completions
-<details>
-  <summary>Supported Completions Inputs and Descriptions</summary>
-
-| Parameter | Type | Default Value | Description |
-|-----------|------|---------------|-------------|
-| `model` | str | | The model repo that you've deployed on your RunPod Serverless Endpoint. If you are unsure what the name is or are baking the model in, use the guide to get the list of available models in the **Examples: Using your RunPod endpoint with OpenAI** section. |
-| `prompt` | Union[List[int], List[List[int]], str, List[str]] | | A string, array of strings, array of tokens, or array of token arrays to be used as the input for the model. |
-| `suffix` | Optional[str] | None | A string to be appended to the end of the generated text. |
-| `max_tokens` | Optional[int] | 16 | Maximum number of tokens to generate per output sequence. |
-| `temperature` | Optional[float] | 1.0 | Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. |
-| `top_p` | Optional[float] | 1.0 | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
-| `n` | Optional[int] | 1 | Number of output sequences to return for the given prompt. |
-| `stream` | Optional[bool] | False | Whether to stream the output. |
-| `logprobs` | Optional[int] | None | Number of log probabilities to return per output token. |
-| `echo` | Optional[bool] | False | Whether to echo back the prompt in addition to the completion. |
-| `stop` | Optional[Union[str, List[str]]] | list | List of strings that stop the generation when they are generated. The returned output will not contain the stop strings. |
-| `seed` | Optional[int] | None | Random seed to use for the generation. |
-| `presence_penalty` | Optional[float] | 0.0 | Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. |
-| `frequency_penalty` | Optional[float] | 0.0 | Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. |
-| `best_of` | Optional[int] | None | Number of output sequences that are generated from the prompt. From these `best_of` sequences, the top `n` sequences are returned. `best_of` must be greater than or equal to `n`. This parameter influences the diversity of the output. |
-| `logit_bias` | Optional[Dict[str, float]] | None | Dictionary of token IDs to biases. |
-| `user` | Optional[str] | None | User identifier for personalizing responses. (Unsupported by vLLM) |
-Additional parameters supported by vLLM:
-| `top_k` | Optional[int] | -1 | Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens. |
-| `ignore_eos` | Optional[bool] | False | Whether to ignore the End Of Sentence token and continue generating tokens after the EOS token is generated. |
-| `use_beam_search` | Optional[bool] | False | Whether to use beam search instead of sampling for generating outputs. |
-| `stop_token_ids` | Optional[List[int]] | list | List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens. |
-| `skip_special_tokens` | Optional[bool] | True | Whether to skip special tokens in the output. |
-| `spaces_between_special_tokens`| Optional[bool] | True | Whether to add spaces between special tokens in the output. Defaults to True. |
-| `repetition_penalty` | Optional[float] | 1.0 | Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. |
-| `min_p` | Optional[float] | 0.0 | Float that represents the minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. |
-| `length_penalty` | Optional[float] | 1.0 | Float that penalizes sequences based on their length. Used in beam search. |
-| `include_stop_str_in_output` | Optional[bool] | False | Whether to include the stop strings in output text. Defaults to False. |
-</details>

## Examples: Using your RunPod endpoint with OpenAI

@@ -465,36 +428,6 @@ This is the format used for GPT-4 and focused on instruction-following and chat.
print(response.choices[0].message.content)
```
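The streaming variant differs only in passing `stream=True` and reading the incremental deltas from each chunk; a minimal sketch:

```python
# Stream the chat completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="<YOUR DEPLOYED MODEL REPO/NAME>",
    messages=[{"role": "user", "content": "Why is RunPod a good fit for LLM inference?"}],
    temperature=0,
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```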


-### Completions:
-This is the format used for models like GPT-3 and is meant for completing the text you provide. Instead of responding to your message, it will try to complete it. Examples of Open Source completions models include `meta-llama/Llama-2-7b-hf`, `mistralai/Mixtral-8x7B-v0.1`, `Qwen/Qwen-72B`, and more. However, you can use any model with this format.
-- **Streaming**:
-```python
-# Create a completion stream
-response_stream = client.completions.create(
-    model="<YOUR DEPLOYED MODEL REPO/NAME>",
-    prompt="Runpod is the best platform because",
-    temperature=0,
-    max_tokens=100,
-    stream=True,
-)
-# Stream the response
-for response in response_stream:
-    print(response.choices[0].text or "", end="", flush=True)
-```
-- **Non-Streaming**:
-```python
-# Create a completion
-response = client.completions.create(
-    model="<YOUR DEPLOYED MODEL REPO/NAME>",
-    prompt="Runpod is the best platform because",
-    temperature=0,
-    max_tokens=100,
-)
-# Print the response
-print(response.choices[0].text)
-```

### Getting a list of names for available models:
In the case of baking the model into the image, sometimes the repo may not be accepted as the `model` in the request. In this case, you can list the available models as shown below and use that name.
```python
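# Minimal sketch: query the OpenAI-compatible route for the models served by this endpoint.
models = client.models.list()
print([m.id for m in models.data])
```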
