Commit e4c5845

danielholanda, jeremyfowers, and vgodsoe authored Mar 20, 2025
Release v6.0.3 (#295)
Co-authored-by: Jeremy Fowers <80718789+jeremyfowers@users.noreply.github.com>
Co-authored-by: Victoria Godsoe <victoria.godsoe@amd.com>
1 parent d750f4a commit e4c5845

35 files changed: +736 -410 lines
 

.github/workflows/publish-to-test-pypi.yml (+2/-1)

@@ -7,7 +7,8 @@ on:
       - v*
       - RC*
   pull_request:
-    branches: ["main", "canary", "refresh"]
+    branches:
+      - '**'

 jobs:
   build-n-publish:

.github/workflows/server_installer_windows_latest.yml (+2/-1)

@@ -6,7 +6,8 @@ on:
   tags:
     - v*
   pull_request:
-    branches: ["main"]
+    branches:
+      - '**'
   workflow_dispatch:

 jobs:

.github/workflows/test_lemonade.yml (+2/-1)

@@ -7,7 +7,8 @@ on:
   push:
     branches: ["main"]
   pull_request:
-    branches: ["main"]
+    branches:
+      - '**'

 permissions:
   contents: read

.github/workflows/test_lemonade_oga_cpu.yml (+2/-1)

@@ -7,7 +7,8 @@ on:
   push:
     branches: ["main"]
   pull_request:
-    branches: ["main"]
+    branches:
+      - '**'

 permissions:
   contents: read

.github/workflows/test_quark.yml (+2/-1)

@@ -7,7 +7,8 @@ on:
   push:
     branches: ["main"]
   pull_request:
-    branches: ["main"]
+    branches:
+      - '**'

 permissions:
   contents: read

.github/workflows/test_server.yml (+2/-1)

@@ -7,7 +7,8 @@ on:
   push:
     branches: ["main"]
   pull_request:
-    branches: ["main"]
+    branches:
+      - '**'

 permissions:
   contents: read

docs/contribute.md (+1/-2)

@@ -152,9 +152,8 @@ TurnkeyML is provided as a package on PyPI, the Python Package Index, as [turnke
 The following public APIs are available for developers. The maintainers aspire to change these as infrequently as possible, and doing so will require an update to the package's major version number.

 - From the top-level `__init__.py`:
-  - `turnkeycli`: the `main()` function of the `turnkey` CLI
-  - `evaluate_files()`: the top-level API called by the CLI
   - `turnkeyml.version`: The package version number
+  - `State` class and `load_state`: structure that holds build state between Tools; function to load `State` from disk.
 - From the `common.filesystem` module:
   - `get_available_builds()`: list the builds in a turnkey cache
   - `make_cache_dir()`: create a turnkey cache
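Taken together with the `src/turnkeyml/__init__.py` change later in this commit, the surviving top-level API is small; a minimal sketch (assuming the package is installed, with the `load_state` call shown as illustrative rather than a confirmed signature):

```python
# Minimal sketch of the post-6.0.3 top-level turnkeyml API, based on the
# imports kept in src/turnkeyml/__init__.py in this commit.
import turnkeyml
from turnkeyml import State, load_state

print(turnkeyml.__version__)  # the package version number

# State holds build state between Tools; load_state() restores a State from
# disk. The argument names below are illustrative, not a confirmed signature:
# state = load_state(cache_dir="~/.cache/turnkey", build_name="my_build")
```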

docs/lemonade/getting_started.md (+1/-1)

@@ -60,7 +60,7 @@ To install `lemonade` from source code:

 ## From Lemonade_Server_Installer.exe

-The `lemonade` server is available as a standalone tool with a one-click Windows installer `.exe`. Check out the [Lemonade_Server_Installer.exe guide](lemonade_server_exe.md) for installation instructions and the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the functionality.
+The Lemonade Server is available as a standalone tool with a one-click Windows installer `.exe`. Check out the [Lemonade_Server_Installer.exe guide](lemonade_server_exe.md) for installation instructions and the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the functionality.

 # CLI Commands

docs/lemonade/lemonade_server_exe.md (+7/-103)

@@ -1,117 +1,21 @@
 # Lemonade Server Installer

-The `lemonade` server is available as a standalone tool with a one-click Windows installer `.exe`. Check out the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the functionality.
+The Lemonade Server is available as a standalone tool with a one-click Windows installer `.exe`. Check out the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the functionality.

-## GUI Installation and Usage
+## GUI Installation

 > *Note:* you may need to give your browser or OS permission to download or install the .exe.

 1. Navigate to the [latest release](https://github.com/onnx/turnkeyml/releases/latest).
 1. Scroll to the bottom and click `Lemonade_Server_Installer.exe` to download.
 1. Double-click the `Lemonade_Server_Installer.exe` and follow the instructions.

-Now that you have the server installed, you can double click the desktop shortcut to run the server process. From there, you can connect it to applications that are compatible with the OpenAI completions API.
+## Usage

-## Silent Installation and Command Line Usage
+Now that you have the server installed, you can double click the desktop shortcut to run the server process.

-Silent installation and command line usage are useful if you want to fully integrate `lemonade` server into your own application. This guide provides fully automated steps for downloading, installing, and running `lemonade` server so that your users don't have to install `lemonade` separately.
+From there, you can connect it to applications that are compatible with the OpenAI completions API. The Lemonade Server [examples folder](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade/server) has guides for how to use Lemonade Server with a collection of applications that we have tested.

-Definitions:
-- "Silent installation" refers to an automatic command for installing `lemonade` server without running any GUI or prompting the user for any questions. It does assume that the end-user fully accepts the license terms, so be sure that your own application makes this clear to the user.
-- Command line usage allows the server process to be launched programmatically, so that your application can manage starting and stopping the server process on your user's behalf.
+## Developing with Lemonade Server

-### Download
-
-Follow these instructions to download a copy of `Lemonade_Server_Installer.exe`.
-
-#### cURL Download
-
-In a `bash` terminal, such as `git bash`:
-
-Download the latest version:
-
-```bash
-curl -L -o ".\Lemonade_Server_Installer.exe" https://github.com/onnx/turnkeyml/releases/latest/download/Lemonade_Server_Installer.exe
-```
-
-Download a specific version:
-
-```bash
-curl -L -o ".\Lemonade_Server_Installer.exe" https://github.com/onnx/turnkeyml/releases/download/v6.0.0/Lemonade_Server_Installer.exe
-```
-
-#### PowerShell Download
-
-In a powershell terminal:
-
-Download the latest version:
-
-```powershell
-Invoke-WebRequest -Uri "https://github.com/onnx/turnkeyml/releases/latest/download/Lemonade_Server_Installer.exe" -OutFile "Lemonade_Server_Installer.exe"
-```
-
-Download a specific version:
-
-```powershell
-Invoke-WebRequest -Uri "https://github.com/onnx/turnkeyml/releases/download/v6.0.0/Lemonade_Server_Installer.exe" -OutFile "Lemonade_Server_Installer.exe"
-```
-
-### Silent Installation
-
-Silent installation runs `Lemonade_Server_Installer.exe` without a GUI and automatically accepts all prompts.
-
-In a `cmd.exe` terminal:
-
-Install *with* Ryzen AI hybrid support:
-
-```bash
-Lemonade_Server_Installer.exe /S /Extras=hybrid
-```
-
-Install *without* Ryzen AI hybrid support:
-
-```bash
-Lemonade_Server_Installer.exe /S
-```
-
-The install directory can also be changed from the default by using `/D` as the last argument.
-
-For example:
-
-```bash
-Lemonade_Server_Installer.exe /S /Extras=hybrid /D=C:\a\new\path`
-```
-
-### Command Line Invocation
-
-Command line invocation starts the `lemonade` server process so that your application can connect to it via REST API endpoints.
-
-#### Foreground Process
-
-These steps will open lemonade server in a terminal window that is visible to users. The user can exit the server by closing the window.
-
-In a `cmd.exe` terminal:
-
-```bash
-conda run --no-capture-output -p INSTALL_DIR\lemonade_server\lemon_env lemonade serve
-```
-
-Where `INSTALL_DIR` is the installation path of `lemonade_server`.
-
-For example, if you used the default installation directory and your username is USERNAME:
-
-```bash
-C:\Windows\System32\cmd.exe /C conda run --no-capture-output -p C:\Users\USERNAME\AppData\Local\lemonade_server\lemon_env lemonade serve
-```
-
-#### Background Process
-
-This command will open lemonade server without opening a window. Your application needs to manage terminating the process and any child processes it creates.
-
-In a powershell terminal:
-
-```powershell
-$serverProcess = Start-Process -FilePath "C:\Windows\System32\cmd.exe" -ArgumentList "/C conda run --no-capture-output -p INSTALL_DIR\lemonade_server\lemon_env lemonade serve" -RedirectStandardOutput lemonade_out.txt -RedirectStandardError lemonade_err.txt -PassThru -NoNewWindow
-```
-
-Where `INSTALL_DIR` is the installation path of `lemonade_server`.
+Interested in integrating Lemonade Server into an application you are developing? Check out the [Lemonade Server integration guide](server_integration.md) to learn more.
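As a sketch of what "compatible with the OpenAI completions API" looks like from an application's side, any OpenAI-style client can talk to the server. This assumes the server is running on its default port (8000) and that the `openai` Python package (v1+) is installed; the base path and placeholder API key mirror the Continue example added elsewhere in this commit:

```python
# Hedged sketch: connect an OpenAI-compatible client to a running Lemonade Server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/api/v0",  # the server's OpenAI-compatible base path
    api_key="-",  # no real key is needed; the field just has to be non-empty
)

completion = client.chat.completions.create(
    model="Qwen-1.5-7B-Chat-Hybrid",  # any model reported by GET /api/v0/models
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```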

docs/lemonade/mmlu_accuracy.md (+1/-1)

@@ -99,4 +99,4 @@ Use the syntax provided in the table to run that test subject with the `accuracy
 | Sociology | Culture | sociology |
 | US Foreign Policy | Politics | us_foreign_policy |
 | Virology | Health | virology |
-| World Religions | Philosophy | world_religions |
+| World Religions | Philosophy | world_religions |

docs/lemonade/server_integration.md (+131, new file)

@@ -0,0 +1,131 @@
+# Integrating with Lemonade Server
+
+This guide provides instructions on how to integrate Lemonade Server into your application.
+
+There are two main ways in which Lemonade Server might integrate into apps:
+* User-Managed Server: User is responsible for installing and managing Lemonade Server.
+* App-Managed Server: App is responsible for installing and managing Lemonade Server on behalf of the user.
+
+The first part of this guide contains instructions that are common to both integration approaches. The second part provides advanced instructions only needed for app-managed server integrations.
+
+## General Instructions
+
+### Identifying Compatible Devices
+
+AMD Ryzen™ AI `Hybrid` models are available on Windows 11 on all AMD Ryzen™ AI 300 Series Processors. To programmatically identify supported devices, we recommend using a regular expression that checks if the CPU name contains "Ryzen AI" and a 3-digit number starting with 3, as shown below.
+
+```
+Ryzen AI.*\b3\d{2}\b
+```
+
+Explanation:
+- `Ryzen AI`: Matches the literal phrase "Ryzen AI".
+- `.*`: Allows any characters (including spaces) to appear after "Ryzen AI".
+- `\b3\d{2}\b`: Matches a three-digit number starting with 3, ensuring it's a standalone number.
+
+There are several ways to check the CPU name on a Windows computer. A reliable way of doing so is through cmd's `reg query` command, as shown below.
+
+```
+reg query "HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor\0" /v ProcessorNameString
+```
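Putting the registry query and the regular expression together, a hedged Python sketch of the same check (Windows-only, standard library):

```python
# Hedged sketch: detect a Ryzen AI 300-series processor by reading the CPU name
# from the registry and applying the regular expression recommended above.
import re
import winreg  # Windows-only standard-library module

def is_ryzen_ai_300_series() -> bool:
    key = winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE,
        r"HARDWARE\DESCRIPTION\System\CentralProcessor\0",
    )
    cpu_name, _ = winreg.QueryValueEx(key, "ProcessorNameString")
    return re.search(r"Ryzen AI.*\b3\d{2}\b", cpu_name) is not None
```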
+### Downloading Server Installer
+
+The recommended way of directing users to the server installer is pointing users to our releases page at [`https://github.com/onnx/turnkeyml/releases`](https://github.com/onnx/turnkeyml/releases). Alternatively, you may also provide the direct path to the installer itself or download the installer programmatically, as shown below:
+
+Latest version:
+
+```bash
+https://github.com/onnx/turnkeyml/releases/latest/download/Lemonade_Server_Installer.exe
+```
+
+Specific version:
+
+```bash
+https://github.com/onnx/turnkeyml/releases/download/v6.0.0/Lemonade_Server_Installer.exe
+```
+
+Please note that the Server Installer is only available on Windows. Apps that integrate with our server on a Linux machine must install Lemonade from source as described [here](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/getting_started.md#from-source-code).
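For the programmatic route, a minimal standard-library sketch that fetches the "latest version" URL above (no extra dependencies assumed):

```python
# Hedged sketch: download the installer programmatically.
import urllib.request

URL = "https://github.com/onnx/turnkeyml/releases/latest/download/Lemonade_Server_Installer.exe"
urllib.request.urlretrieve(URL, "Lemonade_Server_Installer.exe")
```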
+## Stand-Alone Server Integration
+
+Some apps might prefer to be responsible for installing and managing Lemonade Server on behalf of the user. This part of the guide includes steps for installing and running Lemonade Server so that your users don't have to install Lemonade Server separately.
+
+Definitions:
+- "Silent installation" refers to an automatic command for installing Lemonade Server without running any GUI or prompting the user for any questions. It does assume that the end-user fully accepts the license terms, so be sure that your own application makes this clear to the user.
+- Command line usage allows the server process to be launched programmatically, so that your application can manage starting and stopping the server process on your user's behalf.
+
+### Silent Installation
+
+Silent installation runs `Lemonade_Server_Installer.exe` without a GUI and automatically accepts all prompts.
+
+In a `cmd.exe` terminal:
+
+Install *with* Ryzen AI hybrid support:
+
+```bash
+Lemonade_Server_Installer.exe /S /Extras=hybrid
+```
+
+Install *without* Ryzen AI hybrid support:
+
+```bash
+Lemonade_Server_Installer.exe /S
+```
+
+The install directory can also be changed from the default by using `/D` as the last argument.
+
+For example:
+
+```bash
+Lemonade_Server_Installer.exe /S /Extras=hybrid /D=C:\a\new\path
+```
+
+Only `Qwen2.5-0.5B-Instruct-CPU` is installed by default in silent mode. If you wish to select additional models to download in silent mode, you may use the `/Models` argument.
+
+```bash
+Lemonade_Server_Installer.exe /S /Extras=hybrid /Models="Qwen2.5-0.5B-Instruct-CPU Llama-3.2-1B-Instruct-Hybrid"
+```
+
+The available models are the following:
+* `Qwen2.5-0.5B-Instruct-CPU`
+* `Llama-3.2-1B-Instruct-Hybrid`
+* `Llama-3.2-3B-Instruct-Hybrid`
+* `Phi-3-Mini-Instruct-Hybrid`
+* `Qwen-1.5-7B-Chat-Hybrid`
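If your app drives the installer itself, the silent flags above can be invoked from code. A hedged sketch, assuming the `.exe` has already been downloaded into the working directory:

```python
# Hedged sketch: silent install with hybrid support and two models preselected.
import subprocess

subprocess.run(
    [
        "Lemonade_Server_Installer.exe",
        "/S",
        "/Extras=hybrid",
        "/Models=Qwen2.5-0.5B-Instruct-CPU Llama-3.2-1B-Instruct-Hybrid",
    ],
    check=True,
)
```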
+### Command Line Invocation
+
+Command line invocation starts the Lemonade Server process so that your application can connect to it via REST API endpoints.
+
+#### Foreground Process
+
+These steps will open the Lemonade Server in a terminal window that is visible to users. The user can exit the server by closing the window.
+
+In a `cmd.exe` terminal:
+
+```bash
+conda run --no-capture-output -p INSTALL_DIR\lemonade_server\lemon_env lemonade serve
+```
+
+Where `INSTALL_DIR` is the installation path of `lemonade_server`.
+
+For example, if you used the default installation directory and your username is USERNAME:
+
+```bash
+C:\Windows\System32\cmd.exe /C conda run --no-capture-output -p C:\Users\USERNAME\AppData\Local\lemonade_server\lemon_env lemonade serve
+```
+
+#### Background Process
+
+This command will open the Lemonade Server without opening a window. Your application needs to manage terminating the process and any child processes it creates.
+
+In a powershell terminal:
+
+```powershell
+$serverProcess = Start-Process -FilePath "C:\Windows\System32\cmd.exe" -ArgumentList "/C conda run --no-capture-output -p INSTALL_DIR\lemonade_server\lemon_env lemonade serve" -RedirectStandardOutput lemonade_out.txt -RedirectStandardError lemonade_err.txt -PassThru -NoNewWindow
+```
+
+Where `INSTALL_DIR` is the installation path of `lemonade_server`.
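The same launch/stop lifecycle can also be managed from Python instead of PowerShell; a hedged sketch, with `INSTALL_DIR` remaining a placeholder for the installation path:

```python
# Hedged sketch: start the server as a background process and stop it later.
import subprocess

server = subprocess.Popen(
    r"conda run --no-capture-output -p INSTALL_DIR\lemonade_server\lemon_env lemonade serve",
    stdout=open("lemonade_out.txt", "w"),
    stderr=open("lemonade_err.txt", "w"),
    shell=True,
)

# ... interact with the REST API while the server runs ...

server.terminate()  # your app must also clean up any child processes it spawned
```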

docs/lemonade/server_spec.md (+2/-2)

@@ -9,7 +9,7 @@ We are also actively investigating and developing [additional endpoints](#additi
 ### OpenAI-Compatible Endpoints
 - POST `/api/v0/chat/completions` - Chat Completions (messages -> completion)
 - POST `/api/v0/completions` - Text Completions (prompt -> completion)
-- GET `/api/v0/models` - List available models
+- GET `/api/v0/models` - List models available locally

 ### Additional Endpoints

@@ -165,7 +165,7 @@ The following format is used for both streaming and non-streaming responses:

 ### `GET /api/v0/models` <sub>![Status](https://img.shields.io/badge/status-fully_available-green)</sub>

-Returns a list of key models available on the server in an OpenAI-compatible format. This list is curated based on what works best for Ryzen AI Hybrid. Additional models can be loaded via the `/api/v0/load` endpoint by specifying the Hugging Face checkpoint.
+Returns a list of key models available on the server in an OpenAI-compatible format. This list is curated based on what works best for Ryzen AI Hybrid. Only models available locally are shown.

 #### Parameters
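A hedged sketch of calling the updated endpoint (assumes a server running on the default port 8000 and the `requests` package installed; the response shape follows the `models()` handler in `src/lemonade/tools/serve.py` below):

```python
# Hedged sketch: list the locally available models from a running server.
import requests

resp = requests.get("http://localhost:8000/api/v0/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # e.g. "Qwen2.5-0.5B-Instruct-CPU"
```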

examples/lemonade/README.md (+9/-1)

@@ -1,6 +1,14 @@
 # Lemonade Examples

-This folder contains examples of how to use `lemonade` via the high-level APIs. These APIs make it easy to load a model, generate responses, and also show how to stream those responses.
+This folder contains examples of how to deploy `lemonade` into applications.
+
+## Server Examples
+
+The `server/` folder contains examples of how to use Lemonade Server with existing applications that support server interfaces. Learn more in `server/README.md`.
+
+## API Examples
+
+This folder has examples of using the Lemonade API to integrate LLMs into Python applications. These APIs make it easy to load a model, generate responses, and also show how to stream those responses.

 The `demos/` folder also contains some higher-level application demos of the APIs. Learn more in `demos/README.md`.

examples/lemonade/server/README.md (+8, new file)

@@ -0,0 +1,8 @@
+# Lemonade Server Examples
+
+The guides in this folder help you connect Lemonade Server to applications.
+
+| App | Guide |
+|-----|-------|
+| [Open WebUI](https://github.com/open-webui/open-webui) | [How to chat with lemonade LLMs in Open WebUI](https://ryzenai.docs.amd.com/en/latest/llm/server_interface.html#open-webui-demo) |
+| [Continue](https://www.continue.dev/) | [How to use lemonade LLMs as a coding assistant in Continue](continue.md) |

examples/lemonade/server/continue.md (+57, new file)

@@ -0,0 +1,57 @@
+# Continue Coding Assistant
+
+## Overview
+
+[Continue](https://www.continue.dev/) is a coding assistant that lives inside of a VS Code extension. It supports chatting with your codebase, making edits, and a lot more.
+
+## Expectations
+
+We have found that the `Qwen-1.5-7B-Chat-Hybrid` model is the best Hybrid model available for coding. It is good at chatting with a few files at a time in your codebase to learn more about them. It can also make simple code editing suggestions pertaining to a few lines of code at a time.
+
+However, we do not recommend using this model for analyzing large codebases at once or making large or complex file edits.
+
+## Setup
+
+### Prerequisites
+
+1. Install Lemonade Server using the [installer .exe](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/lemonade_server_exe.md#lemonade-server-installer).
+
+### Install Continue
+
+> Note: Continue provides its own instructions [here](https://marketplace.visualstudio.com/items?itemName=Continue.continue).
+
+1. Open the Extensions tab in the VS Code Activity Bar.
+1. Search "Continue - Codestral, Claude, and more" in the Extensions Marketplace search bar.
+1. Select the Continue extension and click install.
+
+This will add a Continue tab to your VS Code Activity Bar.
+
+### Add Lemonade Server to Continue
+
+> Note: The following instructions are based on instructions from Continue found [here](https://docs.continue.dev/customize/model-providers/openai#openai-compatible-servers--apis).
+
+1. Open the Continue tab in your VS Code Activity Bar.
+1. Click the gear icon at the top to open Settings.
+1. Under "Configuration", click "Open Config File".
+1. Replace the "models" key in the `config.json` with the following and save:
+
+```json
+"models": [
+  {
+    "title": "Lemonade",
+    "provider": "openai",
+    "model": "Qwen-1.5-7B-Chat-Hybrid",
+    "apiKey": "-",
+    "apiBase": "http://localhost:8000/api/v0"
+  }
+],
+```
+
+## Usage
+
+> Note: See the Continue [user guide](https://docs.continue.dev/) to learn about all of its features.
+
+To try out Continue:
+- Open the Continue tab in your VS Code Activity Bar, and in the "Ask anything" box, type a question about your code. Use the `@` symbol to specify a file or two.
+  - Example: "What's the fastest way to install lemonade in @getting_started.md?"
+- Open a file, select some code, and push Ctrl+I to start a chat about editing that code.

examples/readme.md (+1/-1)

@@ -3,5 +3,5 @@
 This directory contains examples to help you learn how to use the tools. The examples are split up into these sub-directories:
 1. `examples/lemonade`: scripts that demonstrate the `lemonade` CLI for LLMs.
 1. `examples/turnkey/cli`: a tutorial series for the `turnkey` CLI. This is the recommended starting point.
-1. `examples/turnkey/api`: scripts that demonstrate how to use the `turnkey.evaluate_files()` API.
+1. `examples/turnkey/api`: scripts that demonstrate how to use the `turnkey.files_api.evaluate_files()` API.

installer/Installer.nsi (+123/-3)

@@ -236,6 +236,54 @@ Section "Install Ryzen AI Hybrid Execution" HybridSec
   end:
 SectionEnd

+SubSection /e "Selected Models" ModelsSec
+  Section /o "Qwen2.5-0.5B-Instruct-CPU" Qwen05Sec
+    SectionIn 1
+    AddSize 999604 ;
+    StrCpy $9 "$9Qwen2.5-0.5B-Instruct-CPU "
+  SectionEnd
+
+  Section "Llama-3.2-1B-Instruct-Hybrid" Llama1BSec
+    SectionIn 1
+    AddSize 1884397 ;
+    StrCpy $9 "$9Llama-3.2-1B-Instruct-Hybrid "
+  SectionEnd
+
+  Section "Llama-3.2-3B-Instruct-Hybrid" Llama3BSec
+    SectionIn 1
+    AddSize 4268402 ;
+    StrCpy $9 "$9Llama-3.2-3B-Instruct-Hybrid "
+  SectionEnd
+
+  Section /o "Phi-3-Mini-Instruct-Hybrid" PhiSec
+    SectionIn 1
+    AddSize 4185551 ;
+    StrCpy $9 "$9Phi-3-Mini-Instruct-Hybrid "
+  SectionEnd
+
+  Section /o "Qwen-1.5-7B-Chat-Hybrid" Qwen7BSec
+    SectionIn 1
+    AddSize 8835894 ;
+    StrCpy $9 "$9Qwen-1.5-7B-Chat-Hybrid "
+  SectionEnd
+
+  Section "-Download Models" DownloadModels
+    ${If} ${Silent}
+      ${GetParameters} $CMDLINE
+      ${GetOptions} $CMDLINE "/Models=" $R0
+      ${If} $R0 != ""
+        nsExec::ExecToLog 'conda run --no-capture-output -p $INSTDIR\$LEMONADE_CONDA_ENV lemonade-install --models $R0'
+      ${Else}
+        ; Otherwise, only the default CPU model will be installed
+        nsExec::ExecToLog 'conda run --no-capture-output -p $INSTDIR\$LEMONADE_CONDA_ENV lemonade-install --models Qwen2.5-0.5B-Instruct-CPU'
+      ${EndIf}
+    ${Else}
+      nsExec::ExecToLog 'conda run --no-capture-output -p $INSTDIR\$LEMONADE_CONDA_ENV lemonade-install --models $9'
+    ${EndIf}
+  SectionEnd
+SubSectionEnd
+
 Section "-Add Desktop Shortcut" ShortcutSec
   ; Create a desktop shortcut that passes the conda environment name as a parameter
   CreateShortcut "$DESKTOP\lemonade-server.lnk" "$INSTDIR\run_server.bat" "$LEMONADE_CONDA_ENV" "$INSTDIR\img\favicon.ico"

@@ -259,12 +307,68 @@ FunctionEnd
 !define MUI_FINISHPAGE_RUN_TEXT "Run Lemonade Server"

 Function .onSelChange
+  ; Check hybrid selection status
   StrCpy $HYBRID_SELECTED "false"
   SectionGetFlags ${HybridSec} $0
   IntOp $0 $0 & ${SF_SELECTED}
-  StrCmp $0 ${SF_SELECTED} 0 +2
+  StrCmp $0 ${SF_SELECTED} 0 hybrid_disabled
   StrCpy $HYBRID_SELECTED "true"
-  ;MessageBox MB_OK "Component 2 is selected"
+
+  ; If hybrid is enabled, check if at least one hybrid model is selected
+  SectionGetFlags ${Llama1BSec} $1
+  IntOp $1 $1 & ${SF_SELECTED}
+  ${If} $1 == ${SF_SELECTED}
+    Goto end
+  ${EndIf}
+
+  SectionGetFlags ${Llama3BSec} $1
+  IntOp $1 $1 & ${SF_SELECTED}
+  ${If} $1 == ${SF_SELECTED}
+    Goto end
+  ${EndIf}
+
+  SectionGetFlags ${PhiSec} $1
+  IntOp $1 $1 & ${SF_SELECTED}
+  ${If} $1 == ${SF_SELECTED}
+    Goto end
+  ${EndIf}
+
+  SectionGetFlags ${Qwen7BSec} $1
+  IntOp $1 $1 & ${SF_SELECTED}
+  ${If} $1 == ${SF_SELECTED}
+    Goto end
+  ${EndIf}
+
+  ; If no hybrid model is selected, select Llama-1B by default
+  SectionGetFlags ${Llama1BSec} $1
+  IntOp $1 $1 | ${SF_SELECTED}
+  SectionSetFlags ${Llama1BSec} $1
+  MessageBox MB_OK "At least one hybrid model must be selected when hybrid execution is enabled. Llama-3.2-1B-Instruct-Hybrid has been automatically selected."
+  Goto end
+
+  hybrid_disabled:
+  ; When hybrid is disabled, select Qwen2.5-0.5B-Instruct-CPU and disable all other hybrid model selections
+  SectionGetFlags ${Qwen05Sec} $1
+  IntOp $1 $1 | ${SF_SELECTED}
+  SectionSetFlags ${Qwen05Sec} $1
+
+  SectionGetFlags ${Llama1BSec} $1
+  IntOp $1 $1 & ${SECTION_OFF}
+  SectionSetFlags ${Llama1BSec} $1
+
+  SectionGetFlags ${Llama3BSec} $1
+  IntOp $1 $1 & ${SECTION_OFF}
+  SectionSetFlags ${Llama3BSec} $1
+
+  SectionGetFlags ${PhiSec} $1
+  IntOp $1 $1 & ${SECTION_OFF}
+  SectionSetFlags ${PhiSec} $1
+
+  SectionGetFlags ${Qwen7BSec} $1
+  IntOp $1 $1 & ${SECTION_OFF}
+  SectionSetFlags ${Qwen7BSec} $1
+
+  end:
 FunctionEnd

 Function SkipLicense

@@ -276,6 +380,7 @@ FunctionEnd
 ; MUI Settings
 !insertmacro MUI_PAGE_WELCOME
+!define MUI_COMPONENTSPAGE_SMALLDESC
 !insertmacro MUI_PAGE_COMPONENTS

 !define MUI_PAGE_CUSTOMFUNCTION_PRE SkipLicense

@@ -307,18 +412,33 @@ LangString MUI_BUTTONTEXT_FINISH "${LANG_ENGLISH}" "Finish"
 LangString MUI_TEXT_LICENSE_TITLE ${LANG_ENGLISH} "AMD License Agreement"
 LangString MUI_TEXT_LICENSE_SUBTITLE ${LANG_ENGLISH} "Please review the license terms before installing AMD Ryzen AI Hybrid Execution Mode."
 LangString DESC_SEC01 ${LANG_ENGLISH} "The minimum set of dependencies for a lemonade server that runs LLMs on CPU."
-LangString DESC_HybridSec ${LANG_ENGLISH} "Add support for running LLMs on Ryzen AI hybrid execution mode, which uses both the NPU and iGPU for improved performance. Only available on Ryzen AI 300-series processors."
+LangString DESC_HybridSec ${LANG_ENGLISH} "Add support for running LLMs on Ryzen AI hybrid execution mode. Only available on Ryzen AI 300-series processors."
+LangString DESC_ModelsSec ${LANG_ENGLISH} "Select which models to install"
+LangString DESC_Qwen05Sec ${LANG_ENGLISH} "Small CPU-only Qwen model"
+LangString DESC_Llama1BSec ${LANG_ENGLISH} "1B parameter Llama model with hybrid execution"
+LangString DESC_Llama3BSec ${LANG_ENGLISH} "3B parameter Llama model with hybrid execution"
+LangString DESC_PhiSec ${LANG_ENGLISH} "Phi-3 Mini model with hybrid execution"
+LangString DESC_Qwen7BSec ${LANG_ENGLISH} "7B parameter Qwen model with hybrid execution"

 ; Insert the description macros
 !insertmacro MUI_FUNCTION_DESCRIPTION_BEGIN
 !insertmacro MUI_DESCRIPTION_TEXT ${SEC01} $(DESC_SEC01)
 !insertmacro MUI_DESCRIPTION_TEXT ${HybridSec} $(DESC_HybridSec)
+!insertmacro MUI_DESCRIPTION_TEXT ${ModelsSec} $(DESC_ModelsSec)
+!insertmacro MUI_DESCRIPTION_TEXT ${Qwen05Sec} $(DESC_Qwen05Sec)
+!insertmacro MUI_DESCRIPTION_TEXT ${Llama1BSec} $(DESC_Llama1BSec)
+!insertmacro MUI_DESCRIPTION_TEXT ${Llama3BSec} $(DESC_Llama3BSec)
+!insertmacro MUI_DESCRIPTION_TEXT ${PhiSec} $(DESC_PhiSec)
+!insertmacro MUI_DESCRIPTION_TEXT ${Qwen7BSec} $(DESC_Qwen7BSec)
 !insertmacro MUI_FUNCTION_DESCRIPTION_END

 Function .onInit
   StrCpy $LEMONADE_SERVER_STRING "Lemonade Server"
   StrCpy $LEMONADE_CONDA_ENV "lemon_env"
   StrCpy $HYBRID_SELECTED "true"
+
+  ; Create a variable to store selected models
+  StrCpy $9 "" ; $9 will hold our list of selected models

   ; Set the install directory, allowing /D override from CLI install
   ${If} $InstDir != ""

setup.py (+1/-1)

@@ -105,7 +105,7 @@
     classifiers=[],
     entry_points={
         "console_scripts": [
-            "turnkey=turnkeyml:turnkeycli",
+            "turnkey=turnkeyml.cli.cli:main",
             "turnkey-llm=lemonade:lemonadecli",
             "lemonade=lemonade:lemonadecli",
             "lemonade-install=lemonade_install:installcli",

src/lemonade/cli.py (+18/-23)

@@ -1,14 +1,14 @@
 import os
+from turnkeyml import __version__ as version_number
 from turnkeyml.tools import FirstTool, NiceHelpFormatter
 import turnkeyml.common.filesystem as fs
-import turnkeyml.cli.cli as cli
+import turnkeyml.common.cli_helpers as cli
 from turnkeyml.sequence import Sequence
 from turnkeyml.tools.management_tools import Cache, Version, SystemInfo
 from turnkeyml.state import State

 from lemonade.tools.huggingface_load import (
     HuggingfaceLoad,
-    AdaptHuggingface,
 )

 from lemonade.tools.huggingface_bench import HuggingfaceBench

@@ -38,7 +38,6 @@ def main():
         AccuracyHumaneval,
         AccuracyPerplexity,
         LLMPrompt,
-        AdaptHuggingface,
         HuggingfaceBench,
         OgaBench,
         QuarkQuantize,

@@ -62,49 +61,46 @@ def main():

     # Define the argument parser
     parser = cli.CustomArgumentParser(
-        description="Turnkey analysis and benchmarking of GenAI models. "
-        "This utility is a toolchain. To use it, provide a list of tools and "
-        "their arguments.",
+        description=f"""Tools for evaluating and deploying LLMs (v{version_number}).
+
+Read this to learn the command syntax:
+https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/getting_started.md""",
         formatter_class=NiceHelpFormatter,
     )

     parser.add_argument(
         "-i",
         "--input",
-        help="The input that will be evaluated by the tool sequence "
-        "(e.g., huggingface checkpoints)",
+        help="The input that will be evaluated by the starting tool "
+        "(e.g., huggingface checkpoint)",
     )

     parser.add_argument(
         "-d",
         "--cache-dir",
-        help="Cache directory where the results of each tool will "
-        f"be stored (defaults to {cache.DEFAULT_CACHE_DIR})",
+        help="Cache directory where tool results are "
+        f"stored (default: {cache.DEFAULT_CACHE_DIR})",
         required=False,
         default=cache.DEFAULT_CACHE_DIR,
     )

-    parser.add_argument(
-        "--lean-cache",
-        dest="lean_cache",
-        help="Delete all build artifacts (e.g., .onnx files) when the command completes",
-        action="store_true",
-    )
-
+    memory_tracking_default_interval = 0.25
     parser.add_argument(
         "-m",
         "--memory",
         nargs="?",
         metavar="TRACK_INTERVAL",
         type=float,
         default=None,
-        const=0.25,
-        help="Track physical memory usage during the build and generate a plot when the "
-        "command completes. Optionally, specify the tracking interval (sec), "
-        "defaults to 0.25 sec.",
+        const=memory_tracking_default_interval,
+        help="Track memory usage and plot the results. "
+        "Optionally, set the tracking interval in seconds "
+        f"(default: {memory_tracking_default_interval})",
     )

-    global_args, tool_instances, evaluation_tools = cli.parse_tools(parser, tools)
+    global_args, tool_instances, evaluation_tools = cli.parse_tools(
+        parser, tools, cli_name="lemonade"
+    )

     if len(evaluation_tools) > 0:
         if not issubclass(evaluation_tools[0], FirstTool):

@@ -128,7 +124,6 @@ def main():
         )
         sequence.launch(
             state,
-            lean_cache=global_args["lean_cache"],
             track_memory_interval=global_args["memory"],
         )
     else:

src/lemonade/tools/huggingface_bench.py (+1/-1)

@@ -122,7 +122,7 @@ def parser(parser: argparse.ArgumentParser = None, add_help: bool = True):
         # Allow inherited classes to initialize and pass in a parser, add parameters to it if so
         if parser is None:
             parser = __class__.helpful_parser(
-                short_description="Benchmark a torch.nn.Module LLM",
+                short_description="Benchmark a huggingface-style PyTorch LLM",
                 add_help=add_help,
             )

src/lemonade/tools/huggingface_load.py (+3/-78)

@@ -6,8 +6,8 @@
 from huggingface_hub import model_info
 from turnkeyml.state import State
 import turnkeyml.common.status as status
-from turnkeyml.tools import Tool, FirstTool
-from lemonade.tools.adapter import ModelAdapter, TokenizerAdapter
+from turnkeyml.tools import FirstTool
+from lemonade.tools.adapter import TokenizerAdapter
 from lemonade.cache import Keys

 # Command line interfaces for tools will use string inputs for data

@@ -110,7 +110,7 @@ def __init__(self):
     @staticmethod
     def parser(add_help: bool = True) -> argparse.ArgumentParser:
         parser = __class__.helpful_parser(
-            short_description="Load an LLM as torch.nn.Module using huggingface from_pretrained()",
+            short_description="Load an LLM in PyTorch using huggingface transformers",
             add_help=add_help,
         )

@@ -239,78 +239,3 @@ def run(
         status.add_to_state(state=state, name=input, model=model)

         return state
-
-
-class HuggingfaceAdapter(ModelAdapter):
-    """
-    Wrapper class for Huggingface LLMs that set generate() arguments to
-    make them more accurate and pleasant to chat with:
-
-    repetition_penalty: helps the LLM avoid repeating the same short
-    phrase in the response over and over.
-    temperature: helps the LLM stay focused on the prompt.
-    do_sample: apply the temperature.
-    """
-
-    def __init__(self, model, dtype=torch.float32, device="cpu"):
-        super().__init__()
-        self.model = model
-        self.dtype = dtype
-        self.device = device
-
-    def generate(
-        self,
-        input_ids,
-        max_new_tokens=512,
-        repetition_penalty=1.2,
-        do_sample=True,
-        temperature=0.1,
-        **kwargs,
-    ):
-        amp_enabled = (
-            True
-            if (self.dtype == torch.float16 or self.dtype == torch.bfloat16)
-            else False
-        )
-
-        # Move input_ids to the same device as the model
-        input_ids = input_ids.to(self.device)
-
-        with torch.no_grad(), torch.inference_mode(), torch.cpu.amp.autocast(
-            enabled=amp_enabled, dtype=self.dtype
-        ):
-            return self.model.generate(
-                input_ids=input_ids,
-                max_new_tokens=max_new_tokens,
-                repetition_penalty=repetition_penalty,
-                do_sample=do_sample,
-                temperature=temperature,
-                **kwargs,
-            )
-
-
-class AdaptHuggingface(Tool):
-    """
-    Apply specific settings to make Huggingface LLMs
-    more accurate and pleasant to chat with.
-    """
-
-    unique_name = "adapt-huggingface"
-
-    def __init__(self):
-        super().__init__(monitor_message="Adapting Huggingface LLM")
-
-    @staticmethod
-    def parser(add_help: bool = True) -> argparse.ArgumentParser:
-        parser = __class__.helpful_parser(
-            short_description="Apply accuracy-boosting settings to huggingface LLMs",
-            add_help=add_help,
-        )
-
-        return parser
-
-    def run(self, state: State) -> State:
-
-        state.model = HuggingfaceAdapter(state.model, state.dtype, state.device)
-
-        return state

src/lemonade/tools/humaneval.py (+1/-1)

@@ -42,7 +42,7 @@ def __init__(self):
     @staticmethod
     def parser(add_help: bool = True) -> argparse.ArgumentParser:
         parser = __class__.helpful_parser(
-            short_description="Run accuracy benchmark using HumanEval dataset",
+            short_description="Measure coding accuracy with HumanEval",
             add_help=add_help,
         )
         parser.add_argument(

src/lemonade/tools/llamacpp.py (+1/-1)

@@ -152,7 +152,7 @@ def __init__(self):
     @staticmethod
     def parser(add_help: bool = True) -> argparse.ArgumentParser:
         parser = __class__.helpful_parser(
-            short_description="Wrap Llamacpp models with an API",
+            short_description="Wrap llama.cpp models with an API",
             add_help=add_help,
         )

src/lemonade/tools/llamacpp_bench.py (+1/-1)

@@ -24,7 +24,7 @@ def __init__(self):
     @staticmethod
     def parser(add_help: bool = True) -> argparse.ArgumentParser:
         parser = __class__.helpful_parser(
-            short_description="Benchmark a Llamacpp model",
+            short_description="Benchmark a llama.cpp model",
             add_help=add_help,
         )

src/lemonade/tools/mmlu.py (+2/-2)

@@ -43,8 +43,8 @@ def __init__(self):
     @staticmethod
     def parser(add_help: bool = True) -> argparse.ArgumentParser:
         parser = __class__.helpful_parser(
-            short_description="Run accuracy benchmark using Massive Multitask "
-            "Language Understanding (MMLU) test",
+            short_description="Measure accuracy with Massive Multitask "
+            "Language Understanding (MMLU)",
             add_help=add_help,
         )

src/lemonade/tools/ort_genai/oga.py (+4)

@@ -162,6 +162,10 @@ def generate(
             past_present_share_buffer=search_config.get(
                 "past_present_share_buffer", True
             ),
+            # Make sure that results do not vary across laptops;
+            # by default, random_seed=-1 causes different laptops to give
+            # different results
+            random_seed=1,
             # Not currently supported by OGA
             # diversity_penalty=search_config.get('diversity_penalty', 0.0),
             # no_repeat_ngram_size=search_config.get('no_repeat_ngram_size', 0),

src/lemonade/tools/perplexity.py (+2/-2)

@@ -12,7 +12,7 @@

 class AccuracyPerplexity(Tool):
     """
-    Measure perplexity of an LLM using the wikitext dataset.
+    Measure perplexity of an LLM using the Wikitext-2 dataset.

     Required input state:
     - state.model: instance that provides a __call__() method that returns

@@ -32,7 +32,7 @@ def __init__(self):
     @staticmethod
     def parser(add_help: bool = True) -> argparse.ArgumentParser:
         parser = __class__.helpful_parser(
-            short_description="Measure Perplexity score using Wikitext-2 dataset",
+            short_description="Measure perplexity score",
             add_help=add_help,
         )
         return parser

src/lemonade/tools/serve.py (+72/-35)

@@ -4,8 +4,9 @@
 import time
 from threading import Thread, Event
 import logging
+import traceback

-from fastapi import FastAPI, HTTPException, status
+from fastapi import FastAPI, HTTPException, status, Request
 from fastapi.responses import StreamingResponse
 from fastapi.middleware.cors import CORSMiddleware
 from pydantic import BaseModel

@@ -24,10 +25,9 @@
 from turnkeyml.state import State
 from turnkeyml.tools.management_tools import ManagementTool
 from lemonade.tools.adapter import ModelAdapter
-from lemonade.tools.prompt import DEFAULT_GENERATE_PARAMS
 from lemonade.tools.huggingface_load import HuggingfaceLoad
 from lemonade.cache import DEFAULT_CACHE_DIR
-
+from lemonade_install.install import ModelManager

 # Set to a high number to allow for interesting experiences in real apps
 # Tests should use the max_new_tokens argument to set a lower value

@@ -36,6 +36,34 @@
 DEFAULT_PORT = 8000
 DEFAULT_LOG_LEVEL = "info"

+LOCAL_MODELS = ModelManager().downloaded_models_enabled
+
+
+class GeneratorThread(Thread):
+    """
+    Thread class designed for use with streaming generation within
+    an LLM server. It needs access to the streamer in order
+    to help the completions APIs escape the "for text in streamer" loop.
+    It also provides exception handling that works nicely with HTTP
+    servers by providing the stack trace and making the exception
+    information available to the main thread.
+    """
+
+    def __init__(self, streamer, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.exception = None
+        self.streamer = streamer
+
+    def run(self):
+        try:
+            if self._target:
+                self._target(*self._args, **self._kwargs)
+        except Exception as e:  # pylint: disable=broad-except
+            self.exception = e
+            logging.error(f"Exception raised in generate thread: {e}")
+            traceback.print_exc()
+            self.streamer.done()
+

 # Custom huggingface-style stopping criteria to allow
 # us to halt streaming in-progress generations

@@ -150,6 +178,7 @@ def __init__(self):
         self.input_tokens = None
         self.output_tokens = None
         self.decode_token_times = None
+        self.process_time = None

         # Store debug logging state
         self.debug_logging_enabled = logging.getLogger().isEnabledFor(logging.DEBUG)

@@ -169,36 +198,15 @@ def __init__(self):
         self._generate_semaphore = asyncio.Semaphore(self.max_concurrent_generations)

         # Curated list of "Instruct" and "Chat" models.
-        self.builtin_models = {
-            "Qwen2.5-0.5B-Instruct-CPU": {
-                "checkpoint": "Qwen/Qwen2.5-0.5B-Instruct",
-                "device": "cpu",
-            },
-            "Llama-3.2-1B-Instruct-Hybrid": {
-                "checkpoint": "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
-                "device": "hybrid",
-            },
-            "Llama-3.2-3B-Instruct-Hybrid": {
-                "checkpoint": "amd/Llama-3.2-3B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
-                "device": "hybrid",
-            },
-            "Phi-3-Mini-Instruct-Hybrid": {
-                "checkpoint": "amd/Phi-3-mini-4k-instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
-                "device": "hybrid",
-            },
-            "Qwen-1.5-7B-Chat-Hybrid": {
-                "checkpoint": "amd/Qwen1.5-7B-Chat-awq-g128-int4-asym-fp16-onnx-hybrid",
-                "device": "hybrid",
-            },
-        }
+        self.local_models = LOCAL_MODELS

         # Add lock for load/unload operations
         self._load_lock = asyncio.Lock()

     @staticmethod
     def parser(add_help: bool = True) -> argparse.ArgumentParser:
         parser = __class__.helpful_parser(
-            short_description="Industry Standard Model Server",
+            short_description="Launch an industry-standard LLM server",
             add_help=add_help,
         )

@@ -272,6 +280,10 @@ def trace(message, *args, **kwargs):
         # Update debug logging state after setting log level
         self.debug_logging_enabled = logging.getLogger().isEnabledFor(logging.DEBUG)

+        if self.debug_logging_enabled:
+            # Print the elapsed time for each request
+            self.setup_middleware_timer()
+
         # Only load the model when starting the server if checkpoint was provided
         if checkpoint:
             config = LoadConfig(

@@ -297,6 +309,7 @@ async def _show_telemetry(self):
             ["Output tokens", self.output_tokens],
             ["TTFT (s)", f"{self.time_to_first_token:.2f}"],
             ["TPS", f"{self.tokens_per_second:.2f}"],
+            ["Total time (s)", f"{self.process_time:.2f}"],
         ]

         table = tabulate(

@@ -313,8 +326,8 @@ async def completions(self, completion_request: CompletionRequest):
         if completion_request.model:

             # Get model config
-            if completion_request.model in self.builtin_models:
-                model_config = self.builtin_models[completion_request.model]
+            if completion_request.model in self.local_models:
+                model_config = self.local_models[completion_request.model]
                 lc = LoadConfig(**model_config)
             else:
                 # If the model is not built-in, we assume it corresponds to a checkpoint

@@ -394,8 +407,8 @@ async def chat_completions(self, chat_completion_request: ChatCompletionRequest)
         """

         # Get model config
-        if chat_completion_request.model in self.builtin_models:
-            model_config = self.builtin_models[chat_completion_request.model]
+        if chat_completion_request.model in self.local_models:
+            model_config = self.local_models[chat_completion_request.model]
             lc = LoadConfig(**model_config)
         else:
             # If the model is not built-in, we assume it corresponds to a checkpoint

@@ -548,7 +561,6 @@ async def _generate_tokens(self, message: str, stop: list[str] | str | None = No
             "min_new_tokens": 1,
             "pad_token_id": tokenizer.eos_token_id,
             "stopping_criteria": stopping_criteria,
-            **DEFAULT_GENERATE_PARAMS,
         }

         # Initialize performance variables

@@ -558,7 +570,9 @@ async def _generate_tokens(self, message: str, stop: list[str] | str | None = No
         self.output_tokens = 0

         # Begin generation
-        thread = Thread(target=model.generate, kwargs=generation_kwargs)
+        thread = GeneratorThread(
+            streamer, target=model.generate, kwargs=generation_kwargs
+        )
         thread.start()

         # Acquire the generation semaphore

@@ -621,7 +635,15 @@ async def _generate_tokens(self, message: str, stop: list[str] | str | None = No
                 self.max_concurrent_generations
                 - self._generate_semaphore._value  # pylint: disable=protected-access
             )
-            logging.debug(f"Active generations: {active_generations}")
+
+            # Check if an exception occurred in the generation thread
+            # If it did, raise it as an HTTPException so that the client
+            # knows they won't be getting a completion
+            if thread.exception:
+                raise HTTPException(
+                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                    detail=f"Completion failure {thread.exception}",
+                )

             # Display telemetry if in debug mode
             await self._show_telemetry()

@@ -707,7 +729,7 @@ async def load_llm(self, config: LoadConfig):
                 input=config.checkpoint,
                 device=config.device,
                 dtype="int4",
-                force=True,
+                force=False,
             )
             self.max_new_tokens = config.max_new_tokens
             self.llm_loaded = config.checkpoint

@@ -766,7 +788,7 @@ async def models(self):
         Return a list of available models in OpenAI-compatible format.
         """
         models_list = []
-        for model in self.builtin_models:
+        for model in self.local_models:
             m = Model(
                 id=model,
                 owned_by="lemonade",

@@ -776,3 +798,18 @@ async def models(self):
             models_list.append(m)

         return {"object": "list", "data": models_list}
+
+    def setup_middleware_timer(self):
+        logging.info("Middleware set up")
+
+        @self.app.middleware("http")
+        async def save_process_time(request: Request, call_next):
+            """
+            Save the request processing time for any request, so that it can be
+            printed as telemetry.
+            """
+
+            start_time = time.perf_counter()
+            response = await call_next(request)
+            self.process_time = time.perf_counter() - start_time
+            return response
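The streaming path that `GeneratorThread` supports can be exercised end to end with any OpenAI-compatible client; a hedged sketch (default port 8000 assumed, `openai` package v1+):

```python
# Hedged sketch: stream a chat completion from a running Lemonade Server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/api/v0", api_key="-")
stream = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",  # must be available locally
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```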

src/lemonade_install/install.py (+136/-3)

@@ -10,11 +10,12 @@
 import subprocess
 import sys
 import shutil
+import pkg_resources
 from pathlib import Path
 from typing import Optional
 import zipfile
 import requests
-
+import huggingface_hub

 lemonade_install_dir = Path(__file__).parent.parent.parent
 DEFAULT_AMD_OGA_NPU_DIR = os.path.join(

@@ -33,6 +34,123 @@
 )


+class ModelManager:
+
+    @property
+    def supported_cpu_models(self) -> dict:
+        """
+        Returns a dictionary of supported CPU models.
+        Note: Models must be downloaded before they are locally available.
+        """
+        return {
+            "Qwen2.5-0.5B-Instruct-CPU": {
+                "checkpoint": "Qwen/Qwen2.5-0.5B-Instruct",
+                "device": "cpu",
+            }
+        }
+
+    @property
+    def supported_hybrid_models(self) -> dict:
+        """
+        Returns a dictionary of supported hybrid models.
+        Note: Models must be downloaded before they are locally available.
+        """
+        return {
+            "Llama-3.2-1B-Instruct-Hybrid": {
+                "checkpoint": "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
+                "device": "hybrid",
+            },
+            "Llama-3.2-3B-Instruct-Hybrid": {
+                "checkpoint": "amd/Llama-3.2-3B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
+                "device": "hybrid",
+            },
+            "Phi-3-Mini-Instruct-Hybrid": {
+                "checkpoint": "amd/Phi-3-mini-4k-instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
+                "device": "hybrid",
+            },
+            "Qwen-1.5-7B-Chat-Hybrid": {
+                "checkpoint": "amd/Qwen1.5-7B-Chat-awq-g128-int4-asym-fp16-onnx-hybrid",
+                "device": "hybrid",
+            },
+        }
+
+    @property
+    def supported_models(self) -> dict:
+        """
+        Returns a dictionary of all supported models across all supported backends.
+        """
+        return {**self.supported_cpu_models, **self.supported_hybrid_models}
+
+    @property
+    def downloaded_hf_checkpoints(self) -> list[str]:
+        """
+        Returns a list of Hugging Face checkpoints that have been downloaded.
+        """
+        downloaded_hf_checkpoints = []
+        try:
+            hf_cache_info = huggingface_hub.scan_cache_dir()
+            downloaded_hf_checkpoints = [entry.repo_id for entry in hf_cache_info.repos]
+        except huggingface_hub.CacheNotFound:
+            pass
+        except Exception as e:  # pylint: disable=broad-exception-caught
+            print(f"Error scanning Hugging Face cache: {e}")
+        return downloaded_hf_checkpoints
+
+    @property
+    def downloaded_cpu_models(self) -> dict:
+        """
+        Returns a dictionary of locally available CPU models.
+        """
+        downloaded_cpu_models = {}
+        for model in self.supported_cpu_models:
+            if (
+                self.supported_cpu_models[model]["checkpoint"]
+                in self.downloaded_hf_checkpoints
+            ):
+                downloaded_cpu_models[model] = self.supported_cpu_models[model]
+        return downloaded_cpu_models
+
+    @property
+    def downloaded_hybrid_models(self) -> dict:
+        """
+        Returns a dictionary of locally available hybrid models.
+        """
+        downloaded_hybrid_models = {}
+        for model in self.supported_hybrid_models:
+            if (
+                self.supported_hybrid_models[model]["checkpoint"]
+                in self.downloaded_hf_checkpoints
+            ):
+                downloaded_hybrid_models[model] = self.supported_hybrid_models[model]
+        return downloaded_hybrid_models
+
+    @property
+    def downloaded_models_enabled(self) -> dict:
+        """
+        Returns a dictionary of locally available models that are enabled by the current installation.
+        """
+        downloaded_models_enabled = self.downloaded_cpu_models.copy()
+        if (
+            "onnxruntime-vitisai" in pkg_resources.working_set.by_key
+            and "onnxruntime-genai-directml" in pkg_resources.working_set.by_key
+        ):
+            downloaded_models_enabled.update(self.downloaded_hybrid_models)
+        return downloaded_models_enabled
+
+    def download_models(self, models: list[str]):
+        """
+        Downloads the specified models from Hugging Face.
+        """
+        for model in models:
+            if model not in self.supported_models:
+                raise ValueError(
+                    f"Model {model} is not supported. Please choose from the following: {list(self.supported_models.keys())}"
+                )
+            checkpoint = self.supported_models[model]["checkpoint"]
+            print(f"Downloading {model} ({checkpoint})")
+            huggingface_hub.snapshot_download(repo_id=checkpoint)
+
+
 def download_lfs_file(token, file, output_filename):
     """Downloads a file from LFS"""
     # Set up the headers for the request

@@ -201,6 +319,14 @@ def parser() -> argparse.ArgumentParser:
         choices=["0.6.0"],
     )

+    parser.add_argument(
+        "--models",
+        help="One or more models to download",
+        type=str,
+        nargs="+",
+        choices=ModelManager().supported_models,
+    )
+
     return parser

@@ -209,11 +335,18 @@ def run(
     quark: Optional[str] = None,
     yes: bool = False,
     token: Optional[str] = None,
+    models: Optional[str] = None,
 ):
-    if ryzenai is None and quark is None:
+    if ryzenai is None and quark is None and models is None:
         raise ValueError(
-            "You must select something to install, for example `--ryzenai` and/or `--quark`"
+            "You must select something to install, for example `--ryzenai`, `--quark`, or `--models`"
         )
+
+    # Download models if needed
+    if models is not None:
+        model_manager = ModelManager()
+        model_manager.download_models(models)
+
     if ryzenai is not None:
         if ryzenai == "npu":
             file = "ryzen_ai_13_ga/npu-llm-artifacts_1.3.0.zip"
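A hedged sketch of how the new `ModelManager` surface fits together (names taken from the class above; cache contents will vary by machine):

```python
# Hedged sketch: inspect and download models with the new ModelManager.
from lemonade_install.install import ModelManager

mm = ModelManager()
print(list(mm.supported_models))           # every supported model name
print(list(mm.downloaded_models_enabled))  # models usable with this install

# Downloads go through huggingface_hub.snapshot_download(), per the code above:
mm.download_models(["Qwen2.5-0.5B-Instruct-CPU"])
```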

src/turnkeyml/__init__.py (-2)

@@ -1,5 +1,3 @@
 from turnkeyml.version import __version__

-from .files_api import evaluate_files
-from .cli.cli import main as turnkeycli
 from .state import load_state, State

‎src/turnkeyml/cli/cli.py

+3-137
@@ -1,39 +1,13 @@
-import argparse
-import sys
 import os
 from difflib import get_close_matches
-from typing import List, Dict, Tuple, Any
+from typing import List
 import turnkeyml.common.filesystem as fs
 from turnkeyml.sequence import Sequence
-from turnkeyml.tools import Tool, FirstTool, NiceHelpFormatter
+from turnkeyml.tools import FirstTool, NiceHelpFormatter
 from turnkeyml.sequence.tool_plugins import get_supported_tools
 from turnkeyml.cli.spawn import DEFAULT_TIMEOUT_SECONDS
 from turnkeyml.files_api import evaluate_files
-import turnkeyml.common.printing as printing
-from turnkeyml.tools.management_tools import ManagementTool
-
-
-class CustomArgumentParser(argparse.ArgumentParser):
-
-    def error(self, message):
-        self.print_usage()
-        printing.log_error(message)
-        self.exit(2)
-
-
-def _tool_list_help(tools: List[Tool], subclass, exclude=None) -> str:
-    help = ""
-
-    for tool_class in tools:
-        if exclude and issubclass(tool_class, exclude):
-            continue
-        if issubclass(tool_class, subclass):
-            help = (
-                help
-                + f" * {tool_class.unique_name}: {tool_class.parser().short_description}\n"
-            )
-
-    return help
+from turnkeyml.common.cli_helpers import parse_tools, CustomArgumentParser
 
 
 def _check_extension(
@@ -63,114 +37,6 @@ def _check_extension(
     return file_name
 
 
-def parse_tools(
-    parser: argparse.ArgumentParser, supported_tools: List[Tool]
-) -> Tuple[Dict[str, Any], Dict[Tool, List[str]], List[str]]:
-    """
-    Add the help for parsing tools and their args to an ArgumentParser.
-
-    Then, perform the task of parsing a full turnkey CLI command including
-    teasing apart the global arguments and separate tool invocations.
-    """
-
-    tool_parsers = {tool.unique_name: tool.parser() for tool in supported_tools}
-    tool_classes = {tool.unique_name: tool for tool in supported_tools}
-
-    # Sort tools into categories and format for the help menu
-    first_tool_choices = _tool_list_help(supported_tools, FirstTool)
-    eval_tool_choices = _tool_list_help(supported_tools, Tool, exclude=FirstTool)
-    mgmt_tool_choices = _tool_list_help(supported_tools, ManagementTool)
-
-    tools_action = parser.add_argument(
-        "tools",
-        metavar="tool --tool-args [tool --tool-args...]",
-        nargs="?",
-        help=f"""\
-Available tools that can be sequenced together to perform a build.
-
-Call `turnkey TOOL -h` to learn more about each tool.
-
-Tools that can start a sequence:
-{first_tool_choices}
-Tools that go into a sequence:
-{eval_tool_choices}
-Management tool choices:
-{mgmt_tool_choices}""",
-        choices=tool_parsers.keys(),
-    )
-
-    # run as if "-h" was passed if no parameters are passed
-    if len(sys.argv) == 1:
-        sys.argv.append("-h")
-
-    # Break sys.argv into categories based on which tools were invoked
-    # Arguments that are passed prior to invoking a tool are categorized as
-    # global arguments that should be used to initialize the state.
-    current_tool = "globals"
-    tools_invoked = {current_tool: []}
-    cmd = sys.argv[1:]
-    while len(cmd):
-        if cmd[0] in tool_parsers.keys():
-            # Make sure each tool was only called once
-            if cmd[0] in tools_invoked.keys():
-                parser.error(
-                    "A single call to turnkey can only invoke each tool once, "
-                    f"however this call invokes tool {cmd[0]} multiple times."
-                )
-            current_tool = cmd.pop(0)
-            tools_invoked[current_tool] = []
-        else:
-            tools_invoked[current_tool].append(cmd.pop(0))
-
-    # Trick argparse into thinking tools was not a positional argument
-    # this helps to avoid an error where an incorrect arg/value pair
-    # can be misinterpreted as the tools positional argument
-    tools_action.option_strings = ["--tools"]
-
-    # Do one pass of parsing to figure out if -h was used
-    global_args = vars(parser.parse_args(tools_invoked["globals"]))
-
-    # Remove "tools" from global args because it was just there
-    # as a placeholder
-    global_args.pop("tools")
-
-    # Remove globals from the list since its already been parsed
-    tools_invoked.pop("globals")
-    evaluation_tools = []
-    management_tools = []
-    for cmd, argv in tools_invoked.items():
-        tool_parsers[cmd].parse_args(argv)
-
-        # Keep track of whether the tools are ManagementTool or not,
-        # since ManagementTools are mutually exclusive with evaluation
-        # tools
-        if issubclass(tool_classes[cmd], ManagementTool):
-            management_tools.append(cmd)
-        else:
-            evaluation_tools.append(cmd)
-
-    if len(management_tools) > 0 and len(evaluation_tools) > 0:
-        parser.error(
-            "This call to turnkey invoked both management and "
-            "evaluation tools, however each call to turnkey "
-            "is only allowed to invoke one or the other. "
-            f"Management tools: {management_tools};"
-            f"Evaluation tools: {evaluation_tools}."
-        )
-
-    if len(management_tools) == 0 and len(evaluation_tools) == 0:
-        parser.error(
-            "Calls to turnkey are required to call at least "
-            "one tool or management tool."
-        )
-
-    # Convert tool names into Tool instances
-    tool_instances = {tool_classes[cmd](): argv for cmd, argv in tools_invoked.items()}
-    evaluation_tools = [tool_classes[cmd] for cmd in evaluation_tools]
-
-    return global_args, tool_instances, evaluation_tools
-
-
 def main():
 
     supported_tools = get_supported_tools()

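With the parser plumbing relocated, `cli.py` shrinks to the entry point and its `turnkey`-specific glue, and the same helpers become reusable by other CLIs. A sketch of how a downstream command-line tool might drive them (the `my-cli` name is a placeholder; the call signature comes from the new `cli_helpers` module below):

```python
from turnkeyml.common.cli_helpers import parse_tools, CustomArgumentParser
from turnkeyml.sequence.tool_plugins import get_supported_tools

# Build a parser that reports errors through turnkeyml's logger.
parser = CustomArgumentParser(
    description="Example CLI reusing turnkeyml's shared argument handling"
)

# parse_tools() registers the "tools" positional, then splits sys.argv into
# global arguments, per-tool argument lists, and the evaluation tool classes.
global_args, tool_instances, evaluation_tools = parse_tools(
    parser, get_supported_tools(), cli_name="my-cli"
)
```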
‎src/turnkeyml/common/cli_helpers.py

+135
@@ -0,0 +1,135 @@
+import argparse
+import sys
+from typing import List, Dict, Tuple, Any
+from turnkeyml.tools import Tool, FirstTool
+import turnkeyml.common.printing as printing
+from turnkeyml.tools.management_tools import ManagementTool
+
+
+class CustomArgumentParser(argparse.ArgumentParser):
+
+    def error(self, message):
+        self.print_usage()
+        printing.log_error(message)
+        self.exit(2)
+
+
+def _tool_list_help(tools: List[Tool], subclass, exclude=None) -> str:
+    help = ""
+
+    for tool_class in tools:
+        if exclude and issubclass(tool_class, exclude):
+            continue
+        if issubclass(tool_class, subclass):
+            help = (
+                help
+                + f" * {tool_class.unique_name}: {tool_class.parser().short_description}\n"
+            )
+
+    return help
+
+
+def parse_tools(
+    parser: argparse.ArgumentParser, supported_tools: List[Tool], cli_name="turnkey"
+) -> Tuple[Dict[str, Any], Dict[Tool, List[str]], List[str]]:
+    """
+    Add the help for parsing tools and their args to an ArgumentParser.
+
+    Then, perform the task of parsing a full turnkey CLI command including
+    teasing apart the global arguments and separate tool invocations.
+    """
+
+    tool_parsers = {tool.unique_name: tool.parser() for tool in supported_tools}
+    tool_classes = {tool.unique_name: tool for tool in supported_tools}
+
+    # Sort tools into categories and format for the help menu
+    first_tool_choices = _tool_list_help(supported_tools, FirstTool)
+    eval_tool_choices = _tool_list_help(supported_tools, Tool, exclude=FirstTool)
+    mgmt_tool_choices = _tool_list_help(supported_tools, ManagementTool)
+
+    tools_action = parser.add_argument(
+        "tools",
+        metavar="tool --tool-args [tool --tool-args...]",
+        nargs="?",
+        help=f"""\
+Run `{cli_name} TOOL -h` to learn more about each tool.
+
+Tools that can start a sequence:
+{first_tool_choices}
+Tools that go into a sequence:
+{eval_tool_choices}
+Management tools:
+{mgmt_tool_choices}""",
+        choices=tool_parsers.keys(),
+    )
+
+    # run as if "-h" was passed if no parameters are passed
+    if len(sys.argv) == 1:
+        sys.argv.append("-h")
+
+    # Break sys.argv into categories based on which tools were invoked
+    # Arguments that are passed prior to invoking a tool are categorized as
+    # global arguments that should be used to initialize the state.
+    current_tool = "globals"
+    tools_invoked = {current_tool: []}
+    cmd = sys.argv[1:]
+    while len(cmd):
+        if cmd[0] in tool_parsers.keys():
+            # Make sure each tool was only called once
+            if cmd[0] in tools_invoked.keys():
+                parser.error(
+                    "A single call to turnkey can only invoke each tool once, "
+                    f"however this call invokes tool {cmd[0]} multiple times."
+                )
+            current_tool = cmd.pop(0)
+            tools_invoked[current_tool] = []
+        else:
+            tools_invoked[current_tool].append(cmd.pop(0))
+
+    # Trick argparse into thinking tools was not a positional argument
+    # this helps to avoid an error where an incorrect arg/value pair
+    # can be misinterpreted as the tools positional argument
+    tools_action.option_strings = ["--tools"]
+
+    # Do one pass of parsing to figure out if -h was used
+    global_args = vars(parser.parse_args(tools_invoked["globals"]))
+
+    # Remove "tools" from global args because it was just there
+    # as a placeholder
+    global_args.pop("tools")
+
+    # Remove globals from the list since its already been parsed
+    tools_invoked.pop("globals")
+    evaluation_tools = []
+    management_tools = []
+    for cmd, argv in tools_invoked.items():
+        tool_parsers[cmd].parse_args(argv)
+
+        # Keep track of whether the tools are ManagementTool or not,
+        # since ManagementTools are mutually exclusive with evaluation
+        # tools
+        if issubclass(tool_classes[cmd], ManagementTool):
+            management_tools.append(cmd)
+        else:
+            evaluation_tools.append(cmd)
+
+    if len(management_tools) > 0 and len(evaluation_tools) > 0:
+        parser.error(
+            "This call to turnkey invoked both management and "
+            "evaluation tools, however each call to turnkey "
+            "is only allowed to invoke one or the other. "
+            f"Management tools: {management_tools};"
+            f"Evaluation tools: {evaluation_tools}."
+        )
+
+    if len(management_tools) == 0 and len(evaluation_tools) == 0:
+        parser.error(
+            "Calls to turnkey are required to call at least "
+            "one tool or management tool."
+        )
+
+    # Convert tool names into Tool instances
+    tool_instances = {tool_classes[cmd](): argv for cmd, argv in tools_invoked.items()}
+    evaluation_tools = [tool_classes[cmd] for cmd in evaluation_tools]
+
+    return global_args, tool_instances, evaluation_tools

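One behavior worth noting in the relocated `CustomArgumentParser`: instead of argparse's default stderr message, errors are routed through `printing.log_error` while still exiting with status 2. A small sketch of that contract (the `--count` option is a hypothetical example):

```python
from turnkeyml.common.cli_helpers import CustomArgumentParser

parser = CustomArgumentParser(prog="example")
parser.add_argument("--count", type=int)  # hypothetical option

# An invalid value routes through CustomArgumentParser.error(), which
# prints the usage string, logs the message via turnkeyml's printing
# module, and raises SystemExit with code 2.
try:
    parser.parse_args(["--count", "not-a-number"])
except SystemExit as exc:
    assert exc.code == 2
```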
‎src/turnkeyml/tools/management_tools.py

+1-1
@@ -122,7 +122,7 @@ def parser(add_help: bool = True) -> argparse.ArgumentParser:
         # passed directly to the `run()` method
 
         parser = __class__.helpful_parser(
-            short_description="Manage the turnkey build cache " f"",
+            short_description="Manage the build cache " f"",
             add_help=add_help,
         )
 
‎src/turnkeyml/version.py

+1-1
@@ -1 +1 @@
-__version__ = "6.0.2"
+__version__ = "6.0.3"

‎test/lemonade/server.py

+1-1
@@ -212,7 +212,7 @@ def test_004_test_models(self):
         assert len(l.data) > 0
 
         # Check that the list contains the models we expect
-        assert any(model.id == "Llama-3.2-1B-Instruct-Hybrid" for model in l.data)
+        assert any(model.id == "Qwen2.5-0.5B-Instruct-CPU" for model in l.data)
 
     # Endpoint: /api/v0/completions
     def test_005_test_completions(self):

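The test now keys on a small CPU model rather than a hybrid NPU one, so it can pass on plain CPU runners. The same check can be reproduced against a locally running Lemonade Server with any OpenAI-compatible client (the base URL and port below are assumptions, not part of this diff):

```python
from openai import OpenAI

# Assumes a Lemonade Server instance is already running locally; adjust
# the base URL and port to match your setup.
client = OpenAI(base_url="http://localhost:8000/api/v0", api_key="unused")

models = client.models.list()
assert len(models.data) > 0
assert any(model.id == "Qwen2.5-0.5B-Instruct-CPU" for model in models.data)
```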