```bash
lemonade -i facebook/opt-125m huggingface-load llm-prompt -p "Hello, my thoughts are"
```
The LLM will run with your provided prompt, and the LLM's response to your prompt will be printed to the screen. You can replace the `"Hello, my thoughts are"` with any prompt you like.
You can also replace the `facebook/opt-125m` with any Hugging Face checkpoint you like, including LLaMA-2, Phi-2, Qwen, Mamba, etc.
You can also set the `--device` argument in `oga-load` and `huggingface-load` to load your LLM on a different device.
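For example, a sketch of what that can look like (the `--device cpu` value here is illustrative; run the `-h` commands below to see the devices your install actually supports):

```bash
# Illustrative: pick the device at load time, then prompt as before.
# Available --device values depend on your installation.
lemonade -i facebook/opt-125m huggingface-load --device cpu llm-prompt -p "Hello, my thoughts are"
```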
Run `lemonade huggingface-load -h` and `lemonade llm-prompt -h` to learn more about those tools.
## Accuracy
To measure the accuracy of an LLM using MMLU, try this:
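A sketch of what that command can look like (the `accuracy-mmlu` tool name and `--tests` flag are assumptions here; check `lemonade -h` for the exact accuracy tools your version provides):

```bash
# Illustrative: load the model, then score it on an MMLU subject
lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management
```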
That command will run a few warmup iterations, then a few generation iterations where performance data is collected.
The prompt size, number of output tokens, and number of iterations are all parameters. Learn more by running `lemonade oga-bench -h` and `lemonade huggingface-bench -h`.
## Memory Usage
The peak memory used by the `lemonade` build is captured in the build output. To capture more granular
memory usage information, use the `--memory` flag. For example:
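One possible shape for that command (a sketch; the placement of the global `--memory` flag before the tool sequence is an assumption):

```bash
# Illustrative: add --memory to a build to log granular memory
# usage while the load and bench tools run
lemonade -i facebook/opt-125m --memory huggingface-load huggingface-bench
```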
Lemonade is also available via API.
## LEAP APIs
The lemonade enablement platform (LEAP) API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid). This makes it easy to integrate lemonade LLMs into Python applications.
OGA iGPU:
```python
from lemonade import leap

# The checkpoint and recipe values below are illustrative
model, tokenizer = leap.from_pretrained("facebook/opt-125m", recipe="oga-igpu")

input_ids = tokenizer("Hello, my thoughts are", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)
print(tokenizer.decode(response[0]))
```
## Low-Level API
The low-level API is useful for designing custom experiments, for example to sweep over many checkpoints, devices, and/or tools.
Here's a quick example of how to prompt a Hugging Face LLM using the low-level API, which calls the load and prompt tools one by one:
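One possible shape for that example, sketched under the assumption that the tool classes live where the repo layout suggests (the module paths, `HuggingfaceLoad`, `Prompt`, and the `State` constructor arguments are assumptions, not a verified API):

```python
# Module paths and class names here are assumptions based on the repo layout
import lemonade.tools.torch_llm as tl
import lemonade.tools.chat as cl
from turnkeyml.state import State

# Create a State to hold build artifacts, then run the load and prompt
# tools one by one, each consuming and returning the State
state = State(cache_dir="cache", build_name="test")
state = tl.HuggingfaceLoad().run(state, input="facebook/opt-125m")
state = cl.Prompt().run(state, prompt="Hello, my thoughts are", max_new_tokens=15)

print("Response:", state.response)
```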
# OnnxRuntime GenAI (OGA) for iGPU and CPU
[onnxruntime-genai (aka OGA)](https://github.com/microsoft/onnxruntime-genai/tree/main?tab=readme-ov-file) is a new framework created by Microsoft for running ONNX LLMs.
## Installation
## Directory structure
- The `model_builder` tool caches Hugging Face files and temporary ONNX external data files in `<LEMONADE_CACHE>\model_builder`
- The output from `model_builder` is stored in `<LEMONADE_CACHE>\oga_models\<MODELNAME>\<SUBFOLDER>`
- `MODELNAME` is the Hugging Face checkpoint name where any '/' is mapped to an '_' and everything is lower case.
- `SUBFOLDER` is `<EP>-<DTYPE>`, where `EP` is the execution provider (`dml` for iGPU, `cpu` for CPU, and `npu` for NPU) and `DTYPE` is the datatype.
- If the `--int4-block-size` flag is used, then `SUBFOLDER` is `<EP>-<DTYPE>-block-<SIZE>`, where `SIZE` is the specified block size.
- Other ONNX models in the format required by onnxruntime-genai can be loaded in lemonade if placed in the `<LEMONADE_CACHE>\oga_models` folder.
- Use the `-i` and `--subfolder` flags to specify the folder and subfolder:
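A sketch of such an invocation (the folder and subfolder names are placeholders, and the exact position of `--subfolder` relative to the tool name is an assumption; check `lemonade oga-load -h`):

```bash
# Illustrative: load an ONNX model placed in
# <LEMONADE_CACHE>\oga_models\my_model_name\my_subfolder
lemonade -i my_model_name oga-load --subfolder my_subfolder
```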