
Commit 27083bd

Add prompt lookup decoding (#379)
Ticket: 138549
1 parent f4d50db

4 files changed: +364 -2 lines

.github/workflows/causal_lm_cpp.yml (+46)
@@ -353,6 +353,52 @@ jobs:
         "
         echo "Alan Turing was a" passed
 
+
+  cpp-prompt_lookup_decoding_lm-ubuntu:
+    runs-on: ubuntu-20.04-16-cores
+    steps:
+    - uses: actions/checkout@v4
+      with:
+        submodules: recursive
+    - uses: actions/setup-python@v4
+      with:
+        python-version: 3.8
+    - name: Install OpenVINO
+      run: |
+        mkdir ./ov/
+        curl https://storage.openvinotoolkit.org/repositories/openvino/packages/nightly/2024.1.0-14645-e6dc0865128/l_openvino_toolkit_ubuntu20_2024.1.0.dev20240304_x86_64.tgz | tar --directory ./ov/ --strip-components 1 -xz
+        sudo ./ov/install_dependencies/install_openvino_dependencies.sh
+    - name: Download, convert and build
+      run: |
+        source ./ov/setupvars.sh
+        python -m pip install --upgrade-strategy eager "optimum>=1.14" -r ./llm_bench/python/requirements.txt "transformers<4.38" ./thirdparty/openvino_tokenizers/[transformers] --extra-index-url https://download.pytorch.org/whl/cpu
+        python ./llm_bench/python/convert.py --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --output_dir ./TinyLlama-1.1B-Chat-v1.0/ --precision FP16
+        convert_tokenizer ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ --output ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ --with-detokenizer
+        cmake -DCMAKE_BUILD_TYPE=Release -S ./text_generation/causal_lm/cpp/ -B ./build/
+        cmake --build ./build/ --config Release -j
+        wait
+    - name: run and compare
+      run: |
+        source ./ov/setupvars.sh
+
+        echo 'Code:```python
+        def add(a, b):
+          return a + b
+        ```
+        Question: Can you please add 2 and 3
+        A:' > ./prompt.txt
+
+        ./build/prompt_lookup_decoding_lm ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ "$(<prompt.txt)" > predictions_prompt_lookup.txt
+        ./build/greedy_causal_lm ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ "$(<prompt.txt)" > predictions_greedy.txt
+        python -c "
+        with open('predictions_greedy.txt', 'r') as f:
+            predicted_greedy = f.readline()
+        with open('predictions_prompt_lookup.txt', 'r') as f:
+            predicted_prompt_lookup = f.readline()
+        assert predicted_greedy == predicted_prompt_lookup
+        "
+        echo "Prompt lookup" passed
+
   cpp-Phi-1_5:
     runs-on: ubuntu-20.04-16-cores
     steps:

text_generation/causal_lm/cpp/CMakeLists.txt (+8)
@@ -28,3 +28,11 @@ find_package(OpenVINO REQUIRED COMPONENTS Runtime)
 target_link_libraries(speculative_decoding_lm PRIVATE openvino::runtime)
 set_target_properties(speculative_decoding_lm PROPERTIES CXX_STANDARD 17)
 set_target_properties(speculative_decoding_lm PROPERTIES CXX_STANDARD_REQUIRED ON)
+
+add_executable(prompt_lookup_decoding_lm prompt_lookup_decoding_lm.cpp)
+target_compile_definitions(prompt_lookup_decoding_lm PRIVATE OPENVINO_TOKENIZERS_PATH=\"$<TARGET_FILE:openvino_tokenizers>\")
+target_include_directories(prompt_lookup_decoding_lm PRIVATE ./)
+find_package(OpenVINO REQUIRED COMPONENTS Runtime)
+target_link_libraries(prompt_lookup_decoding_lm PRIVATE openvino::runtime)
+set_target_properties(prompt_lookup_decoding_lm PROPERTIES CXX_STANDARD 17)
+set_target_properties(prompt_lookup_decoding_lm PROPERTIES CXX_STANDARD_REQUIRED ON)

text_generation/causal_lm/cpp/README.md (+9 -2)
@@ -36,14 +36,18 @@ The program loads a tokenizer, a detokenizer and a model (`.xml` and `.bin`) to
 
 The program loads a tokenizer, a detokenizer and a model (`.xml` and `.bin`) to OpenVINO. A prompt is tokenized and passed to the model. The model predicts a distribution over the next tokens, and group beam search samples from that distribution to explore possible sequences. The result is converted to chars and printed.
 
-### speculative_sampling_lm
+### speculative_decoding_lm
 
 Speculative decoding (or [assisted-generation](https://huggingface.co/blog/assisted-generation#understanding-text-generation-latency) in HF terminology) is a recent technique that speeds up token generation when an additional smaller draft model is used alongside the main model.
 
 Speculative decoding works as follows. The draft model predicts the next K tokens one by one in an autoregressive manner, while the main model validates these predictions and corrects them if necessary. We go through each predicted token, and if a difference is detected between the draft and main model, we stop and keep the last token predicted by the main model. Then the draft model gets the latest main prediction and again tries to predict the next K tokens, repeating the cycle.
 
 This approach reduces the need for multiple infer requests to the main model, enhancing performance. For instance, in more predictable parts of text generation, the draft model can, in best-case scenarios, generate the next K tokens that exactly match the target. In that case they are validated in a single inference request to the main model (which is bigger, more accurate but slower) instead of running K subsequent requests. More details can be found in the original papers https://arxiv.org/pdf/2211.17192.pdf, https://arxiv.org/pdf/2302.01318.pdf
 
+### prompt_lookup_decoding_lm
+
+[Prompt Lookup decoding](https://github.com/apoorvumang/prompt-lookup-decoding) is an [assisted-generation](https://huggingface.co/blog/assisted-generation#understanding-text-generation-latency) technique where the draft model is replaced with simple string matching over the prompt to generate candidate token sequences. This method is highly effective for input-grounded generation (summarization, document QA, multi-turn chat, code editing), where there is high n-gram overlap between the LLM input (prompt) and the LLM output. The overlap can be entity names, phrases, or code chunks that the LLM directly copies from the input while generating the output. Prompt lookup exploits this pattern to speed up autoregressive decoding in LLMs, resulting in significant speedups with no effect on output quality.
+
 > [!NOTE]
 >Models should belong to the same family and have the same tokenizers.
 
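To make the validate-and-correct loop described in the speculative decoding paragraph above concrete, here is a minimal C++ sketch. It is not the code added by this commit: `NextTokenFn`, `draft` and `main_model` are hypothetical stand-ins for real OpenVINO infer requests, and the main model is shown checking tokens one by one, whereas the real sample validates all K draft tokens in a single inference request.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

using Tokens = std::vector<int64_t>;
// Hypothetical greedy "next token for this prefix" callback standing in for a model.
using NextTokenFn = std::function<int64_t(const Tokens&)>;

// One draft-then-validate cycle; returns how many tokens were appended to `sequence`.
size_t speculative_step(Tokens& sequence, const NextTokenFn& draft,
                        const NextTokenFn& main_model, size_t K) {
    const size_t base = sequence.size();
    // 1. The draft model predicts the next K tokens autoregressively.
    Tokens proposed = sequence;
    for (size_t i = 0; i < K; ++i) {
        proposed.push_back(draft(proposed));
    }
    // 2. The main model validates them; on the first mismatch we keep the main
    //    model's token, drop the rest of the draft, and the cycle restarts.
    Tokens prefix = sequence;
    size_t appended = 0;
    for (size_t i = 0; i < K; ++i) {
        int64_t main_token = main_model(prefix);  // the main model's own prediction
        sequence.push_back(main_token);           // always kept, match or not
        ++appended;
        if (main_token != proposed[base + i]) {
            break;  // disagreement detected: discard the remaining draft tokens
        }
        prefix.push_back(main_token);
    }
    return appended;
}

int main() {
    // Toy stand-ins: the draft always proposes token 1; the main model agrees
    // for the first two positions, then produces token 2.
    NextTokenFn draft = [](const Tokens&) { return int64_t{1}; };
    NextTokenFn main_model = [](const Tokens& t) { return t.size() < 5 ? int64_t{1} : int64_t{2}; };
    Tokens sequence = {7, 7, 7};
    std::cout << speculative_step(sequence, draft, main_model, 4) << " tokens appended\n";  // 3
}
```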
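The prompt lookup paragraph above says the draft model is replaced with string matching over the context. A minimal sketch of that idea, assuming a hypothetical `propose_candidates` helper rather than anything from this commit's prompt_lookup_decoding_lm.cpp: the trailing n-gram of the token ids generated so far is searched for earlier in the context, and the tokens that followed the most recent match become the draft candidates, which are then validated exactly as in speculative decoding.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical helper: find the most recent earlier occurrence of the trailing
// `ngram_size` tokens of `tokens`, and return up to `num_candidates` tokens that
// followed it. An empty result means "no repeated n-gram, decode one token normally".
std::vector<int64_t> propose_candidates(const std::vector<int64_t>& tokens,
                                        size_t ngram_size, size_t num_candidates) {
    if (tokens.size() <= ngram_size) {
        return {};
    }
    const size_t window_start = tokens.size() - ngram_size;  // the n-gram to look up
    for (size_t pos = window_start; pos-- > 0;) {            // scan right to left
        bool match = true;
        for (size_t i = 0; i < ngram_size; ++i) {
            if (tokens[pos + i] != tokens[window_start + i]) {
                match = false;
                break;
            }
        }
        if (match) {
            // Propose the tokens that followed this earlier occurrence of the n-gram.
            const size_t begin = pos + ngram_size;
            const size_t end = std::min(begin + num_candidates, window_start);
            return {tokens.begin() + begin, tokens.begin() + end};
        }
    }
    return {};
}

int main() {
    // Toy token ids: the trailing bigram {10, 20} also appears at the start,
    // so the tokens after that earlier occurrence are proposed as candidates.
    std::vector<int64_t> tokens = {10, 20, 7, 8, 9, 11, 30, 40, 10, 20};
    for (int64_t t : propose_candidates(tokens, /*ngram_size=*/2, /*num_candidates=*/4)) {
        std::cout << t << ' ';  // prints: 7 8 9 11
    }
    std::cout << '\n';
}
```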
@@ -96,19 +100,22 @@ convert_tokenizer .\TinyLlama-1.1B-Chat-v1.0\pytorch\dldt\FP16\ --output .\TinyL
 ### Usage:
 1. `greedy_causal_lm <MODEL_DIR> "<PROMPT>"`
 2. `beam_search_causal_lm <MODEL_DIR> "<PROMPT>"`
-2. `speculative_decoding_lm <DRAFT_MODEL_DIR> <MAIN_MODEL_DIR> "<PROMPT>"`
+3. `speculative_decoding_lm <DRAFT_MODEL_DIR> <MAIN_MODEL_DIR> "<PROMPT>"`
+4. `prompt_lookup_decoding_lm <MODEL_DIR> "<PROMPT>"`
 
 ### Examples:
 
 #### Linux/MacOS:
 1. `./build/greedy_causal_lm ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ "Why is the Sun yellow?"`
 2. `./build/beam_search_causal_lm ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ "Why is the Sun yellow?"`
 3. `./build/speculative_decoding_lm ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ ./Llama-2-7b-chat-hf/pytorch/dldt/FP16/ "Why is the Sun yellow?"`
+4. `./build/prompt_lookup_decoding_lm ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ "Why is the Sun yellow?"`
 
 #### Windows:
 1. `.\build\Release\greedy_causal_lm .\TinyLlama-1.1B-Chat-v1.0\pytorch\dldt\FP16\ "Why is the Sun yellow?"`
 2. `.\build\Release\beam_search_causal_lm .\TinyLlama-1.1B-Chat-v1.0\pytorch\dldt\FP16\ "Why is the Sun yellow?"`
 3. `.\build\Release\speculative_decoding_lm .\TinyLlama-1.1B-Chat-v1.0\pytorch\dldt\FP16\ .\Llama-2-7b-chat-hf\pytorch\dldt\FP16\ "Why is the Sun yellow?"`
+4. `.\build\Release\prompt_lookup_decoding_lm .\TinyLlama-1.1B-Chat-v1.0\pytorch\dldt\FP16\ "Why is the Sun yellow?"`
 
 To enable Unicode characters for Windows cmd open `Region` settings from `Control panel`. `Administrative`->`Change system locale`->`Beta: Use Unicode UTF-8 for worldwide language support`->`OK`. Reboot.
 
