These examples showcase inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen` and other models with the same signature. The applications have few configuration options, to encourage the reader to explore and modify the source code. Loading `user_ov_extensions`, provided by `openvino-tokenizers`, into `ov::Core` enables tokenization. Run `convert_tokenizer` to generate IRs for the samples. `group_beam_searcher.hpp` implements the algorithm of the same name, which is used by `beam_search_causal_lm`. There is also a Jupyter notebook that provides an example of an LLM-powered chatbot in Python.
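For illustration, here is a minimal sketch of that wiring in C++. The extension library name and the `openvino_tokenizer.xml` IR name are assumptions to adjust for your build; the sketch also assumes string-tensor input support for the tokenizer model:

```cpp
// Sketch: register the openvino-tokenizers extension, then tokenize a prompt.
// The library name and IR path below are assumptions; adjust to your build.
#include <openvino/openvino.hpp>

#include <iostream>
#include <string>

int main() {
    ov::Core core;
    // Custom tokenizer ops must be registered before reading the tokenizer IR.
    core.add_extension("libuser_ov_extensions.so");  // user_ov_extensions.dll on Windows
    ov::InferRequest tokenizer = core.compile_model(
        "./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/openvino_tokenizer.xml", "CPU")
        .create_infer_request();
    std::string prompt = "Why is the Sun yellow?";
    // The tokenizer consumes a batch of strings and produces input_ids/attention_mask.
    tokenizer.set_input_tensor(ov::Tensor{ov::element::string, {1}, &prompt});
    tokenizer.infer();
    ov::Tensor input_ids = tokenizer.get_tensor("input_ids");
    std::cout << "prompt length in tokens: " << input_ids.get_shape().back() << '\n';
}
```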
`greedy_causal_lm` loads a tokenizer, a detokenizer and a model (`.xml` and `.bin`) to OpenVINO. A prompt is tokenized and passed to the model. The model greedily generates token by token until the special end-of-sequence (EOS) token is obtained. The predicted tokens are converted to chars and printed in a streaming fashion.
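To make the loop concrete, here is a hedged sketch of the greedy decoding core, assuming the stateful TinyLlama export produced by the conversion commands shown later (inputs `input_ids`, `attention_mask`, `position_ids`, `beam_idx`; output `logits`). The prompt ids are placeholders, the EOS id is a model-specific assumption, and the streaming detokenization is simplified away:

```cpp
// Sketch of the greedy loop for a stateful export (KV-cache kept inside the
// model). Input/output names and the EOS id are assumptions for TinyLlama.
#include <openvino/openvino.hpp>

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    ov::Core core;
    ov::InferRequest lm = core.compile_model(
        "./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/openvino_model.xml", "CPU")
        .create_infer_request();
    // Placeholder token ids; in the sample these come from the tokenizer model.
    std::vector<int64_t> prompt{1, 3750, 338, 278, 8991, 13328, 29973};

    // First pass: feed the whole prompt.
    lm.get_tensor("input_ids").set_shape({1, prompt.size()});
    std::copy(prompt.begin(), prompt.end(), lm.get_tensor("input_ids").data<int64_t>());
    lm.get_tensor("attention_mask").set_shape({1, prompt.size()});
    std::fill_n(lm.get_tensor("attention_mask").data<int64_t>(), prompt.size(), 1);
    lm.get_tensor("position_ids").set_shape({1, prompt.size()});
    std::iota(lm.get_tensor("position_ids").data<int64_t>(),
              lm.get_tensor("position_ids").data<int64_t>() + prompt.size(), 0);
    lm.get_tensor("beam_idx").set_shape({1});
    lm.get_tensor("beam_idx").data<int32_t>()[0] = 0;

    constexpr int64_t EOS_TOKEN = 2;  // model-specific assumption
    size_t pos = prompt.size();
    for (;;) {  // no max-length cap in this sketch
        lm.infer();
        ov::Tensor logits = lm.get_tensor("logits");
        size_t vocab_size = logits.get_shape().back();
        // Argmax over the distribution for the last position: the greedy choice.
        const float* last = logits.data<float>() + logits.get_size() - vocab_size;
        int64_t next = std::max_element(last, last + vocab_size) - last;
        if (next == EOS_TOKEN) break;
        std::cout << next << ' ' << std::flush;  // the real sample streams detokenized text
        // Subsequent passes feed a single token; the state carries the history.
        lm.get_tensor("input_ids").set_shape({1, 1});
        lm.get_tensor("input_ids").data<int64_t>()[0] = next;
        lm.get_tensor("attention_mask").set_shape({1, pos + 1});
        std::fill_n(lm.get_tensor("attention_mask").data<int64_t>(), pos + 1, 1);
        lm.get_tensor("position_ids").set_shape({1, 1});
        lm.get_tensor("position_ids").data<int64_t>()[0] = static_cast<int64_t>(pos);
        ++pos;
    }
}
```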
`beam_search_causal_lm` loads a tokenizer, a detokenizer and a model (`.xml` and `.bin`) to OpenVINO. A prompt is tokenized and passed to the model. The model predicts a distribution over the next tokens, and group beam search uses that distribution to explore possible sequences. The result is converted to chars and printed.
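As a rough illustration of the core expansion step (not the grouped, diversity-penalized variant that `group_beam_searcher.hpp` actually implements), one beam-search step could look like the sketch below. `Beam` and `expand` are names invented for this sketch, and finished-beam and EOS handling are omitted:

```cpp
// Sketch: every live beam is extended by every vocabulary token, candidates
// are ranked by cumulative log-probability, and the best beam_count survive.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Beam {
    std::vector<int64_t> tokens;  // generated token ids so far
    float score = 0.0f;           // cumulative log-probability
};

std::vector<Beam> expand(const std::vector<Beam>& beams,
                         const std::vector<std::vector<float>>& logits,  // [beam][vocab]
                         size_t beam_count) {
    std::vector<Beam> candidates;
    for (size_t b = 0; b < beams.size(); ++b) {
        // log-softmax over this beam's next-token distribution
        float max_logit = *std::max_element(logits[b].begin(), logits[b].end());
        float log_sum = 0.0f;
        for (float l : logits[b]) log_sum += std::exp(l - max_logit);
        log_sum = max_logit + std::log(log_sum);
        for (size_t t = 0; t < logits[b].size(); ++t) {
            Beam cand = beams[b];
            cand.tokens.push_back(static_cast<int64_t>(t));
            cand.score += logits[b][t] - log_sum;
            candidates.push_back(std::move(cand));
        }
    }
    // Keep the beam_count highest-scoring candidates.
    size_t keep = std::min(beam_count, candidates.size());
    std::partial_sort(candidates.begin(), candidates.begin() + keep, candidates.end(),
                      [](const Beam& a, const Beam& b) { return a.score > b.score; });
    candidates.resize(keep);
    return candidates;
}
```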
Install OpenVINO Archives >= 2023.3. `<INSTALL_DIR>` below refers to the extraction location.

Linux/macOS:

```sh
git submodule update --init
source <INSTALL_DIR>/setupvars.sh
cmake -DCMAKE_BUILD_TYPE=Release -S ./ -B ./build/ && cmake --build ./build/ -j
```

Windows:

```bat
git submodule update --init
<INSTALL_DIR>\setupvars.bat
cmake -S .\ -B .\build\ && cmake --build .\build\ --config Release -j
```
The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

Linux/macOS:

```sh
source <INSTALL_DIR>/setupvars.sh
python3 -m pip install --upgrade-strategy eager "optimum>=1.14" -r ../../../llm_bench/python/requirements.txt ../../../thirdparty/openvino_contrib/modules/custom_operations/[transformers] --extra-index-url https://download.pytorch.org/whl/cpu
python3 ../../../llm_bench/python/convert.py --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --output_dir ./TinyLlama-1.1B-Chat-v1.0/ --precision FP16 --stateful
convert_tokenizer ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ --output ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ --with-detokenizer --trust-remote-code
```

Windows:

```bat
<INSTALL_DIR>\setupvars.bat
python -m pip install --upgrade-strategy eager "optimum>=1.14" -r ..\..\..\llm_bench\python\requirements.txt ..\..\..\thirdparty\openvino_contrib\modules\custom_operations\[transformers] --extra-index-url https://download.pytorch.org/whl/cpu
python ..\..\..\llm_bench\python\convert.py --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --output_dir .\TinyLlama-1.1B-Chat-v1.0\ --precision FP16 --stateful
convert_tokenizer .\TinyLlama-1.1B-Chat-v1.0\pytorch\dldt\FP16\ --output .\TinyLlama-1.1B-Chat-v1.0\pytorch\dldt\FP16\ --with-detokenizer --trust-remote-code
```
Usage:

```sh
greedy_causal_lm <MODEL_DIR> "<PROMPT>"
beam_search_causal_lm <MODEL_DIR> "<PROMPT>"
```

Examples:

```sh
./build/greedy_causal_lm ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ "Why is the Sun yellow?"
./build/beam_search_causal_lm ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ "Why is the Sun yellow?"
```
To enable Unicode characters for Windows cmd, open `Region` settings from `Control panel`. `Administrative` -> `Change system locale` -> `Beta: Use Unicode UTF-8 for worldwide language support` -> `OK`. Reboot.
- chatglm
  - https://huggingface.co/THUDM/chatglm2-6b - refer to "chatglm2-6b - AttributeError: can't set attribute" in case of `AttributeError`
  - https://huggingface.co/THUDM/chatglm3-6b
- LLaMA 2
  - https://huggingface.co/meta-llama/Llama-2-13b-chat-hf
  - https://huggingface.co/meta-llama/Llama-2-13b-hf
  - https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
  - https://huggingface.co/meta-llama/Llama-2-7b-hf
  - https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
  - https://huggingface.co/meta-llama/Llama-2-70b-hf
- Llama2-7b-WhoIsHarryPotter
- OpenLLaMA
- TinyLlama
- Qwen
  - https://huggingface.co/Qwen/Qwen-7B-Chat
  - https://huggingface.co/Qwen/Qwen-7B-Chat-Int4 - refer to "Qwen-7B-Chat-Int4 - Torch not compiled with CUDA enabled" in case of `AssertionError`
This pipeline can work with other similar topologies produced by `optimum-intel` with the same model signature.