Commit 699abb2

Add README.md
1 parent 3f103b2 commit 699abb2

1 file changed: +52 -0 lines changed

modules/llama_cpp_plugin/README.md

### Build instructions

This plugin should be built in the same fashion as the rest of the modules:

1. Check out the OpenVINO repository proper (https://github.com/openvinotoolkit/openvino).
2. Configure the CMake build of the OpenVINO repository, making sure to point the corresponding CMake option to the location of the `openvino_contrib` repository. The command below, executed in the `openvino` repo root, will configure the build so that modules other than the `llama_cpp_plugin` module are not built, to save build time - adjust the `-DBUILD_*` options if you need the other modules as well.

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release -DOPENVINO_EXTRA_MODULES=<PATH_TO_YOUR_CHECKED_OUT_OPENVINO_CONTRIB>/modules -DBUILD_java_api=OFF -DBUILD_nvidia_plugin=OFF -DBUILD_custom_operations=OFF -DBUILD_openvino_code=OFF -DBUILD_token_merging=OFF -DENABLE_PLUGINS_XML=ON .
```

3. Build the plugin either as part of the complete OpenVINO build by executing:

```bash
cmake --build build -j`nproc`
```

or separately by specifying only the `llama_cpp_plugin` target:

```bash
cmake --build build -j`nproc` -- llama_cpp_plugin
```
4. Now you can use the built `libllama_cpp_plugin.so` as a regular OV plugin with the device name `"LLAMA_CPP"` to load GGUF files directly and run inference on them via the OV API, with llama.cpp execution under the hood. Make sure that the plugin is discoverable by the OV runtime, e.g. by copying the built `libllama_cpp_plugin.so`, `libllama.so` and the autogenerated `plugins.xml` from the build location into your OV binaries location, or by setting `LD_LIBRARY_PATH` appropriately - see the sketch below.
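A minimal sketch of this discoverability step is shown here; the `<PATH_TO_...>` placeholders are assumptions and should be replaced with your actual build output and OpenVINO runtime directories, which depend on your build configuration and installation layout.

```bash
# Hypothetical locations - adjust to your actual build output and OV runtime directories
OV_BUILD_BIN_DIR=<PATH_TO_YOUR_OPENVINO_BUILD_BINARIES>
OV_RUNTIME_LIB_DIR=<PATH_TO_YOUR_OPENVINO_RUNTIME_LIBRARIES>

# Option 1: copy the plugin, the llama.cpp library and the generated plugins.xml
# next to the OV runtime binaries so the plugin is found automatically
cp "${OV_BUILD_BIN_DIR}/libllama_cpp_plugin.so" \
   "${OV_BUILD_BIN_DIR}/libllama.so" \
   "${OV_BUILD_BIN_DIR}/plugins.xml" \
   "${OV_RUNTIME_LIB_DIR}/"

# Option 2: keep the files where they were built and extend the dynamic loader search path instead
export LD_LIBRARY_PATH="${OV_BUILD_BIN_DIR}:${LD_LIBRARY_PATH}"
```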
#### Example of LLM inference code

```C++
// Note: the snippet assumes <algorithm>, <numeric> and <openvino/openvino.hpp> are included.
ov::Core core;
auto model = core.compile_model("model.gguf", "LLAMA_CPP");

// Prepare a 128-token prompt: fill input_ids with your token IDs;
// position_ids are simply 0..127 for the initial (prompt) pass.
auto input_ids = ov::Tensor(ov::element::Type_t::i64, {1, 128});
auto position_ids = ov::Tensor(ov::element::Type_t::i64, {1, 128});
std::iota(position_ids.data<int64_t>(), position_ids.data<int64_t>() + position_ids.get_size(), 0);

auto infer_request = model.create_infer_request();
infer_request.set_tensor("input_ids", input_ids);
infer_request.set_tensor("position_ids", position_ids);
infer_request.infer();

// Greedily select the next token from the logits of the last prompt position.
size_t vocab_size = infer_request.get_tensor("logits").get_shape().back();
float* logits = infer_request.get_tensor("logits").data<float>() + (input_ids.get_size() - 1) * vocab_size;
int64_t out_token = std::max_element(logits, logits + vocab_size) - logits;
```
The models obtained from a `.compile_model` call with the `LLAMA_CPP` plugin expose two inputs (`input_ids` and `position_ids`) and a single output (`logits`), with the same meaning as the corresponding arguments of the LLM models in the Hugging Face `transformers` library. The `attention_mask` and `beam_idx` inputs may be set as well, but have no effect on the execution.
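If you want to verify this interface on your side, the following is a minimal sketch (assuming the same hypothetical `model.gguf` path as in the example above, and OpenVINO headers and libraries available on your build paths) that prints the input and output names and shapes of the compiled model:

```C++
#include <iostream>

#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.compile_model("model.gguf", "LLAMA_CPP");

    // List the input and output tensors exposed by the compiled model
    for (const auto& input : model.inputs()) {
        std::cout << "input:  " << input.get_any_name() << " " << input.get_partial_shape() << std::endl;
    }
    for (const auto& output : model.outputs()) {
        std::cout << "output: " << output.get_any_name() << " " << output.get_partial_shape() << std::endl;
    }
    return 0;
}
```

Among the inputs you should see at least `input_ids` and `position_ids`, and `logits` as the single output.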
Only a batch size of 1 is currently supported.
