Support for optimum-intel models #4

Merged · 6 commits · Feb 27, 2024
Conversation

@slyalin (Owner) commented on Feb 26, 2024

Support for models from optimum-intel: they are loaded and converted on the fly, without relying on the custom modeling that exists in vLLM. To enable optimum-intel conversion, set the environment variable VLLM_OPENVINO_OPTIMUM to 1 (leaving it unset, or setting it to 0, switches back to vLLM modeling and disables optimum-intel usage).
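A minimal usage sketch of the switch described above, assuming this fork keeps vLLM's standard Python entry points (the variable name comes from this PR; the model is one of those validated below):

```python
import os

# Enable the optimum-intel path before vLLM is imported/initialized.
os.environ["VLLM_OPENVINO_OPTIMUM"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # loaded via optimum-intel, converted on the fly
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```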

To make an optimum-intel model compatible with vLLM's kv-cache processing and able to accept vLLM-specific parameters, the model is first converted to OpenVINO format inside optimum-intel, and then a number of transformations are applied to produce a vLLM-compatible form:

  1. The ReadValue -> ... -> SDPA pattern (with several variations) is recognized and replaced by PagedAttentionExtension plus a few gluing nodes (see the pattern-matching sketch after this list).
  2. The attention_mask input, if it exists in the original model and is used to generate position_ids, is replaced by a dedicated position_ids parameter.
  3. Specific patterns that derive max_context_len are replaced by a dedicated parameter.
  4. The weight compression that optimum-intel usually enables automatically when exporting big models is temporarily disabled, because it causes a visible difference in output that complicates accuracy validation (see the export sketch after this list).
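For item 1, here is a minimal sketch of what such a graph rewrite looks like with OpenVINO's Python pattern-matching API. PagedAttentionExtension is the custom op this PR introduces, so the callback only indicates where the rewiring would happen; everything else is standard openvino.runtime.passes usage:

```python
from openvino.runtime.passes import Manager, Matcher, MatcherPass, WrapType

class SDPAToPagedAttention(MatcherPass):
    def __init__(self):
        super().__init__()
        # Match any ScaledDotProductAttention node; the real pass also walks
        # back to the ReadValue kv-cache states feeding it.
        sdpa = WrapType("opset13.ScaledDotProductAttention")

        def callback(matcher: Matcher) -> bool:
            root = matcher.get_match_root()
            # The real transformation rewires Q/K/V and the cache tensors into
            # a PagedAttentionExtension node plus the gluing (layout) nodes.
            print(f"matched SDPA: {root.get_friendly_name()}")
            return False  # return True once the graph has actually been modified

        self.register_matcher(Matcher(sdpa, "SDPAToPagedAttention"), callback)

manager = Manager()
manager.register_pass(SDPAToPagedAttention())
# manager.run_passes(ov_model)  # ov_model: the openvino.runtime.Model to rewrite
```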
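For item 4, the automatic compression can be opted out of at export time. A sketch assuming optimum-intel's load_in_8bit flag (which controls the default 8-bit weight compression applied to large models):

```python
from optimum.intel import OVModelForCausalLM

# Export the Transformers checkpoint to OpenVINO on the fly, keeping
# full-precision weights so outputs stay comparable during accuracy validation.
model = OVModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # one of the models validated below
    export=True,
    load_in_8bit=False,  # disable the automatic weight compression
)
```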

The vLLM Sampler class is changed: it now accepts logits instead of hidden_states (due to a difference in how the ending MatMul is implemented -- another optimization opportunity that is worth fusing into the OV model).
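Illustratively (this is the shape of the change, not vLLM's actual code):

```python
import torch

# Before: the sampler received hidden states and applied the lm_head MatMul itself.
def sample_from_hidden(hidden_states: torch.Tensor, lm_head: torch.Tensor) -> torch.Tensor:
    logits = hidden_states @ lm_head.T  # the "ending MatMul"
    return torch.argmax(logits, dim=-1)

# After: the exported OV model already contains the lm_head MatMul,
# so the sampler consumes logits directly.
def sample_from_logits(logits: torch.Tensor) -> torch.Tensor:
    return torch.argmax(logits, dim=-1)
```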

This support is limited: the patterns used in the transformations are narrow, and there are explicit, intentional limitations in the code for the purpose of simplification (see TODOs). The glue nodes that make tensor layouts compatible around PagedAttentionExtension are not optimized as part of this PR, in the hope (not confirmed) that they will be optimized in compile_model where possible (e.g. repeated Transposes, illustrated below).
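To illustrate the "repeated Transposes" case, a hypothetical pair of gluing nodes that compose to a no-op and that the plugin may fold away at compile_model time:

```python
import numpy as np
import openvino.runtime as ov
from openvino.runtime import opset13 as ops

x = ops.parameter([2, 3, 4], ov.Type.f32, name="x")
perm = ops.constant(np.array([0, 2, 1], dtype=np.int64))
t1 = ops.transpose(x, perm)   # glue node adjusting layout around the attention op
t2 = ops.transpose(t1, perm)  # undoes t1: a candidate for fusion
model = ov.Model([t2], [x], "glue_demo")
compiled = ov.Core().compile_model(model, "CPU")  # fusion, if any, happens here
```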

Validated manually with:

  • opt-125, tiny-llama, mistral-7b, EleutherAI/gpt-j-6b (matches vLLM's original modeling, also run in OV),
  • Qwen/Qwen1.5-0.5B-Chat (generates consistently; no comparison, since there is no vLLM custom modeling for the Qwen2 arch; based on the not-yet-merged "add openvino export configs" huggingface/optimum-intel#568).

@slyalin marked this pull request as ready for review on February 26, 2024, 13:58
@slyalin merged commit 8a9862f into openvino on Feb 27, 2024
@slyalin pushed a commit that referenced this pull request on Mar 21, 2024: "…utor / Adapt OpenVINO CPU plugin implementation"