Support for optimum-intel models #4

Merged · 6 commits · Feb 27, 2024
Conversation

@slyalin (Owner) commented on Feb 26, 2024

Support for models from optimum-intel: they are loaded and converted on the fly, without relying on the custom modeling that exists in vLLM. To enable optimum-intel conversion, set the environment variable VLLM_OPENVINO_OPTIMUM to 1 (leaving it unset, or setting it to 0, switches back to vLLM modeling and disables optimum-intel usage).
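A minimal usage sketch of the switch described above, assuming this fork keeps vLLM's standard Python entry points (the variable name comes from this PR; the model is one of those validated below):

```python
import os

# Enable the optimum-intel path before vLLM is imported/initialized.
os.environ["VLLM_OPENVINO_OPTIMUM"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # loaded via optimum-intel, converted on the fly
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```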

To make an optimum-intel model compatible with vLLM's kv-cache processing and able to accept vLLM-specific parameters, the model is first converted to OpenVINO format inside optimum-intel, and then a number of transformations are applied to produce a vLLM-compatible form:

  1. The ReadValue -> ... -> SDPA pattern (with several variations) is recognized and replaced by PagedAttentionExtension plus a few gluing nodes (see the pattern-matching sketch after this list).
  2. The attention_mask input, if it exists in the original model and is used to generate position_ids, is replaced by a dedicated position_ids parameter.
  3. Specific patterns that derive max_context_len are replaced by a dedicated parameter.
  4. The weight compression that optimum-intel usually enables automatically when exporting big models is temporarily disabled, because it causes a visible difference in output that complicates accuracy validation (see the export sketch after this list).
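For item 1, here is a minimal sketch of what such a graph rewrite looks like with OpenVINO's Python pattern-matching API. PagedAttentionExtension is the custom op this PR introduces, so the callback only indicates where the rewiring would happen; everything else is standard openvino.runtime.passes usage:

```python
from openvino.runtime.passes import Manager, Matcher, MatcherPass, WrapType

class SDPAToPagedAttention(MatcherPass):
    def __init__(self):
        super().__init__()
        # Match any ScaledDotProductAttention node; the real pass also walks
        # back to the ReadValue kv-cache states feeding it.
        sdpa = WrapType("opset13.ScaledDotProductAttention")

        def callback(matcher: Matcher) -> bool:
            root = matcher.get_match_root()
            # The real transformation rewires Q/K/V and the cache tensors into
            # a PagedAttentionExtension node plus the gluing (layout) nodes.
            print(f"matched SDPA: {root.get_friendly_name()}")
            return False  # return True once the graph has actually been modified

        self.register_matcher(Matcher(sdpa, "SDPAToPagedAttention"), callback)

manager = Manager()
manager.register_pass(SDPAToPagedAttention())
# manager.run_passes(ov_model)  # ov_model: the openvino.runtime.Model to rewrite
```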
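For item 4, the automatic compression can be opted out of at export time. A sketch assuming optimum-intel's load_in_8bit flag (which controls the default 8-bit weight compression applied to large models):

```python
from optimum.intel import OVModelForCausalLM

# Export the Transformers checkpoint to OpenVINO on the fly, keeping
# full-precision weights so outputs stay comparable during accuracy validation.
model = OVModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # one of the models validated below
    export=True,
    load_in_8bit=False,  # disable the automatic weight compression
)
```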

The vLLM Sampler class is changed: it now accepts logits instead of hidden_states (due to a difference in how the ending MatMul is implemented -- another optimization opportunity that is worth fusing into the OV model).
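Illustratively (this is the shape of the change, not vLLM's actual code):

```python
import torch

# Before: the sampler received hidden states and applied the lm_head MatMul itself.
def sample_from_hidden(hidden_states: torch.Tensor, lm_head: torch.Tensor) -> torch.Tensor:
    logits = hidden_states @ lm_head.T  # the "ending MatMul"
    return torch.argmax(logits, dim=-1)

# After: the exported OV model already contains the lm_head MatMul,
# so the sampler consumes logits directly.
def sample_from_logits(logits: torch.Tensor) -> torch.Tensor:
    return torch.argmax(logits, dim=-1)
```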

This support is limited: the patterns used in the transformations are narrow, and there are explicit, intentional limitations in the code for the purpose of simplification (see TODOs). The glue nodes that make tensor layouts compatible around PagedAttentionExtension are not optimized as part of this PR, in the hope (not confirmed) that they will be optimized in compile_model where possible (e.g. repeated Transposes, illustrated below).
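To illustrate the "repeated Transposes" case, a hypothetical pair of gluing nodes that compose to a no-op and that the plugin may fold away at compile_model time:

```python
import numpy as np
import openvino.runtime as ov
from openvino.runtime import opset13 as ops

x = ops.parameter([2, 3, 4], ov.Type.f32, name="x")
perm = ops.constant(np.array([0, 2, 1], dtype=np.int64))
t1 = ops.transpose(x, perm)   # glue node adjusting layout around the attention op
t2 = ops.transpose(t1, perm)  # undoes t1: a candidate for fusion
model = ov.Model([t2], [x], "glue_demo")
compiled = ov.Core().compile_model(model, "CPU")  # fusion, if any, happens here
```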

Validated manually with:

  • opt-125, tiny-llama, mistral-7b, EleutherAI/gpt-j-6b (matches vLLM's original modeling, also run in OV),
  • Qwen/Qwen1.5-0.5B-Chat (generates consistently; no comparison, since there is no vLLM custom modeling for the Qwen2 arch; based on the not-yet-merged "add openvino export configs" huggingface/optimum-intel#568).

@slyalin marked this pull request as ready for review on February 26, 2024, 13:58
@slyalin merged commit 8a9862f into openvino on Feb 27, 2024
@slyalin pushed a commit that referenced this pull request on Mar 21, 2024: "…utor / Adapt OpenVINO CPU plugin implementation"