Support for models from optimum-intel: they are loaded and converted on the fly without relying on the custom modeling that exists in vLLM. To enable optimum-intel conversion, set the environment variable VLLM_OPENVINO_OPTIMUM to 1 (leaving it unset or setting it to 0 switches back to vLLM modeling and disables optimum-intel usage).
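A minimal sketch of how the switch could be read from the environment (the helper name is hypothetical; only the VLLM_OPENVINO_OPTIMUM variable comes from this PR):

```python
import os

def optimum_enabled() -> bool:
    # Hypothetical helper: VLLM_OPENVINO_OPTIMUM=1 enables optimum-intel
    # conversion; unset or 0 falls back to vLLM's own modeling.
    return os.environ.get("VLLM_OPENVINO_OPTIMUM", "0") == "1"

os.environ["VLLM_OPENVINO_OPTIMUM"] = "1"
print(optimum_enabled())  # prints True
```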
To make an optimum-intel model compatible with vLLM KV-cache processing and able to accept vLLM-specific parameters, the model is first converted to OpenVINO format inside optimum-intel, and then a number of transformations are applied to produce a vLLM-compatible form:
The vLLM Sampler class is changed because it now accepts logits instead of hidden_states (due to a difference in how the ending MatMul is implemented; fusing it into the OV model is another optimization opportunity worth pursuing).
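To illustrate the interface change (a sketch only, not the actual vLLM Sampler implementation): once the sampler receives logits, greedy sampling reduces to an argmax over the vocabulary dimension, with no hidden_states-to-logits projection left on the sampler side.

```python
def greedy_sample(logits: list[float]) -> int:
    # Sketch: the input is already logits, so the ending MatMul
    # (hidden_states -> logits projection) happens earlier, inside the
    # model; greedy sampling is just an argmax over the vocabulary.
    best, best_idx = float("-inf"), -1
    for i, x in enumerate(logits):
        if x > best:
            best, best_idx = x, i
    return best_idx

print(greedy_sample([0.1, 2.5, -1.0, 0.7]))  # prints 1
```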
This support is limited: the transformations use overly narrow patterns, and the code contains explicit, intentional limitations for the sake of simplification (see TODOs). The glue nodes that make tensor layouts compatible around PageAttentionExtension are not optimized as part of this PR, in the hope (not confirmed) that compile_model will optimize them where possible (e.g. repeated Transposes).
Validated manually with: