optimize first latency beam search for OVModelForCausalLM #695
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@@ -651,6 +764,954 @@ def _from_pretrained(

        return causal_model

    def _beam_search(
Could you specify, using comments, which parts have been modified from the original code? This would help a lot for review and future maintenance.
+1
What is the observed performance gain for first-token generation? There are a lot of methods overridden in this PR, which could lead to potential issues with future transformers compatibility for generation, so I would like to make sure the performance gain is significant before considering merging.
IMO this will be very heavy to maintain with the constant changes in the transformers lib, especially since the text generation API will be undergoing heavy refactoring soon. Instead of optimizing the generation strategy, would it not make more sense to optimize the first forward pass, with something along the lines of:

```python
def generate():
    if beam_search:  # or any generation strategy where this issue is observed
        self.first_beam_search_iteration = True
    else:
        self.first_beam_search_iteration = False
    return super().generate()

def forward():
    if self.first_beam_search_iteration:
        # deduplicate the beam-expanded inputs before the expensive first pass
        unique_inputs, inverse_order = torch.unique(inputs, dim=0, return_inverse=True)
        # we can also use what we know about how the inputs are duplicated to deduplicate them
        unique_outputs = super().forward(unique_inputs)
        # map the unique outputs back to the original (duplicated) batch order
        outputs = unique_outputs[inverse_order]
        self.first_beam_search_iteration = False
    else:
        outputs = super().forward(inputs)
    return outputs
```

I admit that this is more stateful and hacky than what's suggested in the PR, but it requires maintaining less code, until this duplication issue with beam search gets fixed in transformers.
@IlyasMoutawwakil, thank you for your suggestion; that is where I started from. The problem is that, in the non-stateful case, we need to know how the inputs were duplicated in order to duplicate the past key values, and that requires additional context (from the generation config) which is not available inside forward. Another problem is next_beam_idx, which has to be different before the second inference: it should contain the initial index duplication instead of the arange indices used for cache reordering.
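To make that constraint concrete, here is a minimal sketch of the re-expansion that would have to happen between the first and second steps in the non-stateful case; the helper name, cache layout, and beam-index example are assumptions for illustration, not optimum-intel internals.

```python
import torch

# Hypothetical helper, assuming a [batch, num_heads, seq_len, head_dim] cache layout:
# after a deduplicated first forward pass the cached key/value tensors cover only the
# unique prompts, so they would need to be repeated for every beam before step two.
def expand_past_key_values(past_key_values, num_beams):
    expanded = []
    for key, value in past_key_values:
        # repeat each unique prompt's cache num_beams times along the batch axis
        expanded.append(
            (
                key.repeat_interleave(num_beams, dim=0),
                value.repeat_interleave(num_beams, dim=0),
            )
        )
    return tuple(expanded)

# For the stateful model, the beam index used before the second inference would
# similarly need to map every beam back to its source prompt, e.g.
# [0, 0, 0, 1, 1, 1] for 2 prompts and 3 beams, instead of a plain arange.
```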
@IlyasMoutawwakil @echarlaix please take a look one more time; I significantly updated the code to reduce the number of overridden beam search methods.
Looks great, thanks for iterating on this @eaidova
What does this PR do?
This PR reduces first-token latency for the OVModelForCausalLM class when beam search decoding is selected. During generation, beam search is represented as a batch of sequences (generation batch size = num_input_prompts * num_beams). The generation API duplicates the initial input sequence for each beam before starting work, even though on the first step all of these sequences are identical (and, at the same time, the first inference for models with a cache is the most time-consuming part). The idea is to postpone the duplication of sequences across beams until after the first iteration is done (including the duplication of past key values and logits in the outputs); a rough sketch of the idea is shown below.
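A minimal sketch of this idea, assuming a standard transformers-style forward call that returns logits and past_key_values (the function and argument names are hypothetical, not the code added by this PR):

```python
import torch

# Sketch only: run the expensive first forward pass on the unique prompts,
# then duplicate the outputs across beams so the rest of beam search proceeds
# as if the inputs had been expanded up front.
def first_step_without_beam_duplication(model, input_ids, attention_mask, num_beams):
    # first forward pass on num_prompts sequences instead of num_prompts * num_beams copies
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, use_cache=True)

    # duplicate the logits for each beam so the beam scorer sees the expected batch size
    logits = outputs.logits.repeat_interleave(num_beams, dim=0)

    # duplicate the cached key/values in the same way
    past_key_values = tuple(
        (key.repeat_interleave(num_beams, dim=0), value.repeat_interleave(num_beams, dim=0))
        for key, value in outputs.past_key_values
    )
    return logits, past_key_values
```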
Before submitting