Whisper timestamp fix #1918

RyanMetcalfeInt8 · 2025-03-13T21:36:30Z

RyanMetcalfeInt8 · 2025-03-17T20:13:10Z

I just added one more commit that aligns the static and dynamic Whisper pipelines in how they handle the overall number of produced tokens versus max_new_tokens. The if statement that broke out of the chunk loop based on token count was removed from the dynamic Whisper pipeline some time ago, but not from the static pipeline, which caused very long audio processed on NPU to stop before the entire duration was transcribed.

RyanMetcalfeInt8 · 2025-03-20T16:05:04Z

Hi @as-suvorov, could you take a look when you get a chance? thanks!

as-suvorov · 2025-03-24T15:41:57Z

src/cpp/src/whisper/timestamps.hpp

@@ -20,7 +20,8 @@ struct ExtractedSegments {
 ExtractedSegments extract_segments(const std::vector<int64_t>& tokens,
                                   const ov::genai::WhisperGenerationConfig& config,
                                   const size_t nb_max_frames,
-                                   const float time_precision);
+                                   const float time_precision,
+                                   const float toffset = 0.f);


Suggested change

const float toffset = 0.f);

const float time_offset = 0.f);

as-suvorov · 2025-03-24T15:42:17Z

src/cpp/src/whisper/whisper.cpp

    for (size_t chunk_offset = 0; chunk_offset < input_features.n_frames; chunk_offset += segment_offset) {
+
+        const float chunk_toffset = chunk_offset * chunk_length_in_seconds;


Suggested change

const float chunk_toffset = chunk_offset * chunk_length_in_seconds;

const float chunk_time_offset = chunk_offset * chunk_length_in_seconds;

as-suvorov · 2025-03-24T15:46:24Z

src/cpp/src/whisper/whisper.cpp

@@ -295,7 +295,14 @@ WhisperGenerateResult whisper_generate(const ov::genai::WhisperGenerationConfig&
    const float time_precision = static_cast<float>(feature_extractor.chunk_length) / model_config.max_source_positions;
    size_t segment_offset = 0;

+    OPENVINO_ASSERT(feature_extractor.sampling_rate != 0, "Sampling Rate for Feature Extractor is 0");
+    const float chunk_length_in_seconds =


same comment for whisper_pipeline_static.cpp

Suggested change

const float chunk_length_in_seconds =

const float frame_length_in_seconds =

as-suvorov · 2025-03-24T15:51:56Z

@RyanMetcalfeInt8 thanks for the PR!
In the GitHub issue you wrote that you also addressed the "offset drift" #1855 (comment). Could you please explain what the fix is?

RyanMetcalfeInt8 · 2025-03-24T15:58:13Z

Sure. In this case, the 'offset drift' issue that I refer to is caused by the workaround code submitted by the author of that issue:

    auto output = pipeline.generate(input, config);

    // unwrap chunks, TODO: handle error case
    auto chunks = output.chunks.value();

    float offset = 0.0;
    float last_chunk_length = 0.0;

    for (const auto& chunk : chunks) {
        if (chunk.start_ts == 0.0) {
            // this almost works, but offset becomes too small after a while. 
            offset += last_chunk_length;
        }
        last_chunk_length = chunk.end_ts;
        float abs_start = chunk.start_ts + offset;
        float abs_end = chunk.end_ts + offset;
        std::cout << abs_start << " - " << abs_end << ": " << chunk.text << std::endl;
    }

In other words, it's not really a functional workaround, because it assumes that all chunk segments are contiguous with no gaps: offset only ever accumulates last_chunk_length. As a result, abs_start drifts left (earlier) with each 'gap' (of silence) between segments.

as-suvorov · 2025-03-24T16:19:46Z

What do you mean by a gap between segments? How does your implementation address these gaps? The code snippet and this PR's implementation look identical to me, since chunk.end_ts in the code snippet and segment_offset in whisper.cpp are basically the same offset. Maybe I'm missing something.
Do you have a sample you could share so I can check this as well?

RyanMetcalfeInt8 · 2025-03-24T16:32:55Z

For the 'fix' in this PR, an absolute time is calculated on each iteration of this loop: for (size_t chunk_offset = 0; chunk_offset < input_features.n_frames; chunk_offset += segment_offset), based on chunk_offset. That absolute time offset is then added to the start and end times of the segments generated in that iteration.

This offset ends up being a little bit (or, in some cases, a lot) different from what you get by accumulating (chunk.end_ts - chunk.start_ts).

Sure, let me prepare a sample with some illustrations from the Audacity project where it's pretty clear what the difference is.

RyanMetcalfeInt8 added 2 commits March 13, 2025 13:30

whisper: Account for absolute time of each chunk to use as segment offset

3c02522

whisper static: Account for absolute time of each chunk to use as segment offset

615c3fa

github-actions bot added the category: whisper Whisper pipeline label Mar 13, 2025

add toffset to missed m_start in extract_segments

d2cab4d

ilya-lavrenov assigned as-suvorov Mar 14, 2025

ilya-lavrenov added this to the 2025.2 milestone Mar 14, 2025

ilya-lavrenov added the bug Something isn't working label Mar 14, 2025

whisper_pipeline_static: Don't break out of chunk loop based on token count

3681d24

RyanMetcalfeInt8 added 3 commits March 17, 2025 19:34

Merge branch 'master' into whisper_timestamp_fix

8c84347

Merge branch 'master' into whisper_timestamp_fix

e871749

Merge branch 'master' into whisper_timestamp_fix

f3a957d

as-suvorov reviewed Mar 24, 2025
