Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Whisper timestamp fix #1918

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

RyanMetcalfeInt8
Copy link
Contributor

@RyanMetcalfeInt8 RyanMetcalfeInt8 commented Mar 13, 2025

Closes #1855

@github-actions github-actions bot added the category: whisper Whisper pipeline label Mar 13, 2025
@ilya-lavrenov ilya-lavrenov added this to the 2025.2 milestone Mar 14, 2025
@ilya-lavrenov ilya-lavrenov added the bug Something isn't working label Mar 14, 2025
@RyanMetcalfeInt8
Copy link
Contributor Author

Just added 1 more commit here that aligns the logic between static / dynamic whisper pipelines when it comes to overall produced tokens & max_new_tokens. This if statement was removed from dynamic whisper pipeline some time back, but not from static pipeline... which causes really long audio segments processed by NPU to stop before transcribing the entire duration.

@RyanMetcalfeInt8
Copy link
Contributor Author

Hi @as-suvorov, could you take a look when you get a chance? thanks!

@@ -20,7 +20,8 @@ struct ExtractedSegments {
ExtractedSegments extract_segments(const std::vector<int64_t>& tokens,
const ov::genai::WhisperGenerationConfig& config,
const size_t nb_max_frames,
const float time_precision);
const float time_precision,
const float toffset = 0.f);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const float toffset = 0.f);
const float time_offset = 0.f);

for (size_t chunk_offset = 0; chunk_offset < input_features.n_frames; chunk_offset += segment_offset) {

const float chunk_toffset = chunk_offset * chunk_length_in_seconds;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const float chunk_toffset = chunk_offset * chunk_length_in_seconds;
const float chunk_time_offset = chunk_offset * chunk_length_in_seconds;

@@ -295,7 +295,14 @@ WhisperGenerateResult whisper_generate(const ov::genai::WhisperGenerationConfig&
const float time_precision = static_cast<float>(feature_extractor.chunk_length) / model_config.max_source_positions;
size_t segment_offset = 0;

OPENVINO_ASSERT(feature_extractor.sampling_rate != 0, "Sampling Rate for Feature Extractor is 0");
const float chunk_length_in_seconds =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment for whisper_pipeline_static.cpp

Suggested change
const float chunk_length_in_seconds =
const float frame_length_in_seconds =

@as-suvorov
Copy link
Contributor

@RyanMetcalfeInt8 thanks for PR!
In the gh issue you wrote that you also addressed "offset drift" #1855 (comment). Could you please explain what is the fix?

@RyanMetcalfeInt8
Copy link
Contributor Author

RyanMetcalfeInt8 commented Mar 24, 2025

@RyanMetcalfeInt8 thanks for PR! In the gh issue you wrote that you also addressed "offset drift" #1855 (comment). Could you please explain what is the fix?

sure. Well in this case the 'offset drift' issue that I refer to is caused is within the workaround code submitted by author of that issue:

    auto output = pipeline.generate(input, config);

    // unwrap chunks, TODO: handle error case
    auto chunks = output.chunks.value();

    float offset = 0.0;
    float last_chunk_length = 0.0;

    for (const auto& chunk : chunks) {
        if (chunk.start_ts == 0.0) {
            // this almost works, but offset becomes too small after a while. 
            offset += last_chunk_length;
        }
        last_chunk_length = chunk.end_ts;
        float abs_start = chunk.start_ts + offset;
        float abs_end = chunk.start_ts + offset;
        std::cout << abs_start << " - " << abs_end << ": " << chunk.text << std::endl;
    }

In other words, it's not really a functional workaround as it assumes that all chunk segments will be contiguous with no gaps -- offset keeps getting accumulated by the last_chunk_length. So, the abs_start starts to drift left for each 'gap' (of silence) between each segment.

@as-suvorov
Copy link
Contributor

@RyanMetcalfeInt8 thanks for PR! In the gh issue you wrote that you also addressed "offset drift" #1855 (comment). Could you please explain what is the fix?

sure. Well in this case the 'offset drift' issue that I refer to is caused is within the workaround code submitted by author of that issue:

    auto output = pipeline.generate(input, config);

    // unwrap chunks, TODO: handle error case
    auto chunks = output.chunks.value();

    float offset = 0.0;
    float last_chunk_length = 0.0;

    for (const auto& chunk : chunks) {
        if (chunk.start_ts == 0.0) {
            // this almost works, but offset becomes too small after a while. 
            offset += last_chunk_length;
        }
        last_chunk_length = chunk.end_ts;
        float abs_start = chunk.start_ts + offset;
        float abs_end = chunk.start_ts + offset;
        std::cout << abs_start << " - " << abs_end << ": " << chunk.text << std::endl;
    }

In other words, it's not really a functional workaround as it assumes that all chunk segments will be contiguous with no gaps -- offset keeps getting accumulated by the last_chunk_length. So, the abs_start starts to drift left for each 'gap' (of silence) between each segment.

What do you mean by gap between segments? How your implementation addresses this gaps? The code snippet and this PR implementations looks identical to me as chunk.end_ts in code snippet and segment_offset in whisper.cpp are basically the same offset. Maybe I'm missing something.
Maybe you have a sample to share so I can check this as well?

@RyanMetcalfeInt8
Copy link
Contributor Author

What do you mean by gap between segments? How your implementation addresses this gaps? The code snippet and this PR implementations looks identical to me as chunk.end_ts in code snippet and segment_offset in whisper.cpp are basically the same offset. Maybe I'm missing something. Maybe you have a sample to share so I can check this as well?

So for the 'fix' in this PR, an absolute time is calculated for each iteration of this loop: for (size_t chunk_offset = 0; chunk_offset < input_features.n_frames; chunk_offset += segment_offset), given chunk_offset. Essentially, given chunk_offset, calculate an absolute time offset, which then is added to the start / end time for the segments generated.

This offset ends up being a little bit (or a lot in some cases) different than if you were to accumulate (chunk.end_ts - chunk.start_ts).

Sure, let me prepare a sample with some illustrations from the Audacity project where it's pretty clear what the difference is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working category: whisper Whisper pipeline
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Whisper Pipeline] WhisperDecodedResultChunk::start_ts and end_ts without offset?
3 participants