Whisper timestamp fix #1918
base: master
Conversation
Just added one more commit here that aligns the logic between the static and dynamic whisper pipelines with respect to the overall number of produced tokens and max_new_tokens. This if statement was removed from the dynamic whisper pipeline some time back, but not from the static pipeline, which causes transcription of really long audio processed by the NPU to stop before the entire duration is covered.
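For context, a rough, hedged sketch of the kind of check being discussed (the function names, parameters, and loop structure below are illustrative assumptions, not the actual static or dynamic pipeline code): a guard inside the chunk loop that compares the overall produced token count against max_new_tokens stops decoding of the remaining chunks, which is what cuts long recordings short.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative sketch only. `decode_chunk` stands in for one decoder pass over
// a single 30 s feature chunk and is not a real pipeline API.
std::vector<int64_t> transcribe_sketch(
        const std::vector<std::vector<float>>& feature_chunks,
        size_t max_new_tokens,
        const std::function<std::vector<int64_t>(const std::vector<float>&)>& decode_chunk) {
    std::vector<int64_t> all_tokens;
    for (const auto& chunk : feature_chunks) {
        // The guard in question: once the overall token count reaches
        // max_new_tokens, decoding of all later chunks is skipped, so the
        // tail of a long recording is never transcribed.
        if (all_tokens.size() >= max_new_tokens) {
            break;
        }
        const std::vector<int64_t> chunk_tokens = decode_chunk(chunk);
        all_tokens.insert(all_tokens.end(), chunk_tokens.begin(), chunk_tokens.end());
    }
    return all_tokens;
}
```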
Hi @as-suvorov, could you take a look when you get a chance? Thanks!
```diff
@@ -20,7 +20,8 @@ struct ExtractedSegments {
 ExtractedSegments extract_segments(const std::vector<int64_t>& tokens,
                                    const ov::genai::WhisperGenerationConfig& config,
                                    const size_t nb_max_frames,
-                                   const float time_precision);
+                                   const float time_precision,
+                                   const float toffset = 0.f);
```
Suggested change:

```diff
-                                   const float toffset = 0.f);
+                                   const float time_offset = 0.f);
```
```diff
 for (size_t chunk_offset = 0; chunk_offset < input_features.n_frames; chunk_offset += segment_offset) {

+    const float chunk_toffset = chunk_offset * chunk_length_in_seconds;
```
Suggested change:

```diff
-    const float chunk_toffset = chunk_offset * chunk_length_in_seconds;
+    const float chunk_time_offset = chunk_offset * chunk_length_in_seconds;
```
```diff
@@ -295,7 +295,14 @@ WhisperGenerateResult whisper_generate(const ov::genai::WhisperGenerationConfig&
 const float time_precision = static_cast<float>(feature_extractor.chunk_length) / model_config.max_source_positions;
 size_t segment_offset = 0;

+OPENVINO_ASSERT(feature_extractor.sampling_rate != 0, "Sampling Rate for Feature Extractor is 0");
+const float chunk_length_in_seconds =
```
same comment for whisper_pipeline_static.cpp
Suggested change:

```diff
-const float chunk_length_in_seconds =
+const float frame_length_in_seconds =
```
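The actual expression is cut off in the hunk above, so this is only a hedged sketch of the idea behind the reviewer's suggested naming: converting a frame index into seconds from the feature extractor's parameters. The FeatureExtractorSketch struct, its hop_length field, and the 16 kHz / 160-sample values are assumptions based on the usual Whisper feature setup, not values taken from this PR.

```cpp
#include <cstddef>
#include <openvino/core/except.hpp>  // OPENVINO_ASSERT

// Assumed fields; 16 kHz audio and a 160-sample hop (10 ms per mel frame)
// are the common Whisper defaults, not read from this diff.
struct FeatureExtractorSketch {
    size_t sampling_rate = 16000;
    size_t hop_length = 160;
};

// chunk_offset is a frame index into the mel features, so multiplying it by
// the per-frame duration yields the absolute start time of the chunk in seconds.
float absolute_chunk_time_offset(size_t chunk_offset, const FeatureExtractorSketch& fe) {
    OPENVINO_ASSERT(fe.sampling_rate != 0, "Sampling Rate for Feature Extractor is 0");
    const float frame_length_in_seconds =
        static_cast<float>(fe.hop_length) / static_cast<float>(fe.sampling_rate);
    return static_cast<float>(chunk_offset) * frame_length_in_seconds;
}
```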
@RyanMetcalfeInt8 thanks for the PR!
Sure. In this case, the 'offset drift' issue that I refer to is caused by the workaround code submitted by the author of that issue.

In other words, it's not really a functional workaround, as it assumes that all chunk segments will be contiguous with no gaps -- the offset keeps getting accumulated by (chunk.end_ts - chunk.start_ts) of each returned chunk.
What do you mean by a gap between segments? How does your implementation address these gaps? The code snippet and this PR's implementation look identical to me.
So for the 'fix' in this PR, an absolute time offset is calculated on each iteration of the chunk loop shown above. This offset ends up being a little bit (or, in some cases, a lot) different than if you were to accumulate (chunk.end_ts - chunk.start_ts).

Sure, let me prepare a sample with some illustrations from the Audacity project where the difference is pretty clear.
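To make the drift concrete before that sample is ready, here is a small self-contained illustration. The segment times are invented, and the accumulation loop is only a simplified model of the workaround, not its exact code: segments that do not fully cover their 30 s chunk make the accumulated offset fall behind the absolute one, and the gap compounds.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Segment timestamps relative to the start of their 30 s feature chunk.
struct SegmentSketch { float start_ts; float end_ts; };

int main() {
    // Invented example: two chunks whose decoded segments do not fully cover
    // the chunk (silence before, between, and after speech).
    const std::vector<std::vector<SegmentSketch>> chunks = {
        {{0.0f, 12.0f}, {13.0f, 27.5f}},  // 1 s gap between segments, 2.5 s trailing gap
        {{0.5f, 29.0f}},                  // 0.5 s leading gap
    };
    const float chunk_length_in_seconds = 30.0f;  // assumed fixed Whisper window

    float accumulated = 0.0f;  // workaround model: place segments back to back
    for (size_t i = 0; i < chunks.size(); ++i) {
        const float chunk_time_offset = i * chunk_length_in_seconds;  // absolute offset
        for (const auto& seg : chunks[i]) {
            const float duration = seg.end_ts - seg.start_ts;
            std::printf("accumulated: %5.1f-%5.1f   absolute: %5.1f-%5.1f\n",
                        accumulated, accumulated + duration,
                        chunk_time_offset + seg.start_ts, chunk_time_offset + seg.end_ts);
            accumulated += duration;  // drifts by every gap that is skipped
        }
    }
    return 0;
}
```

With these invented numbers, the second chunk's segment lands at 26.5-55.0 s under accumulation but at 30.5-59.0 s with the absolute offset: a 4 s drift after just two chunks, growing with every gap.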
Closes #1855