This PR adds:
- [x] support perf metrics
Common Todos for Whisper support:
- [ ] Long-form audio support with [parallel
chunking](https://huggingface.co/blog/asr-chunking).
- [ ] update documentation
- [ ] add tests for the C++ and Python samples
- [ ] support streaming of timestamps
- [ ] expose only meaningful parameters in `GenerationConfig` (`task`,
`language`, `return_timestamps`, etc.)
- [ ] Move all whisper pipeline files to dedicated subfolder
- [ ] Whisper pipeline doesn't need a tokenizer; it uses only the detokenizer.
Implement detokenizer-only initialization for `ov::genai::Tokenizer`
- [ ] Check discrete GPU. Integrated GPU works as expected.
- [ ] Investigate use of `RemoteTensor` for GPU
- [ ] Add batch support
- [ ] Add sampler, inherit WhisperGenerationConfig from GenerationConfig
- [ ] Investigate language autodetection with a single decoder call
(without past)
- [ ] Update python bindings cmake to include whole directory instead of
explicit list of files
- [ ] Add samples with audio preparation examples
- [ ] Add links to audio files so users can download them in samples
- [ ] Move supported models list from samples README to common supported
models section
- [ ] Avoid building GenAI in each test job, as it takes a lot of time
- [ ] Double check FP32 support
- [ ] Fix sporadic test failures. Sometimes the Whisper model cannot be
downloaded from HF due to network issues
- [ ] Fix stop criteria. The current approach stops on `eos_token`, which is
not a speech token, but more speech tokens may follow it, and those are
currently skipped by mistake
- [ ] Fix Distil-Whisper accuracy to match HF
- [ ] Fix accuracy of English-only (`en`) models with timestamps to match HF
- [ ] Try to trim input_ids cache between chunks for long-form audio to
match HF
Completed:
- [x] support different languages, language autodetection
- [x] support translation
- [x] support timestamps
- [x] Long-form audio support with sequential chunking.
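
For readers unfamiliar with the idea, sequential chunking can be sketched roughly as below. This is only an illustration under simplifying assumptions, not the code in this PR: it cuts the audio into fixed 30-second windows (Whisper's encoder window) at 16 kHz and zero-pads the last one, whereas the actual implementation may choose each next window start differently, for example from predicted timestamps. The helper name `split_into_chunks` is hypothetical.

```python
SAMPLE_RATE = 16_000   # Hz; the sampling rate the pipeline expects
CHUNK_SECONDS = 30     # length of one Whisper encoder window


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split raw speech into consecutive fixed-length chunks.

    Hypothetical helper: the last chunk is zero-padded to the full length
    so every chunk has the same number of samples.
    """
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        chunk.extend([0.0] * (chunk_len - len(chunk)))  # pad the tail with silence
        chunks.append(chunk)
    return chunks
```

Each chunk would then be transcribed in order, one after another, and the partial results concatenated.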
Current limitations:
- No resampling during preprocessing. Input raw speech must have a 16 kHz
sampling rate
- No normalization during preprocessing. Input raw speech must be
normalized to approximately the [-1, 1] range
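
Until these limitations are lifted, callers have to prepare the audio themselves. A minimal pure-Python sketch of that preparation follows, assuming the audio is already decoded into a list of numeric samples. Both helper names are hypothetical, and the linear-interpolation resampler applies no anti-aliasing filter, so a dedicated audio library is likely a better choice for real inputs.

```python
def resample_linear(samples, src_rate, dst_rate=16_000):
    """Resample to dst_rate by linear interpolation (no anti-aliasing filter)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def peak_normalize(samples):
    """Scale samples so the largest absolute value is 1.0; silence is returned unchanged."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    return [s / peak for s in samples]
```

For example, 44.1 kHz input could be prepared with `peak_normalize(resample_linear(raw, 44_100))` before being passed to the pipeline.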
Tickets: CVS-147994, CVS-146010, CVS-152523