This is a command-line tool for creating transcripts of conversations recorded as audio files. It uses the following AI models to achieve that:

- pyannote.audio for identifying speakers
- OpenAI Whisper for transcribing what the speakers say into text
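For orientation, here is a minimal sketch of the diarize-then-transcribe pattern these two models enable. This is not verbatim's implementation, just the general idea: it assumes the `pyannote.audio` `Pipeline` API and the `openai-whisper` package, and the exact model name and token keyword can differ between pyannote.audio versions.

```python
import whisper
from pyannote.audio import Pipeline

HF_TOKEN = "hf_1234567890"  # placeholder token
AUDIO = "sample.wav"

# Stage 1: who speaks when (pyannote.audio speaker diarization).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token=HF_TOKEN
)
diarization = diarizer(AUDIO, num_speakers=2)

# Stage 2: what is said (Whisper transcription with timestamps).
segments = whisper.load_model("base").transcribe(AUDIO)["segments"]

# Stage 3: label each transcribed segment with the speaker whose
# diarized turn overlaps it the most.
def speaker_at(start: float, end: float) -> str:
    best, best_overlap = "UNKNOWN", 0.0
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = min(end, turn.end) - max(start, turn.start)
        if overlap > best_overlap:
            best, best_overlap = speaker, overlap
    return best

for seg in segments:
    print(f"{speaker_at(seg['start'], seg['end'])}: {seg['text'].strip()}")
```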
Download ffmpeg and install it.
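If you want to confirm that ffmpeg is actually visible to Python before running the tool, here's a quick check (a convenience snippet, not part of verbatim):

```python
import shutil

# Whisper decodes audio by shelling out to the ffmpeg binary,
# so it must be discoverable on your PATH.
if shutil.which("ffmpeg") is None:
    raise SystemExit("ffmpeg not found on PATH; install it first")
print("ffmpeg found at:", shutil.which("ffmpeg"))
```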
The script requires a Hugging Face API token to download the pyannote.audio models. Here's how to get one.
In order to download the pyannote.audio models you need to accept their terms and conditions. More on that here.
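Before running the tool, you can verify that your token is valid with the `huggingface_hub` package (a sketch; the token is a placeholder). Note that a valid token alone is not enough for the gated pyannote models: you must also accept their terms on each model's page while logged in.

```python
from huggingface_hub import whoami

# Queries the Hugging Face API with the given token; raises if invalid.
info = whoami(token="hf_1234567890")  # placeholder token
print("Authenticated as:", info["name"])
```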
Clone this repository using git, create a virtual environment, and install the dependencies:

```bash
git clone https://github.com/jannawro/verbatim.git
cd verbatim
python -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
```
Run:

```bash
python verbatim/main.py \
  --audio-file sample.mp3 \
  --audio-format mp3 \
  --hugging-face-token hf_1234567890 \
  --speakers 2 \
  --output transcript.txt
```
To see all options run:

```bash
python verbatim/main.py --help
```
Use Whisper model variants according to OpenAI's recommendations. The variant can be set via the `--whisper-model` flag. From initial testing, models smaller than "large" work fine for English. For satisfying results with other languages, "large" is recommended.
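To see which variants your installed `openai-whisper` package provides and get a feel for the size/accuracy trade-off, you can experiment with it directly (a sketch, independent of verbatim):

```python
import whisper

# Lists registered variants, e.g. tiny, base, small, medium, large.
print(whisper.available_models())

# Smaller variants load faster and need less memory but are less
# accurate, which matters most for non-English audio.
model = whisper.load_model("small")
print(model.transcribe("sample.mp3")["text"])
```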
Verbatim will spin up the number of threads specified by `--workers`. This number cannot be greater than the number of your CPU cores. Please note that each worker will also spin up a separate instance of Whisper for parallel processing. Use this carefully together with `--whisper-model` to make sure you have enough resources to run that many Whisper instances of the chosen size.
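As a rough illustration of why memory scales with `--workers`, here is a hypothetical sketch of the one-Whisper-per-worker pattern described above. It is not verbatim's actual code; the chunk file names and helper functions are placeholders:

```python
import os
from concurrent.futures import ProcessPoolExecutor

import whisper

_model = None  # one Whisper instance per worker process

def _init(model_name: str) -> None:
    # Runs once in each worker: load a private copy of the model,
    # so memory use grows with workers * model size.
    global _model
    _model = whisper.load_model(model_name)

def _transcribe(chunk_path: str) -> str:
    return _model.transcribe(chunk_path)["text"]

if __name__ == "__main__":
    chunks = ["chunk_0.wav", "chunk_1.wav"]  # placeholder pre-split audio
    workers = min(len(chunks), os.cpu_count() or 1)  # never exceed CPU cores
    with ProcessPoolExecutor(
        max_workers=workers, initializer=_init, initargs=("base",)
    ) as pool:
        for text in pool.map(_transcribe, chunks):
            print(text)
```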