What It Does
Want to transcribe some voice recordings quickly, without uploading them to a third-party service? whisper.cpp does that. It describes itself as a C/C++ implementation offering “high-performance inference of OpenAI’s Whisper automatic speech recognition (ASR) model”.
It’s impressive. With no training on my voice and without a huge speech model, it handled a batch of dictation files I made on my phone with very few errors. And it did the job fast, even on a PC with a 10-year-old quad-core Intel CPU with integrated graphics and no dedicated GPU.
Best of all, it’s easy to install and uncomplicated to run.
Install
Follow the instructions from the GitHub page:
# Change to the directory where you want to clone the repo
cd ~/git
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
# Download the base speech model
bash ./models/download-ggml-model.sh base.en
# Build
make
# (Optional): install another speech model
make small.en
Try it out
The repo comes with a sample file, samples/jfk.wav.
# Print the transcript to the terminal (add -otxt to also write a text file)
./main -f samples/jfk.wav
# Also write a .vtt subtitle file (text with time markers)
./main -f samples/jfk.wav -ovtt
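If you later want the plain text back out of a .vtt file, something like the following could work. It's a minimal sketch that assumes the usual WebVTT layout: a WEBVTT header line, then cue timing lines containing `-->`, each followed by the spoken text. `vtt_to_text` is a hypothetical helper name of my own, not part of whisper.cpp.

```shell
# vtt_to_text [FILE...]
# Hypothetical helper: print only the spoken text from a WebVTT file
# (reads standard input when no file is given).
vtt_to_text() {
  awk '
    FNR == 1 && /^WEBVTT/ { next }     # skip the file header
    /-->/                 { next }     # skip cue timing lines
    /^[ \t]*$/            { next }     # skip blank separator lines
    { sub(/^[ \t]+/, ""); print }      # trim leading whitespace, keep the text
  ' "$@"
}
```

Usage would look like `vtt_to_text samples/jfk.wav.vtt`, assuming the .vtt file was written next to the input under that name.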
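Transcode recordings
Whisper requires 16-bit WAV audio files. I dictate into an Android app, Easy Voice Recorder, which saves speech as .mp3 files, so I fire up ffmpeg to convert each .mp3 to .wav. In the command below, -ar 16000 resamples to 16 kHz, -ac 1 mixes down to mono, and -c:a pcm_s16le writes 16-bit PCM.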
ffmpeg -i "my_recording.mp3" -ar 16000 -ac 1 -c:a pcm_s16le test.wav
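To double-check that a converted file really is 16 kHz, mono, 16-bit, you can peek at the WAV header. This is a rough sketch under two assumptions: the file uses the canonical layout where the `fmt ` chunk sits directly after the RIFF/WAVE header (so the fields are at fixed byte offsets), and you are on a little-endian machine, since `od` decodes integers in host byte order. The function names are hypothetical.

```shell
# wav_field FILE OFFSET WIDTH
# Read one unsigned integer of WIDTH bytes (2 or 4) at byte OFFSET.
wav_field() {
  od -An -t "u$3" -j "$2" -N "$3" "$1" | tr -d ' \t\n'
}

# wav_info FILE
# Print "<channels> <sample_rate> <bits_per_sample>" from the fmt chunk.
wav_info() {
  printf '%s %s %s\n' \
    "$(wav_field "$1" 22 2)" \
    "$(wav_field "$1" 24 4)" \
    "$(wav_field "$1" 34 2)"
}

# matches_ffmpeg_settings FILE
# Succeed only if FILE is mono, 16000 Hz, 16-bit, i.e. what the
# ffmpeg flags -ac 1 -ar 16000 -c:a pcm_s16le are meant to produce.
matches_ffmpeg_settings() {
  [ "$(wav_info "$1")" = "1 16000 16" ]
}
```

For example, `wav_info test.wav` prints the three values, and `matches_ffmpeg_settings test.wav || echo "re-run ffmpeg"` flags a file that needs converting again.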
Transcribe the WAV files
You can now generate the transcriptions, specifying the path to the speech model you want to use with the -m parameter. For example, you might want to compare the quality of two models:
# Use base model (140 MB)
~/git/whisper.cpp/main -m ~/git/whisper.cpp/models/ggml-base.en.bin -f test.wav
# Use small model (460 MB)
~/git/whisper.cpp/main -m ~/git/whisper.cpp/models/ggml-small.en.bin -f test.wav
I found that the larger model ran more slowly, without much improvement in accuracy. I’ve stuck with the base model.
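If you want to repeat that comparison on your own recordings, here is a minimal sketch that runs the same WAV through several models, times each run, and keeps the transcripts side by side. It assumes the clone location used in this post and relies on the -of option (output path without extension) to give each model its own output file. `compare_models` and the `WHISPER_DIR` / `WHISPER_BIN` variables are hypothetical names of my own.

```shell
# Assumed locations, matching the clone path used in this post; override as needed.
WHISPER_DIR="${WHISPER_DIR:-$HOME/git/whisper.cpp}"
WHISPER_BIN="${WHISPER_BIN:-$WHISPER_DIR/main}"

# compare_models WAV MODEL [MODEL...]
# Hypothetical helper: transcribe WAV once per model name (e.g. base.en),
# write <wav-stem>.<model>.txt, and print the elapsed wall-clock seconds.
compare_models() {
  wav=$1
  shift
  stem=${wav%.*}
  for model in "$@"; do
    start=$(date +%s)
    if "$WHISPER_BIN" -m "$WHISPER_DIR/models/ggml-$model.bin" -f "$wav" \
         -otxt -of "$stem.$model" >/dev/null 2>&1; then
      end=$(date +%s)
      echo "$model: $((end - start))s -> $stem.$model.txt"
    else
      echo "$model: transcription failed" >&2
    fi
  done
}
```

Calling `compare_models test.wav base.en small.en` and then `diff test.base.en.txt test.small.en.txt` shows where the two transcripts disagree.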
Workflow
I make my recordings in chunks of several minutes each, to reduce the risk of dictating a long file and failing to save it. I then run simple one-line commands to transcode and transcribe the batch. Easy Voice Recorder numbers the files serially, so I loop through the range like this:
# Convert from .mp3 to .wav:
for i in 21 22 23 24 25 26 27; do ffmpeg -i "My recording $i.mp3" -ar 16000 -ac 1 -c:a pcm_s16le "My recording $i.wav"; done
# Transcribe to multiple text formats:
for i in 21 22 23 24 25 26 27; do ~/git/whisper.cpp/main -m ~/git/whisper.cpp/models/ggml-base.en.bin -f "My recording $i.wav" -otxt -ovtt -osrt; done
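The two loops above can also be folded into one re-runnable function that picks up every .mp3 in a folder instead of a hand-typed list of numbers. This is only a sketch under stated assumptions: the paths match the clone location used in this post, `transcribe_dir` and the variable names are hypothetical, and -of is used so outputs are named after the recording (for example "My recording 21.txt") rather than after the .wav file.

```shell
# Assumed locations, matching this post; override via the environment if needed.
WHISPER_DIR="${WHISPER_DIR:-$HOME/git/whisper.cpp}"
WHISPER_BIN="${WHISPER_BIN:-$WHISPER_DIR/main}"
WHISPER_MODEL="${WHISPER_MODEL:-$WHISPER_DIR/models/ggml-base.en.bin}"
FFMPEG_BIN="${FFMPEG_BIN:-ffmpeg}"

# transcribe_dir [DIR]
# Hypothetical helper: convert each .mp3 in DIR (default: current directory)
# to 16 kHz mono 16-bit WAV, then transcribe it, skipping finished work.
transcribe_dir() {
  dir=${1:-.}
  for mp3 in "$dir"/*.mp3; do
    [ -e "$mp3" ] || continue            # the glob matched nothing
    stem=${mp3%.mp3}
    if [ ! -e "$stem.wav" ]; then        # transcode only once
      "$FFMPEG_BIN" -nostdin -loglevel error -i "$mp3" \
        -ar 16000 -ac 1 -c:a pcm_s16le "$stem.wav" || continue
    fi
    if [ ! -e "$stem.txt" ]; then        # transcribe only once
      "$WHISPER_BIN" -m "$WHISPER_MODEL" -f "$stem.wav" \
        -otxt -ovtt -osrt -of "$stem" || continue
      echo "transcribed: $stem.txt"
    fi
  done
}
```

Running `transcribe_dir ~/dictation` a second time should do nothing for recordings that already have a .txt file, which makes it safe to re-run after copying new files from the phone.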