Whisper CPP Voice Transcription

What It Does

Want to transcribe some voice recordings quickly, without uploading them to a third-party service? Whisper CPP does that. It advertises itself as a “high-performance” (C/C++) “inference of OpenAI’s Whisper automatic speech recognition (ASR) model”.

It’s impressive. With zero training and without using a huge speech model, it handled a bunch of dictation files I made on my phone with very few errors. And it did the job fast, even on a PC with a 10-year-old quad-core Intel CPU and integrated graphics, with no discrete GPU installed.

Best of all, it’s easy to install and uncomplicated to run.

Install

Follow the instructions from the GitHub page:

# Change to the directory where you want to clone the repo
cd ~/git
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
# Download the base speech model
bash ./models/download-ggml-model.sh base.en
# Build
make
# (Optional): install another speech model
make small.en
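
Before moving on, it can help to confirm that the build and the model download both succeeded. The following is only a sketch: `check_install` is a made-up helper, not part of whisper.cpp, and it assumes the repo lives in `~/git/whisper.cpp` (override with the `WHISPER_DIR` variable) and that the build produced a binary called `main`, as in the commands above.

```shell
# Assumed clone location; adjust to wherever you cloned the repo.
WHISPER_DIR="${WHISPER_DIR:-$HOME/git/whisper.cpp}"

# Hypothetical helper: report whether the binary and a given model are in place.
# Returns 0 if both exist, 1 otherwise.
check_install() {
  model="$WHISPER_DIR/models/ggml-$1.bin"
  status=0
  [ -x "$WHISPER_DIR/main" ] || { echo "missing binary: $WHISPER_DIR/main"; status=1; }
  [ -f "$model" ] || { echo "missing model: $model"; status=1; }
  return "$status"
}

# Usage: check_install base.en && echo "ready to transcribe"
```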

Try it out

The repo comes with a sample file, samples/jfk.wav.

# Default: print the transcript to the terminal (add -otxt to also save a .txt file)
./main -f samples/jfk.wav
# Also save a .vtt file (transcript with time markers)
./main -f samples/jfk.wav -ovtt

Transcode recordings

Whisper CPP requires 16-bit WAV audio files. I dictate into an Android app, Easy Voice Recorder, which saves speech as .mp3 files, so I use ffmpeg to convert each .mp3 to a 16 kHz, mono, 16-bit .wav:

ffmpeg -i "my_recording.mp3" -ar 16000 -ac 1 -c:a pcm_s16le test.wav
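
To avoid retyping the output name for each file, the conversion can be wrapped in a small function. This is a minimal sketch rather than anything whisper.cpp provides: the names `wav_name` and `to_wav` are invented for illustration, and it assumes ffmpeg is on the PATH and that the input names end in .mp3.

```shell
# Hypothetical helper: derive the .wav name from an .mp3 name
# by stripping a trailing ".mp3" and appending ".wav".
wav_name() {
  printf '%s\n' "${1%.mp3}.wav"
}

# Convert one recording to 16 kHz, mono, 16-bit PCM WAV,
# using the same ffmpeg options as the command above.
to_wav() {
  ffmpeg -i "$1" -ar 16000 -ac 1 -c:a pcm_s16le "$(wav_name "$1")"
}

# Usage: to_wav "my_recording.mp3"   (writes my_recording.wav)
```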

Transcribe the WAV files

You can now generate the transcriptions, specifying the path to the speech model you want to use (-m parameter). For example, you might want to compare the quality of two models:

# Use base model (about 140 MB)
~/git/whisper.cpp/main -m ~/git/whisper.cpp/models/ggml-base.en.bin -f test.wav
# Use small model (about 460 MB)
~/git/whisper.cpp/main -m ~/git/whisper.cpp/models/ggml-small.en.bin -f test.wav

I found that the larger model ran more slowly without much improvement in accuracy, so I’ve stuck with the base model.
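
If you want to repeat that comparison on your own recordings, a rough timing loop is enough. The sketch below rests on several assumptions: `model_path` and `compare_models` are made-up names, the repo is assumed to be in `~/git/whisper.cpp` (override with `WHISPER_DIR`), the transcript is assumed to arrive on standard output, and the timing is whole wall-clock seconds from `date +%s`, so it is only a coarse measure.

```shell
# Assumed clone location; adjust to wherever you cloned the repo.
WHISPER_DIR="${WHISPER_DIR:-$HOME/git/whisper.cpp}"

# Hypothetical helper: build the path to a downloaded model,
# e.g. base.en -> $WHISPER_DIR/models/ggml-base.en.bin
model_path() {
  printf '%s/models/ggml-%s.bin\n' "$WHISPER_DIR" "$1"
}

# Transcribe one WAV file with each named model, save each transcript
# to its own file, and report the elapsed wall-clock seconds.
compare_models() {
  wav="$1"
  shift
  for name in "$@"; do
    start=$(date +%s)
    "$WHISPER_DIR/main" -m "$(model_path "$name")" -f "$wav" > "$wav.$name.txt"
    end=$(date +%s)
    echo "$name: $((end - start)) s, transcript in $wav.$name.txt"
  done
}

# Usage: compare_models test.wav base.en small.en
```

Comparing the two saved transcripts side by side (for example with `diff`) shows whether the larger model actually fixes any errors in your own dictation.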

Workflow

I make my recordings in chunks of several minutes each, to reduce the risk of dictating a long file and failing to save it. I then run simple one-line commands to transcode and transcribe the batch. Easy Voice Recorder numbers the files serially. I loop through the range like this:

# Convert from .mp3 to .wav:
for i in 21 22 23 24 25 26 27; do ffmpeg -i "My recording $i.mp3" -ar 16000 -ac 1 -c:a pcm_s16le "My recording $i.wav"; done

# Transcribe to multiple text formats:
for i in 21 22 23 24 25 26 27; do ~/git/whisper.cpp/main -m ~/git/whisper.cpp/models/ggml-base.en.bin -f "My recording $i.wav" -otxt -ovtt -osrt; done
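
When the batch grows, typing every number gets tedious. The two loops can be folded into one function that takes the first and last recording number. Treat this as a sketch under assumptions: `recording_names` and `batch_transcribe` are invented names, the files are assumed to follow the "My recording N.mp3" pattern shown above, and the repo is assumed to be in `~/git/whisper.cpp` (override with `WHISPER_DIR`).

```shell
# Assumed clone location; adjust to wherever you cloned the repo.
WHISPER_DIR="${WHISPER_DIR:-$HOME/git/whisper.cpp}"

# Hypothetical helper: print one base name per recording number
# in the inclusive range first..last.
recording_names() {
  i="$1"
  while [ "$i" -le "$2" ]; do
    printf 'My recording %s\n' "$i"
    i=$((i + 1))
  done
}

# Transcode, then transcribe, every recording in the range.
# Missing files are skipped with a warning on stderr.
batch_transcribe() {
  recording_names "$1" "$2" | while IFS= read -r name; do
    [ -f "$name.mp3" ] || { echo "skipping $name.mp3 (not found)" >&2; continue; }
    # -nostdin and </dev/null keep the tools from consuming the list of names
    ffmpeg -nostdin -i "$name.mp3" -ar 16000 -ac 1 -c:a pcm_s16le "$name.wav" &&
      "$WHISPER_DIR/main" -m "$WHISPER_DIR/models/ggml-base.en.bin" \
        -f "$name.wav" -otxt -ovtt -osrt </dev/null
  done
}

# Usage: batch_transcribe 21 27
```

Chaining the two steps with `&&` means a recording is only transcribed if its conversion succeeded.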