LinuxCommandLibrary

deepspeech

TLDR

Transcribe an audio file

$ deepspeech --model [model.pbmm] --audio [audio.wav]
Transcribe with scorer (language model)
$ deepspeech --model [model.pbmm] --scorer [scorer.scorer] --audio [audio.wav]
Transcribe with extended output
$ deepspeech --model [model.pbmm] --audio [audio.wav] --extended
Transcribe using TFLite model
$ deepspeech --model [model.tflite] --audio [audio.wav]
Stream audio from microphone (with Python)
$ python -c "import deepspeech; ..."

SYNOPSIS

deepspeech --model model --audio audio [options]

DESCRIPTION

DeepSpeech is an open-source speech-to-text engine based on deep learning. It uses an end-to-end neural network architecture to convert audio into text transcriptions.
The system requires a trained model and optionally an external scorer (language model) for improved accuracy. Pre-trained English models are available, and the toolkit supports training custom models for other languages or domains.
Audio input must be 16kHz, 16-bit, mono WAV format. The tool supports both batch transcription of files and real-time streaming transcription through its API.
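A quick way to verify that a recording meets these requirements is Python's standard wave module; the file name below is only a placeholder:

import wave

# Placeholder file name; point this at your own recording
with wave.open('audio.wav', 'rb') as w:
    assert w.getframerate() == 16000, 'expected a 16 kHz sample rate'
    assert w.getsampwidth() == 2, 'expected 16-bit samples'
    assert w.getnchannels() == 1, 'expected mono audio'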

PARAMETERS

--model file
Path to the model file (.pbmm or .tflite).
--scorer file
Path to external scorer/language model.
--audio file
Audio file to transcribe (16kHz, 16-bit, mono WAV).
--extended
Output word timing and confidence.
--json
Output results as JSON.
--candidate_transcripts n
Number of candidate transcripts to include in the JSON output.
--hot_words words
Comma-separated list of word:boost pairs used to bias recognition toward (or away from) specific words.
--version
Display version information.
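
Most of these flags have counterparts in the Python API. The sketch below assumes a 0.9.x release of the deepspeech package and an existing 16-bit audio buffer (prepared as in the PYTHON API section below); addHotWord and sttWithMetadata are the API equivalents of --hot_words and --extended/--candidate_transcripts:

import deepspeech

model = deepspeech.Model('model.pbmm')

# Bias decoding toward a specific word (counterpart of --hot_words)
model.addHotWord('deepspeech', 10.0)

# Request metadata with several candidate transcripts
# (counterpart of --extended / --candidate_transcripts);
# 'audio' is a 16-bit integer buffer as prepared in the PYTHON API section
metadata = model.sttWithMetadata(audio, num_results=3)
for transcript in metadata.transcripts:
    text = ''.join(token.text for token in transcript.tokens)
    print(transcript.confidence, text)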

PYTHON API

import wave
import deepspeech
import numpy as np

# Load the acoustic model and attach the external scorer (language model)
model = deepspeech.Model('model.pbmm')
model.enableExternalScorer('scorer.scorer')

# Read the 16 kHz, 16-bit, mono WAV file and convert it to a 16-bit integer buffer
with wave.open('audio.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

# Run batch speech-to-text on the whole buffer
text = model.stt(audio)
print(text)
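
For live input such as a microphone, the streaming API decodes audio incrementally rather than in one batch. A minimal sketch, assuming the same model and 16-bit buffer as above (the fixed-size chunking only simulates what a capture callback would deliver):

import wave
import deepspeech
import numpy as np

model = deepspeech.Model('model.pbmm')

with wave.open('audio.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

stream = model.createStream()
chunk = 8000  # half a second of 16 kHz audio per chunk (illustrative size)
for start in range(0, len(audio), chunk):
    # Feed each chunk as a microphone callback would, printing partial results
    stream.feedAudioContent(audio[start:start + chunk])
    print('partial:', stream.intermediateDecode())

print('final:', stream.finishStream())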

CAVEATS

Accuracy depends on audio quality and acoustic similarity to training data. Models are large (hundreds of MB). GPU acceleration requires specific TensorFlow builds. Project development has slowed; consider alternatives like Whisper for new projects.

HISTORY

DeepSpeech was developed by Mozilla starting in 2017, alongside the Common Voice dataset, as part of an effort to build open-source voice technology. Based on Baidu's Deep Speech research, it used recurrent neural networks for end-to-end speech recognition. Mozilla wound down active development following its 2020 layoffs, and the project was forked and continued by the community as Coqui STT.

SEE ALSO
