deepspeech
TLDR
Transcribe an audio file
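deepspeech --model model.pbmm --scorer scorer.scorer --audio audio.wav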
SYNOPSIS
deepspeech --model model --audio audio [options]
DESCRIPTION
DeepSpeech is an open-source speech-to-text engine based on deep learning. It uses an end-to-end neural network architecture to convert audio into text transcriptions.
The system requires a trained model and optionally an external scorer (language model) for improved accuracy. Pre-trained English models are available, and the toolkit supports training custom models for other languages or domains.
Audio input must be 16kHz, 16-bit, mono WAV format. The tool supports both batch transcription of files and real-time streaming transcription through its API.
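To try the pre-trained English models mentioned above, the package and release files below are one common starting point; the version number and release URLs are assumptions based on the upstream v0.9.3 release and should be checked against the project's releases page.
pip3 install deepspeech
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer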
PARAMETERS
--model file
Path to the model file (.pbmm or .tflite).
--scorer file
Path to external scorer/language model.
--audio file
Audio file to transcribe (16kHz, 16-bit, mono WAV).
--extended
Output word timing and confidence.
--json
Output results as JSON.
--candidate_transcripts n
Number of alternative transcriptions.
--hot_words words
Boost probability of specific words.
--version
Display version information.
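A sketch of a fuller invocation combining these options; the file names are placeholders, and the comma-separated word:boost syntax for --hot_words follows the upstream 0.9.x client and may differ in other versions.
deepspeech --model model.pbmm --scorer scorer.scorer --audio audio.wav \
    --json --candidate_transcripts 3 --hot_words "activate:10.0,stop:5.0"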
PYTHON API
import wave
import numpy as np
import deepspeech

model = deepspeech.Model('model.pbmm')
model.enableExternalScorer('scorer.scorer')

# Read the 16kHz, 16-bit mono WAV as int16 samples, which is what Model.stt() expects
with wave.open('audio.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

text = model.stt(audio)
print(text)
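The streaming transcription mentioned in the description is also available from Python. A minimal sketch, assuming the same placeholder file names and feeding a WAV file in chunks in place of a live audio source:
import wave
import numpy as np
import deepspeech

model = deepspeech.Model('model.pbmm')
stream = model.createStream()

# Feed roughly half a second of audio at a time
with wave.open('audio.wav', 'rb') as w:
    chunk = w.getframerate() // 2
    while True:
        data = w.readframes(chunk)
        if not data:
            break
        stream.feedAudioContent(np.frombuffer(data, np.int16))
        # stream.intermediateDecode() returns the partial hypothesis so far

print(stream.finishStream())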
CAVEATS
Accuracy depends on audio quality and acoustic similarity to training data. Models are large (hundreds of MB). GPU acceleration requires specific TensorFlow builds. Project development has slowed; consider alternatives like Whisper for new projects.
HISTORY
DeepSpeech was developed by Mozilla starting in 2017, alongside the Common Voice dataset project, as part of an effort to build open-source voice technology. Based on Baidu's Deep Speech research, it used a recurrent neural network architecture for speech recognition. Mozilla wound down active development following its 2020 layoffs, but the project was forked and continued by the community as Coqui STT.
SEE ALSO
whisper(1), vosk(1), pocketsphinx(1)


