LinuxCommandLibrary

deepspeech

open-source speech-to-text engine

TLDR

Transcribe an audio file
$ deepspeech --model [model.pbmm] --audio [audio.wav]
Transcribe with scorer (language model)
$ deepspeech --model [model.pbmm] --scorer [scorer.scorer] --audio [audio.wav]
Transcribe with extended output
$ deepspeech --model [model.pbmm] --audio [audio.wav] --extended
Transcribe using TFLite model
$ deepspeech --model [model.tflite] --audio [audio.wav]
Set beam width for CTC decoder
$ deepspeech --model [model.pbmm] --audio [audio.wav] --beam_width [500]

SYNOPSIS

deepspeech --model model --audio audio [options]

DESCRIPTION

DeepSpeech is an open-source speech-to-text engine based on deep learning. It uses an end-to-end neural network architecture to convert audio into text transcriptions.

The system requires a trained model and, optionally, an external scorer (language model) for improved accuracy. Pre-trained English models are available, and the toolkit supports training custom models for other languages or domains.

Audio input must be 16 kHz, 16-bit, mono WAV format. The tool supports both batch transcription of files and real-time streaming transcription through its API.
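Because DeepSpeech only accepts 16 kHz, 16-bit, mono WAV input, it can help to verify a file before transcribing. A minimal standard-library sketch; the is_deepspeech_ready helper and demo.wav are illustrative names, not part of DeepSpeech:

```python
import wave

def is_deepspeech_ready(path):
    """Return True if the WAV file matches DeepSpeech's expected
    input format: 16 kHz sample rate, 16-bit samples, mono."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getsampwidth() == 2
                and w.getnchannels() == 1)

# Demo: write half a second of 16 kHz, 16-bit mono silence and check it.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit (2 bytes per sample)
    w.setframerate(16000)  # 16 kHz
    w.writeframes(b"\x00\x00" * 8000)

print(is_deepspeech_ready("demo.wav"))  # True
```

Files in other formats can be converted beforehand, e.g. with sox or ffmpeg.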

PARAMETERS

--model file
Path to the model file (.pbmm or .tflite).
--scorer file
Path to external scorer/language model.
--audio file
Audio file to transcribe (16kHz, 16-bit, mono WAV).
--extended
Output word timing and confidence.
--json
Output results as JSON.
--beam_width n
Beam width for the CTC decoder.
--lm_alpha value
Language model weight. If not specified, uses default from the scorer package.
--lm_beta value
Word insertion bonus. If not specified, uses default from the scorer package.
--candidate_transcripts n
Number of candidate transcripts to include in JSON output (default: 3).
--hot_words words
Hot-words and their probability boosts.
--version
Print version and exit.
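With --json and --candidate_transcripts, the output is structured and can be post-processed. A hedged Python sketch; the JSON shape below is an illustrative assumption based on DeepSpeech's word-level metadata (word, start time, duration per word), not a verbatim sample of the tool's output:

```python
import json

# Illustrative stand-in for deepspeech --json output: one entry per
# candidate transcript, each with a confidence score and timed words.
# Field names are an assumption, not guaranteed by the CLI.
sample = json.loads("""
{
  "transcripts": [
    {"confidence": -12.3,
     "words": [
       {"word": "hello", "start_time": 0.24, "duration": 0.42},
       {"word": "world", "start_time": 0.78, "duration": 0.36}
     ]}
  ]
}
""")

# Join the words of the best-scoring candidate into a plain transcript.
best = sample["transcripts"][0]
text = " ".join(w["word"] for w in best["words"])
print(text)  # hello world
```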

PYTHON API

import wave

import numpy as np
import deepspeech

# Load the acoustic model and attach the external scorer (language model).
model = deepspeech.Model('model.pbmm')
model.enableExternalScorer('scorer.scorer')

# stt() expects 16 kHz, 16-bit mono audio as a NumPy int16 array,
# so convert the raw frame bytes before decoding.
with wave.open('audio.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

text = model.stt(audio)
print(text)
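The streaming API consumes audio incrementally rather than as one buffer. Below is a sketch of the chunking loop using only the standard library; the DeepSpeech calls (createStream, feedAudioContent, finishStream) are shown as comments so the example runs even without the package installed, and stream_demo.wav is a synthetic file created for the demo:

```python
import wave

CHUNK_FRAMES = 1024  # frames fed to the stream per iteration

def wav_chunks(path, chunk_frames=CHUNK_FRAMES):
    """Yield successive blocks of raw PCM frames from a WAV file."""
    with wave.open(path, "rb") as w:
        while True:
            data = w.readframes(chunk_frames)
            if not data:
                break
            yield data

# With deepspeech installed, the chunks would drive the streaming API
# roughly like this:
#
#   stream = model.createStream()
#   for chunk in wav_chunks("audio.wav"):
#       stream.feedAudioContent(np.frombuffer(chunk, np.int16))
#   print(stream.finishStream())

# Demo: one second of synthetic 16 kHz mono audio yields 16 chunks
# (15 full chunks of 1024 frames, plus a final 640-frame chunk).
with wave.open("stream_demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

n_chunks = sum(1 for _ in wav_chunks("stream_demo.wav"))
print(n_chunks)  # 16
```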

CAVEATS

Accuracy depends on audio quality and acoustic similarity to training data. Models are large (hundreds of MB). GPU acceleration requires specific TensorFlow builds. Project development has slowed; consider alternatives like Whisper for new projects.

HISTORY

DeepSpeech was developed by Mozilla starting in 2017, alongside its Common Voice dataset project, as part of an effort to create open-source voice technology. Based on Baidu's Deep Speech research, it used recurrent neural networks for speech recognition. Mozilla wound down active development in 2020 after company-wide layoffs, but the project was forked and continued by the community as Coqui STT.

SEE ALSO

vosk(1), whisper(1)
