LinuxCommandLibrary

whisper

AI-powered speech recognition and transcription

TLDR

Transcribe audio file
$ whisper [audio.mp3]
Transcribe with specific model
$ whisper --model [medium] [audio.mp3]
Transcribe with language hint
$ whisper --language [en] [audio.mp3]
Output specific format
$ whisper --output_format [srt] [audio.mp3]
Translate to English
$ whisper --task translate [audio.mp3]
Output to specific directory
$ whisper --output_dir [/path/to/output] [audio.mp3]
Transcribe multiple files
$ whisper [audio1.mp3] [audio2.wav]
Use GPU with float16
$ whisper --device cuda --fp16 True [audio.mp3]

SYNOPSIS

whisper [--model SIZE] [--language LANG] [--task TASK] [--output_format FMT] [options] file ...

DESCRIPTION

Whisper is OpenAI's automatic speech recognition (ASR) system. It transcribes audio in many languages and can translate speech to English.

Model sizes trade accuracy for speed: tiny runs fastest, large is most accurate. The default turbo model offers a good balance, running roughly 8x faster than large with only minor quality loss, but it is not trained for the translation task. The .en suffix (tiny.en, base.en) denotes English-only models, which perform slightly better on English audio.

Language detection is automatic but can be hinted; for non-English audio, specifying the language improves accuracy. Translation mode transcribes audio in any language into English text.

Output formats include plain text, subtitles (SRT, VTT), and JSON with timing data. Word-level timestamps enable karaoke-style highlighting.

Processing uses the GPU (CUDA) when available, which is significantly faster than CPU. The --fp16 flag enables half-precision math on compatible GPUs.

Audio preprocessing handles a wide range of formats via FFmpeg. Long files are processed in segments, with context maintained across segment boundaries.
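The JSON output carries a top-level segments list with start/end times (in seconds) and text, which makes it easy to post-process. A minimal sketch, assuming that shape, that converts segments into SRT cues (srt_time and segments_to_srt are illustrative helpers, not part of whisper):

```python
import json

def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(result):
    """Render whisper-style segments as numbered SRT cues."""
    cues = []
    for i, seg in enumerate(result["segments"], start=1):
        cues.append(
            f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)

# Hand-written sample mimicking whisper's JSON shape (normally
# loaded from the .json file whisper writes with --output_format json)
sample = {"segments": [{"start": 0.0, "end": 2.5, "text": " Hello there."}]}
print(segments_to_srt(sample))
```

In practice whisper's --output_format srt already writes SRT directly; a sketch like this is only needed when generating custom subtitle layouts from the JSON timing data.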

PARAMETERS

--model SIZE

Model size: tiny, base, small, medium, large, turbo (default: turbo). English-only variants: tiny.en, base.en, small.en, medium.en.
--language LANG
Language code (en, de, fr, etc.); omit for automatic detection.
--task TASK
Task: transcribe or translate.
--output_format FORMAT
Output format: txt, vtt, srt, tsv, json, all.
--output_dir DIR
Output directory.
--device DEVICE
Device: cpu, cuda.
--fp16 BOOL
Use float16 on GPU (True, the default) or float32 (False).
--temperature TEMP
Sampling temperature.
--best_of NUM
Number of candidates.
--beam_size NUM
Beam search size.
--word_timestamps BOOL
Include word-level timestamps.
--condition_on_previous_text BOOL
Use previous output as context.
--verbose BOOL
Show progress and transcription.
--threads NUM
CPU threads.
--model_dir DIR
Directory to save and load models (default: ~/.cache/whisper).
--initial_prompt TEXT
Optional text to provide as prompt for the first window.
--clip_timestamps TIMESTAMPS
Comma-separated start/end timestamps to process specific audio segments.
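The --clip_timestamps value is a flat comma-separated list of start,end offsets in seconds. A tiny sketch that builds such a value from (start, end) pairs (clip_arg is an illustrative helper, not part of whisper):

```python
def clip_arg(pairs):
    """Flatten (start, end) pairs in seconds into a
    comma-separated string suitable for --clip_timestamps."""
    return ",".join(str(t) for pair in pairs for t in pair)

# Process only 0-30s and 120-150s of the audio:
#   whisper --clip_timestamps "$(python build_clips.py)" audio.mp3
print(clip_arg([(0, 30), (120, 150)]))  # 0,30,120,150
```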

CAVEATS

Large models require significant VRAM (10GB+ for large). CPU inference is slow. Accuracy varies by audio quality and accent. Hallucinations possible on silent or noisy segments. No speaker diarization. Model download required on first use.

HISTORY

Whisper was released by OpenAI in September 2022. Trained on 680,000 hours of multilingual audio, it achieved near-human transcription accuracy. The open-source release enabled local deployment, spawning community projects and integrations. The large-v3-turbo model was added in September 2024, offering significantly faster inference with minimal quality loss.

SEE ALSO

ffmpeg(1), vosk(1), deepspeech(1)
