whisper
TLDR
Transcribe or translate audio files
SYNOPSIS
whisper [--model size] [--language lang] [--task task] [--output_format fmt] [options] file...
DESCRIPTION
Whisper is OpenAI's automatic speech recognition (ASR) system. It transcribes audio in many languages and can translate to English.
Model sizes trade off accuracy against speed: tiny runs fastest, large is most accurate. The .en suffix (tiny.en, base.en) denotes English-only models, which are slightly more accurate on English audio.
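For example, a fast English-only transcription might look like this (the file name is illustrative):

    # hypothetical input file; any supported audio format works
    whisper lecture.mp3 --model base.en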
Language detection is automatic but can be hinted with --language; for non-English audio, specifying the language improves accuracy. Translation mode (--task translate) renders speech in any supported language as English text.
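For instance, German speech could be translated to English text like so (the file name is illustrative):

    # hypothetical input; --task translate emits English text
    whisper interview.mp3 --language de --task translate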
Output formats include plain text, subtitles (SRT, VTT), and JSON with timing data. Word-level timestamps enable karaoke-style highlighting.
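A subtitle file with word-level timing might be produced as follows (the file name is illustrative):

    # hypothetical input; writes an .srt file with per-word timestamps
    whisper talk.mp4 --output_format srt --word_timestamps True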
Processing uses the GPU (CUDA) when available, which is significantly faster than the CPU. The --fp16 option enables half-precision math on compatible GPUs.
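Conversely, CPU-only inference with full precision can be requested explicitly (the file name is illustrative):

    # hypothetical input; FP32 is used since FP16 is typically unsupported on CPU
    whisper audio.wav --device cpu --fp16 False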
Audio preprocessing handles various formats via FFmpeg. Long files are processed in segments with context maintained across segments.
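Because decoding is delegated to FFmpeg, mixed input formats can be passed in a single run (the file names are illustrative):

    # hypothetical inputs in different container formats
    whisper podcast.m4a meeting.ogg --model small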
PARAMETERS
--model SIZE
Model size: tiny, base, small, medium, large.
--language LANG
Language code (en, de, fr, etc.); detected automatically if omitted.
--task TASK
Task: transcribe or translate.
--output_format FORMAT
Output format: txt, vtt, srt, tsv, json, all.
--output_dir DIR
Output directory.
--device DEVICE
Device: cpu, cuda.
--fp16 BOOL
Use float16 (GPU) or float32.
--temperature TEMP
Sampling temperature.
--best_of NUM
Number of candidates when sampling.
--beam_size NUM
Beam search width.
--word_timestamps BOOL
Include word-level timestamps.
--condition_on_previous_text BOOL
Use previous output as context for the next segment.
--verbose BOOL
Show progress and transcription.
--threads NUM
Number of CPU threads.
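A fuller invocation combining several of these parameters might look like this (all paths are illustrative):

    # hypothetical input and output paths
    whisper input.mp3 --model medium --language fr --output_format all \
        --output_dir ./out --device cuda --beam_size 5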
CAVEATS
Large models require significant VRAM (10 GB+ for large). CPU inference is slow. Accuracy varies with audio quality and accent. Hallucinations are possible on silent or noisy segments. There is no speaker diarization. The model is downloaded on first use.
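One common mitigation for hallucinated text on difficult audio, assuming deterministic decoding is acceptable, is to disable sampling and cross-segment conditioning (the file name is illustrative):

    # hypothetical input; greedy decoding without prior-text conditioning
    whisper noisy.wav --temperature 0 --condition_on_previous_text False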
HISTORY
Whisper was released by OpenAI in September 2022. Trained on 680,000 hours of multilingual audio, it achieved near-human transcription accuracy. The open-source release enabled local deployment, spawning community projects and integrations.
SEE ALSO
ffmpeg(1), vosk(1), deepspeech(1)


