LinuxCommandLibrary

speak-ng

Synthesize text to speech

TLDR

Speak a phrase aloud

$ speak-ng "[text]"

Speak text from stdin
$ echo "[text]" | speak-ng

Speak the contents of a [f]ile
$ speak-ng -f [path/to/file]

Speak using a specific [v]oice
$ speak-ng -v [voice] "[text]"

Speak at a specific [s]peed (default is 175) and [p]itch (default is 50)
$ speak-ng -s [speed] -p [pitch] "[text]"

Output the audio to a [w]AV file instead of speaking it directly
$ speak-ng -w [path/to/output.wav] "[text]"

List all available voices
$ speak-ng --voices

SYNOPSIS

speak-ng [options] ["text string"]

PARAMETERS

-f <file>
    Reads the text to speak from the specified file rather than from the command line or standard input.

-s <speed>
    Sets the speaking speed in approximate words per minute (WPM). The default is 175 WPM. Valid range is 80 to 450.

-p <pitch>
    Sets the base pitch of the voice, from 0 to 99. Default is 50. Higher values result in a higher voice.

-a <amplitude>
    Sets the amplitude (volume) of the speech, from 0 to 200. Default is 100.

-g <gap>
    Sets the pause between words, in units of 10 ms at the default speed. Default is 0.

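-l <length>
    Sets the line length. If not zero (the default), lines shorter than this length are treated as the end of a clause. This option does not select a language; to choose one, pass a voice name such as en or fr to -v. Use --voices to list available languages and voices.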

-w <file>
    Writes the synthesized speech to a WAV file instead of playing it directly through the audio device.

--stdout
    Writes the synthesized speech to standard output instead of playing it, suitable for piping to other audio utilities such as aplay.

-m
    Interprets Speech Synthesis Markup Language (SSML) tags in the input text. This allows more precise control over speech attributes.

-q
    Quiet mode. Produces no speech output. Mainly useful together with -x.

-x
    Writes the phoneme mnemonics for the input text to standard output. Speech is still produced unless -q is also given. Useful for debugging or linguistic analysis.

--voices[=<lang>]
    Lists all available voices and their associated languages. If a language is specified (for example --voices=en), only voices for that language are shown.

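-v <voice>[+<variant>]
    Selects the voice, which also determines the language (e.g., en for English, fr for French). A variant can be appended with +. For example, -v en+f2 selects a female voice variant for English. Available variants depend on the installation.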

-h, --help
    Displays a help message with command usage and options.

--version
    Displays the version information of the speak-ng utility. There is no short form, because -v selects a voice.
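
The numeric options above can be combined in a single invocation. The following is a minimal sketch, assuming a POSIX shell. The helper names clamp and speak_opts are hypothetical (they are not part of speak-ng); they keep speed, pitch, and amplitude inside the ranges listed above before building the option string.

```shell
#!/bin/sh
# clamp VALUE MIN MAX: print VALUE limited to the range [MIN, MAX].
clamp() {
    if [ "$1" -lt "$2" ]; then
        echo "$2"
    elif [ "$1" -gt "$3" ]; then
        echo "$3"
    else
        echo "$1"
    fi
}

# speak_opts [SPEED] [PITCH] [AMPLITUDE]: print an option string for speak-ng.
# Hypothetical helper; defaults and ranges are the ones documented above.
speak_opts() {
    speed=$(clamp "${1:-175}" 80 450)
    pitch=$(clamp "${2:-50}" 0 99)
    amp=$(clamp "${3:-100}" 0 200)
    echo "-s $speed -p $pitch -a $amp"
}

opts=$(speak_opts 500 60 120)
echo "speak-ng $opts \"Hello\""   # prints: speak-ng -s 450 -p 60 -a 120 "Hello"

# Run only when the binary is installed. $opts is intentionally unquoted
# so the shell splits it into separate arguments.
if command -v speak-ng >/dev/null 2>&1; then
    speak-ng $opts "Hello" || echo "speak-ng failed (no audio device?)" >&2
fi
```

Clamping in the wrapper is a design choice for scripts that take user-supplied numbers; calling speak-ng directly with in-range values works just as well.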

DESCRIPTION

speak-ng, often installed as an alias or a direct frontend for the espeak-ng utility, is a compact open-source text-to-speech (TTS) synthesizer for Linux and other operating systems. It converts written text into spoken audio, which makes it useful for accessibility, scripting, and other command-line work.

Unlike neural network-based TTS systems, speak-ng uses formant synthesis, which produces a distinctive voice often described as 'robotic' or 'computerized'. In exchange, it is efficient, has a small footprint, and supports over 100 languages. Users can adjust the speaking speed, pitch, amplitude (volume), and word gap, and can select different voice variants (e.g., male, female, child).

The command reads text from standard input, from a text file given with -f, or from a string on the command line. Output can go to the system's audio device for immediate playback or to a WAV file for later use. It can also interpret Speech Synthesis Markup Language (SSML). Because synthesis is fast and needs few resources, speak-ng is well suited to automation tasks and scripts.
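
The input and output choices described here can be wrapped in a small shell function. This is a sketch under stated assumptions: say_to_wav is a hypothetical helper name, and it relies only on the -f and -w options documented above plus the stdin behavior shown in the TLDR examples.

```shell
#!/bin/sh
# say_to_wav OUT [FILE]: synthesize FILE into the WAV file OUT.
# When FILE is omitted, the text is read from standard input.
# Hypothetical helper; returns 127 when speak-ng is not installed.
say_to_wav() {
    out=$1
    if ! command -v speak-ng >/dev/null 2>&1; then
        echo "speak-ng not found" >&2
        return 127
    fi
    if [ -n "${2:-}" ]; then
        speak-ng -f "$2" -w "$out"   # text comes from a file
    else
        speak-ng -w "$out"           # no text argument: read standard input
    fi
}
```

For example, say_to_wav notes.wav notes.txt converts a file, and echo "Build finished" | say_to_wav build.wav converts piped text.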

CAVEATS

Due to its use of formant synthesis, the speech quality produced by speak-ng can sound robotic or artificial compared to modern neural network-based TTS engines.

While highly configurable, achieving perfectly natural-sounding speech across all languages can be challenging.

Installation typically requires the espeak-ng package and its associated language data, which might not be installed by default on all systems.

SSML SUPPORT

speak-ng can interpret Speech Synthesis Markup Language (SSML), allowing users to embed special tags within their text to control pronunciation, emphasis, pauses, and other speech characteristics with greater precision. This is enabled using the -m option.
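
As an illustration, the sketch below builds a short SSML string with a pause and passes it to speak-ng with -m. The helper name ssml_with_pause is hypothetical, and the example assumes the standard SSML speak and break elements are honored by the installed version.

```shell
#!/bin/sh
# ssml_with_pause FIRST MS SECOND: print an SSML document that speaks FIRST,
# pauses for MS milliseconds, then speaks SECOND. Hypothetical helper name.
# Text is inserted as-is; escape &, < and > yourself if they can occur.
ssml_with_pause() {
    printf '<speak>%s<break time="%sms"/>%s</speak>\n' "$1" "$2" "$3"
}

ssml=$(ssml_with_pause "Backup complete." 500 "Starting upload.")
echo "$ssml"

# -m tells speak-ng to interpret the markup. The guard lets the script
# run on machines where the binary is not installed.
if command -v speak-ng >/dev/null 2>&1; then
    speak-ng -m "$ssml" || echo "speak-ng failed (no audio device?)" >&2
fi
```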

PHONEME OUTPUT

In addition to synthesizing audible speech, speak-ng can write the phonemes of the input text to standard output with the -x option; add -q to suppress the audio. This is useful for linguistic analysis, debugging pronunciation, or feeding applications that need a phonemic representation rather than audio output.
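
A minimal sketch of capturing phonemes from a script follows. The function name phonemes is hypothetical. It combines -x with -q, which espeak-ng's help text describes as producing no speech, so only the phoneme mnemonics are printed.

```shell
#!/bin/sh
# phonemes TEXT: print the phoneme mnemonics for TEXT without playing audio.
# Hypothetical helper; returns 127 when speak-ng is not installed.
phonemes() {
    if ! command -v speak-ng >/dev/null 2>&1; then
        echo "speak-ng not found" >&2
        return 127
    fi
    speak-ng -q -x "$1"
}

phonemes "hello world" || echo "phoneme output unavailable" >&2
```

The exact mnemonics depend on the selected voice and the installed language data, so no sample output is shown here.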

HISTORY

The speak-ng utility is a part of, or often an alias for, espeak-ng. eSpeak-NG (Next Generation) is a community-driven fork of and successor to the original eSpeak speech synthesizer, which was initially developed by Jonathan Duddington. The 'NG' in its name marks it as a continuation project focused on improving language data, fixing bugs, and adding new features. Its development aims to maintain a compact, open-source TTS solution that supports a wide range of languages on various platforms.

SEE ALSO

espeak-ng(1), aplay(1), cat(1), festival(1)
