LinuxCommandLibrary

espeak

Convert text to speech

TLDR

Speak a phrase aloud

$ espeak "I like to ride my bike."

Speak a file aloud
$ espeak -f [path/to/file]

Save output to a WAV audio file, rather than speaking it directly
$ espeak -w [filename.wav] "It's GNU plus Linux"

Use a different voice
$ espeak -v [voice]

SYNOPSIS

espeak [options] [text]
espeak -f <file> [options]

PARAMETERS

-h, --help
    Displays help information and exits.

-v <voice>
    Selects the voice to use for speech synthesis. Examples: 'en', 'en-us', 'en+f3' (a voice name followed by '+' and a variant).

-s <wpm>
    Sets the speaking speed in words per minute. Range: 80 to 450; the default is 175.

-p <0-99>
    Adjusts the base pitch of the voice. Range: 0 (low) to 99 (high); the default is 50.

-a <0-200>
    Sets the amplitude (volume) of the voice. Range: 0 (silent) to 200 (loudest); the default is 100.

-g <gap>
    Inserts a pause between words, specified in units of 10 milliseconds at the default speed.

-k <number>
    Controls how capital letters are indicated. 1=play a sound, 2=speak the word "capitals"; higher values raise the pitch instead (try -k20).

-l <length>
    Sets a line length. If not zero (the default), lines shorter than this length are treated as the end of a clause.

-q
    Suppresses speech output. Useful when only the phoneme output from -x, -X, or --ipa is wanted.

-w <file.wav>
    Writes the synthesized speech to a WAV audio file instead of playing it.

-z
    Omits the final sentence pause at the end of the text.

--stdout
    Writes the synthesized speech to standard output (stdout), typically for piping to another program.

--stdin
    Reads input text from standard input (stdin) until EOF.

-f <file>
    Reads input text from the specified text file.

--ipa
    Outputs the phonetic transcription of the text in IPA (International Phonetic Alphabet) format.

--pho
    Writes MBROLA phoneme data (.pho) to stdout, or to the file given with --phonout.

--version
    Displays version information about eSpeak.

--voices[=<language>]
    Lists the available voices. If a language code is given (for example '--voices=en'), only voices for that language are listed. Use '--voices=mb' for MBROLA voices.

-m
    Interprets input text as SSML (Speech Synthesis Markup Language).

-x
    Writes phoneme translations to stdout, showing the phonemes for each word.

-X
    Writes phoneme mnemonics to stdout together with a trace of how each word was translated.

-b <encoding>
    Sets the input text encoding: 1=UTF-8, 2=8-bit, 4=16-bit.

-d <device>
    Uses the specified audio output device instead of the default.

DESCRIPTION

espeak is a software speech synthesizer for Linux and other operating systems. It converts text into spoken audio, making it a valuable tool for accessibility, scripting, and embedded applications.

It uses formant synthesis, which keeps the program small and efficient enough for systems with limited resources. The resulting speech is clear and intelligible, although it can sound less natural than that of concatenative or neural network-based synthesizers.

espeak supports a wide range of languages (the espeak-ng fork lists more than 100 languages and accents) and lets users control speed, pitch, volume, and the voice used. It can send speech directly to an audio device or save it as a WAV file. It can also interpret SSML (Speech Synthesis Markup Language) and output phonetic transcriptions.
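
As a rough sketch of how these controls combine in one invocation, the helpers below wrap two common cases. The function names `say_slowly` and `say_to_wav`, the voice `en-us`, and the numeric values are illustrative assumptions, not defaults or recommendations.

```shell
# Hypothetical helper: speak the given text slowly in a chosen voice.
# -v selects the voice, -s the speed in words per minute,
# -p the pitch (0-99) and -a the amplitude (0-200).
say_slowly() {
    espeak -v en-us -s 120 -p 40 -a 150 "$1"
}

# Hypothetical helper: render the text to a WAV file instead of playing it.
# $1 = output file name, $2 = text to speak
say_to_wav() {
    espeak -v en-us -s 120 -w "$1" "$2"
}
```

For example, `say_slowly "Hello there"` would speak the phrase, and `say_to_wav hello.wav "Hello there"` would write it to hello.wav.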

CAVEATS

Because it uses formant synthesis, espeak can sound robotic compared with text-to-speech systems based on concatenative or deep learning methods, and voice quality varies considerably between languages.
Playback requires a working audio output device. On systems without one, write the speech to a WAV file with -w or send it to standard output with --stdout.
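
One way to handle both situations in a script is sketched below. It assumes that aplay(1) is the preferred player when present; the function name `speak_or_save` and the default file name speech.wav are hypothetical.

```shell
# Hypothetical helper: play through aplay when it is available,
# otherwise fall back to writing a WAV file.
# $1 = text to speak, $2 = fallback WAV file (default: speech.wav)
speak_or_save() {
    text=$1
    wav=${2:-speech.wav}
    if command -v aplay >/dev/null 2>&1; then
        # --stdout emits the audio on stdout; aplay -q plays it quietly
        espeak --stdout "$text" | aplay -q
    else
        espeak -w "$wav" "$text"
    fi
}
```

Calling `speak_or_save "Build finished" build.wav` would either play the phrase or leave it in build.wav, depending on whether aplay was found.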

SSML SUPPORT

espeak supports a subset of the W3C's SSML (Speech Synthesis Markup Language). This allows for greater control over speech output, including adding pauses, emphasis, changes in pitch, and other speech characteristics within the input text itself.
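
A minimal sketch of SSML input, assuming the `<break>` element with a `time` attribute is among the supported subset. The function name `speak_with_pause` is hypothetical.

```shell
# Hypothetical helper: speak two phrases separated by an SSML pause.
# $1 = first phrase, $2 = pause length (e.g. 500ms), $3 = second phrase
speak_with_pause() {
    # -m tells espeak to interpret the text as SSML markup
    espeak -m "<speak>$1 <break time=\"$2\"/> $3</speak>"
}
```

For example, `speak_with_pause "Ready" 500ms "go"` would insert a half-second pause between the two words.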

PHONETIC OUTPUT

Beyond generating audio, espeak can output the phonetic transcription of text using either the International Phonetic Alphabet (IPA) or its own internal phoneme representation. This feature is valuable for linguistic analysis, debugging pronunciation issues, or for applications that require programmatic access to phoneme data.
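
The two forms of transcription can be requested without producing any sound, as sketched below. The function names `phonemes` and `ipa` are hypothetical wrappers; the exact transcription printed depends on the voice and the installed version.

```shell
# Hypothetical helper: print espeak's own phoneme mnemonics, no audio.
phonemes() {
    # -q suppresses speech, -x writes phoneme mnemonics to stdout
    espeak -q -x "$1"
}

# Hypothetical helper: print the IPA transcription, no audio.
ipa() {
    espeak -q --ipa "$1"
}
```

A script could capture the result with, for example, `transcription=$(ipa "hello world")`.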

CUSTOMIZATION

Users can extend espeak's capabilities by creating or modifying pronunciation dictionaries and voice definitions. This allows for tailoring the speech output to specific vocabulary or regional accents, enhancing its flexibility for specialized applications.
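
After editing dictionary source files, the dictionary has to be recompiled. The sketch below assumes the source files for the language (for example en_rules and en_list) sit in one directory and uses the `--compile` option, which compiles from the current directory. The function name `compile_dict` is hypothetical, and the command may need write permission to the espeak data directory.

```shell
# Hypothetical helper: recompile the pronunciation dictionary for a language.
# $1 = directory holding the dictionary source files, $2 = voice/language name
compile_dict() {
    # Run in a subshell so the caller's working directory is unchanged
    ( cd "$1" && espeak --compile="$2" )
}
```

For example, `compile_dict ./dictsource en` would rebuild the English dictionary from the files in ./dictsource.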

HISTORY

espeak was developed by Jonathan Duddington and first released around 2006. Its design prioritized compactness and efficiency, making it highly suitable for resource-constrained environments, such as embedded systems, and for integration into assistive technologies like screen readers (e.g., Orca). The project's focus on a highly efficient formant synthesis engine distinguished it. While espeak is still widely used, development has largely continued under the espeak-ng (Next Generation) fork, which offers ongoing updates, bug fixes, and additional features.

SEE ALSO

aplay(1), play(1), festival(1), flite(1), spd-say(1)
