espeak
Convert text to speech
TLDR
Speak a phrase aloud
Speak a file aloud
Save output to a WAV audio file, rather than speaking it directly
Use a different voice
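The four tasks above can be sketched as small shell wrappers. The function names are illustrative and not part of espeak; only the espeak invocations inside them use flags described in the PARAMETERS section.

```shell
# Speak a phrase aloud.
say_phrase() { espeak "$1"; }

# Speak the contents of a text file aloud.
say_file() { espeak -f "$1"; }

# Save the speech for a phrase ($1) to a WAV file ($2) instead of playing it.
say_to_wav() { espeak -w "$2" "$1"; }

# Speak a phrase ($2) with a different voice ($1), such as en-us.
say_with_voice() { espeak -v "$1" "$2"; }
```

For example, say_to_wav "Hello, world" hello.wav should leave a file named hello.wav in the current directory, assuming espeak is installed.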
SYNOPSIS
espeak [options] [text]
espeak -f <file> [options]
PARAMETERS
-h, --help
Displays help information and exits.
-v <voice>
Selects a specific voice to use for speech synthesis. Examples: 'en', 'en-us', 'en/f3'.
-s <integer>
Sets the speaking speed in words per minute. Range: 80 to 450.
-p <integer>
Adjusts the base pitch of the voice. Range: 0 (low) to 99 (high).
-a <integer>
Sets the amplitude (volume) of the voice. Range: 0 (silent) to 200 (loudest).
-g <integer>
Inserts a pause between words, specified in units of 10 milliseconds.
-k <integer>
Controls how capital letters are indicated. 1=play a sound, 2=speak the word "capitals"; higher values raise the pitch of capitalized words (try -k20).
-l <integer>
Sets the line length. If non-zero (the default is zero), lines shorter than this length are treated as the end of a clause.
-q
Suppresses audio output. Useful when only phonetic output is desired.
-w <file>
Writes the synthesized speech directly to a WAV audio file instead of playing it.
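--stdout
Writes the synthesized speech to standard output (stdout) instead of playing it, typically for piping to another program.
-z
Suppresses the final sentence pause at the end of the text.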
--stdin
Reads input text from standard input (stdin) until EOF.
-f <file>
Reads input text from the specified text file.
--ipa
Outputs the phonetic transcription of the text in IPA (International Phonetic Alphabet) format.
--pho
Writes MBROLA phoneme data (.pho) to stdout, or to the file given with --phonout.
--version
Displays version information about eSpeak.
--voices[=<language>]
Lists all available voices, optionally filtered by a specific language. Use '--voices=mb' for MBROLA voices.
-m
Interprets input text as SSML (Speech Synthesis Markup Language).
-x
Writes phoneme translations to stdout, showing the phonemes for each word.
-X
Writes phoneme mnemonics and a translation trace to stdout, showing which pronunciation rules were applied to each word.
-b <integer>
Sets the input text encoding: 1=UTF-8, 2=8-bit, 4=16-bit.
-d <device>
Selects a specific audio output device.
-L
Disables adding a space between words.
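Several of the options above are usually combined in a single call. The following is a minimal sketch; the function names and the chosen values are illustrative, not recommended settings.

```shell
# Speak text ($1) slowly, at a low pitch and reduced volume:
# 120 words per minute, pitch 30, amplitude 80, and a word gap of
# two units (-g counts in units of 10 ms).
say_tuned() { espeak -s 120 -p 30 -a 80 -g 2 "$1"; }

# List the voices available for a language code ($1), such as en.
list_voices() { espeak --voices="$1"; }
```

Running list_voices en first is a convenient way to find a voice name to pass to -v.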
DESCRIPTION
espeak is a software speech synthesizer for Linux and other operating systems. It converts text into spoken audio, which makes it useful for accessibility tools, scripting, and embedded applications.
It uses formant synthesis, which is computationally cheap and keeps the program small enough for systems with limited resources. The resulting speech is clear and intelligible, though it can sound less natural than the output of concatenative or neural network-based synthesizers.
espeak supports many languages (the espeak-ng fork lists more than 100) and lets users control the speed, pitch, volume, and voice of the speech. It can play speech directly on an audio device or save it as a WAV file. It can also interpret SSML (Speech Synthesis Markup Language) and output phonetic transcriptions.
CAVEATS
Due to its use of formant synthesis, espeak's voices can sometimes sound robotic or less natural compared to more advanced text-to-speech systems that use concatenative or deep learning methods. Voice quality and naturalness can vary significantly between different languages.
Audible output requires a working audio device. On systems without one, use -w to write a WAV file or --stdout to pipe the audio to another program.
SSML SUPPORT
espeak supports a subset of the W3C's SSML (Speech Synthesis Markup Language). This allows for greater control over speech output, including adding pauses, emphasis, changes in pitch, and other speech characteristics within the input text itself.
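As a minimal sketch of SSML input, the wrapper below builds a short document with a break element and passes it to espeak with -m. The function name is illustrative, and the example assumes the break element with a time attribute is within the subset espeak supports.

```shell
# Speak two phrases separated by an explicit pause.
# $1 = first phrase, $2 = pause length (e.g. 500ms), $3 = second phrase.
say_with_pause() {
  espeak -m "<speak>$1<break time=\"$2\"/>$3</speak>"
}
```

For example, say_with_pause "Hello" 500ms "world". Phrases that contain characters such as < or & would need XML escaping first, which this sketch does not do.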
PHONETIC OUTPUT
Beyond generating audio, espeak can output the phonetic transcription of text using either the International Phonetic Alphabet (IPA) or its own internal phoneme representation. This feature is valuable for linguistic analysis, debugging pronunciation issues, or for applications that require programmatic access to phoneme data.
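A common pattern is to combine -q with a phoneme option so that only the transcription is produced. The wrappers below are a sketch; the function names are illustrative.

```shell
# Print eSpeak phoneme mnemonics for a phrase ($1) without playing audio.
phonemes() { espeak -q -x "$1"; }

# Print the IPA transcription of a phrase ($1) without playing audio.
ipa() { espeak -q --ipa "$1"; }
```

The output of either function can be captured in a script, for example transcription=$(ipa "hello world").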
CUSTOMIZATION
Users can extend espeak's capabilities by creating or modifying pronunciation dictionaries and voice definitions. This allows for tailoring the speech output to specific vocabulary or regional accents, enhancing its flexibility for specialized applications.
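After editing dictionary source files, the dictionary has to be recompiled before the changes take effect. The sketch below uses espeak's --compile option, which is not listed in the PARAMETERS section; the function name and the directory argument are illustrative, and the layout of the source directory is an assumption that depends on the installation.

```shell
# Rebuild the pronunciation dictionary for a language.
# $1 = directory holding the dictionary source files (assumed layout),
# $2 = language or voice name, such as en.
compile_dict() {
  ( cd "$1" && espeak --compile="$2" )
}
```

The subshell keeps the caller's working directory unchanged. Writing the compiled dictionary may require permission to modify espeak's data directory.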
HISTORY
espeak was developed by Jonathan Duddington and first released around 2006. Its design prioritized compactness and efficiency, making it highly suitable for resource-constrained environments, such as embedded systems, and for integration into assistive technologies like screen readers (e.g., Orca). The project's focus on a highly efficient formant synthesis engine distinguished it. While espeak is still widely used, development has largely continued under the espeak-ng (Next Generation) fork, which offers ongoing updates, bug fixes, and additional features.