LinuxCommandLibrary

tesseract

Open-source optical character recognition engine

TLDR

Extract text from image

$ tesseract [image.png] [output]
Extract to stdout
$ tesseract [image.png] stdout
Specify language
$ tesseract -l [deu] [image.png] [output]
Multiple languages
$ tesseract -l [eng+fra] [image.png] [output]
Output as PDF
$ tesseract [image.png] [output] pdf
Output as hOCR (HTML with coordinates)
$ tesseract [image.png] [output] hocr
Output as TSV
$ tesseract [image.png] [output] tsv
List available languages
$ tesseract --list-langs

SYNOPSIS

tesseract imagename outputbase [-l lang] [--psm mode] [--oem mode] [configfiles]

DESCRIPTION

Tesseract is an open-source OCR (Optical Character Recognition) engine. It extracts text from images, supporting over 100 languages.
The LSTM neural network engine (default) provides better accuracy than the legacy engine for most text. Engine mode selection (--oem) enables switching or combining engines.
Page segmentation modes (--psm) tell Tesseract what to expect: single character, word, line, block, or full page. Correct mode selection improves accuracy significantly.
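As a minimal sketch of choosing a segmentation mode, the helper below maps a coarse layout name to a --psm value and passes it to Tesseract. The function names and layout keywords are hypothetical; the numbers are the ones listed by tesseract --help-psm.

```shell
# Map a coarse layout description to a --psm value (hypothetical helper).
psm_for_layout() {
  case "$1" in
    page)  echo 3 ;;   # fully automatic page segmentation (the default)
    block) echo 6 ;;   # a single uniform block of text
    line)  echo 7 ;;   # a single text line
    word)  echo 8 ;;   # a single word
    char)  echo 10 ;;  # a single character
    *)     echo "unknown layout: $1" >&2; return 1 ;;
  esac
}

# usage: ocr_with_layout LAYOUT IMAGE
# Prints the recognized text to stdout.
ocr_with_layout() {
  psm=$(psm_for_layout "$1") || return 1
  tesseract "$2" stdout --psm "$psm"
}
```

For a cropped image that holds one line of text, `ocr_with_layout line crop.png` would run `tesseract crop.png stdout --psm 7`.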
Output formats include plain text, searchable PDF (text layer over image), hOCR (HTML with bounding boxes), TSV (detailed per-word data), and ALTO (XML archival format).
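The TSV output lends itself to post-processing. The sketch below keeps only words at or above a confidence threshold. It assumes the usual 12-column TSV layout (level in column 1, conf in column 11, text in column 12, word rows at level 5); the function name is hypothetical.

```shell
# usage: tesseract IMAGE stdout tsv | words_above_conf MIN_CONF
# Skips the header row, keeps word-level rows (level 5), and prints the
# text of each word whose confidence is at least MIN_CONF.
words_above_conf() {
  awk -F '\t' -v min="$1" \
    'NR > 1 && $1 == 5 && $11 + 0 >= min + 0 { print $12 }'
}
```

A threshold around 60 is only a starting point; a suitable value depends on the material being scanned.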
Image quality greatly affects results. Results are best with high resolution (300+ DPI), good contrast, straight alignment, and minimal noise. Preprocessing with ImageMagick or a similar tool can help.
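One possible preprocessing pass with ImageMagick is sketched below: convert to grayscale, stretch the contrast, and straighten a slightly rotated scan before running OCR. The function names are hypothetical, and the chosen options are an assumption about what helps a typical scan, not a recommended recipe.

```shell
# Pick the ImageMagick entry point: `magick` (v7) or the older `convert` (v6).
im_cmd() {
  if command -v magick >/dev/null 2>&1; then echo magick; else echo convert; fi
}

# usage: preprocess_and_ocr INPUT_IMAGE CLEANED_IMAGE
# Writes a cleaned copy of the image, then prints the OCR text to stdout.
preprocess_and_ocr() {
  # -colorspace Gray: drop color; -normalize: stretch contrast;
  # -deskew 40%: straighten small rotation angles
  "$(im_cmd)" "$1" -colorspace Gray -normalize -deskew 40% "$2" || return 1
  tesseract "$2" stdout
}
```

Comparing the output for the original and the cleaned image shows whether the preprocessing actually helps for a given source.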
Language data files (traineddata) must be installed separately. Custom training can create models for specific fonts, historical documents, or specialized text.

PARAMETERS

-l LANG

Language(s) for OCR (eng, deu, fra, etc.); join several with +, as in eng+fra.
--psm NUM
Page segmentation mode (0-13).
--oem NUM
OCR engine mode (0=legacy only, 1=LSTM only, 2=legacy and LSTM combined, 3=default, based on what is available).
--dpi NUM
Override image DPI.
-c VAR=VALUE
Set config variable.
--tessdata-dir PATH
Location of language data.
--user-words FILE
User word list.
--user-patterns FILE
User patterns file.
--list-langs
List available languages.
--print-parameters
Print config parameters.
pdf
Output searchable PDF.
hocr
Output HTML with coordinates.
tsv
Output tab-separated values.
alto
Output ALTO XML.
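Several of these options are often combined. As a sketch, the hypothetical wrapper below turns each VAR=VALUE argument into its own -c option and prints the result to stdout.

```shell
# usage: ocr_with_vars IMAGE [VAR=VALUE ...]
ocr_with_vars() {
  img=$1; shift
  # Rewrite the argument list in place: each VAR=VALUE becomes "-c VAR=VALUE".
  for kv in "$@"; do
    shift
    set -- "$@" -c "$kv"
  done
  tesseract "$img" stdout "$@"
}
```

For example, `ocr_with_vars meter.png tessedit_char_whitelist=0123456789` restricts recognition to digits. Whether the whitelist is honored by the LSTM engine depends on the Tesseract version, so treat that as an assumption to verify locally.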

CONFIGURATION

TESSDATA_PREFIX

Environment variable specifying the directory containing language data files (traineddata); defaults to the tessdata directory within the Tesseract installation.
--tessdata-dir PATH
Command-line override for the language data directory location.
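Before pointing Tesseract at a data directory, it can help to check that the model file is there. The sketch below assumes the data directory holds files named LANG.traineddata directly; the function name is hypothetical.

```shell
# usage: has_traineddata LANG [DIR]
# DIR defaults to $TESSDATA_PREFIX. Succeeds if DIR/LANG.traineddata exists.
has_traineddata() {
  dir=${2:-${TESSDATA_PREFIX:-}}
  [ -n "$dir" ] && [ -f "$dir/$1.traineddata" ]
}
```

A typical use would be `has_traineddata deu "$HOME/tessdata" && tesseract --tessdata-dir "$HOME/tessdata" -l deu scan.png out`, where the directory and file names are placeholders.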

CAVEATS

Accuracy varies with image quality. Complex layouts may not segment correctly. Handwriting recognition is limited. Custom training requires significant effort. Language data files can be large. Processing speed depends on image size and complexity.

HISTORY

Tesseract was developed at HP Labs from 1985 to 1994, then released as open source in 2005. Google sponsored development from 2006, and version 4.0 (released in 2018) added the LSTM neural network engine. It remains one of the most widely used open-source OCR engines, integrated into many applications and workflows.
