tesseract

Perform optical character recognition (OCR)

TLDR

Recognize text in an image and save it to output.txt (the .txt extension is added automatically)
$ tesseract [image.png] [output]

Use a specific language (default is eng, English) given by its ISO 639-2 code (e.g. deu for German)
$ tesseract -l deu [image.png] [output]

List the ISO 639-2 codes of the installed languages
$ tesseract --list-langs

Specify a custom page segmentation mode (default is 3)
$ tesseract --psm [0_to_13] [image.png] [output]

List page segmentation modes and their descriptions
$ tesseract --help-psm

SYNOPSIS

tesseract imagename outputbase [options...] [configfile...]
tesseract --help | --version

imagename: The path to the input image file (e.g., JPEG, PNG, TIFF).
outputbase: The base name for the output file(s), without extension. Use stdout to print the result to standard output instead.

PARAMETERS

-l LANG
    Specify the language(s) to use for OCR (e.g., eng for English, eng+fra for multiple).

--oem OEM_MODE
    Set the OCR Engine Mode (0=Legacy only, 1=LSTM only, 2=Legacy+LSTM, 3=Default, based on what is available).

--psm PSM_MODE
    Set the Page Segmentation Mode, which tells Tesseract what page layout to expect (e.g., 3=fully automatic, the default; 6=single uniform block of text; 7=single text line). Run tesseract --help-psm for the full list.

-c VAR=VALUE
    Set a Tesseract configuration variable for fine-tuning.

--dpi DPI_VALUE
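    Specify the resolution of the input image in DPI (a typical value is 300). Without this option the resolution is read from the image metadata, and guessed if the metadata is missing.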

--tessdata-dir DIR
    Specify the directory where Tesseract language data files are located.

--help
    Show a help message and exit.
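
As a rough sketch of how these options combine, the function below recognizes an image that contains one line of English text and prints the result to the terminal. The name ocr_line, the 300 DPI value, and the choice of preserve_interword_spaces as the -c variable are illustrative assumptions, not recommended defaults.

```shell
# Sketch: OCR a cropped single-line image and print the text to the terminal.
# ocr_line is a hypothetical helper name, not part of Tesseract.
ocr_line() {
    # $1: path to an image containing one line of text
    # "stdout" as the output base prints the result instead of writing a file.
    tesseract "$1" stdout \
        -l eng \
        --oem 1 \
        --psm 7 \
        --dpi 300 \
        -c preserve_interword_spaces=1
}
```

It would be called as ocr_line line.png, where line.png is a placeholder file name.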

DESCRIPTION

Tesseract is a free, open-source optical character recognition (OCR) engine that converts images containing text into machine-readable text. Originally developed by Hewlett-Packard in the 1980s, it was open-sourced in 2005, and Google began sponsoring its development in 2006.

It reads common image formats, including TIFF, JPEG, and PNG, and can recognize text in any language for which the corresponding language data is installed. Users specify the input image, a base name for the output file, and options that control the recognition process.

Output can be generated in several formats, such as plain text, searchable PDF, hOCR (HTML with embedded OCR data), TSV, and ALTO XML. Tesseract is widely used for digitizing historical documents, making scanned documents searchable, automating data entry, and enhancing accessibility by converting image-based content into editable text. Its command-line interface makes it suitable for scripting and integration into larger workflows.
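
For scripted use, one possible pattern is a loop that runs Tesseract once per image. The sketch below assumes PNG input and default options; ocr_dir is a hypothetical helper name.

```shell
# Sketch: OCR every PNG in a directory, writing NAME.txt next to each NAME.png.
# ocr_dir is a hypothetical helper name, not part of Tesseract.
ocr_dir() {
    dir=${1:-.}
    for img in "$dir"/*.png; do
        # Skip the unexpanded pattern when the directory has no PNG files.
        [ -e "$img" ] || continue
        # Strip the extension; Tesseract appends .txt to the output base.
        tesseract "$img" "${img%.png}" || echo "failed: $img" >&2
    done
}
```

Calling ocr_dir scans/ (a placeholder directory) would process each PNG in turn and report any file that Tesseract could not handle on standard error.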

CAVEATS

Tesseract's accuracy depends heavily on the quality of the input image: low resolution, skew, and complex backgrounds can significantly degrade results, and it performs best on clear, well-scanned documents. Language data must be installed separately for each language to be recognized. Handwriting is recognized less accurately than printed text. Tesseract is a command-line tool and has no built-in graphical user interface.
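
Because language data is installed separately, a script may want to check for a language before running OCR. A minimal sketch, assuming the language codes appear one per line in the --list-langs output; has_lang is a hypothetical helper name.

```shell
# has_lang is a hypothetical helper: it succeeds if the given language code
# appears as a whole line in the output of `tesseract --list-langs`.
has_lang() {
    # $1: language code, e.g. deu
    # 2>&1 also captures the list in case a build prints it on stderr.
    # -F: fixed string, -x: match the whole line, -q: no output.
    tesseract --list-langs 2>&1 | grep -Fxq -- "$1"
}
```

For example, has_lang deu || echo "German language data is not installed" >&2 prints a warning when the deu data is missing.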

OUTPUT FORMATS

Plain text is the default output. A different format is selected by naming its config after the output base: pdf (searchable PDF), hocr (HTML with embedded OCR data), tsv (tab-separated values), or alto (ALTO XML).
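
Several formats can be requested in one run by listing more than one config name after the output base. The sketch below assumes the txt, pdf, hocr, and tsv configs that ship with the language data are installed; ocr_all_formats is a hypothetical helper name.

```shell
# ocr_all_formats is a hypothetical helper name, not part of Tesseract.
ocr_all_formats() {
    # $1: input image, $2: output base (no extension)
    # Each trailing word names a bundled output config; this is expected to
    # produce $2.txt, $2.pdf, $2.hocr and $2.tsv in a single run.
    tesseract "$1" "$2" txt pdf hocr tsv
}
```

A call such as ocr_all_formats scan.png scan (placeholder names) avoids running recognition once per format.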

CUSTOM CONFIGURATIONS

A configuration file is a plain-text file that sets Tesseract variables, one "variable value" pair per line. Naming the file after the output base applies those settings for that run, which is convenient when the same set of -c options is reused for a particular document type.
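
A minimal sketch of this workflow follows. The two variables are real Tesseract parameters, but their values are only examples, and the helper names write_ocr_config and ocr_with_config are hypothetical. It also assumes Tesseract accepts a config given as a file path; configs shipped with the language data normally live in a configs directory under the tessdata directory.

```shell
# write_ocr_config and ocr_with_config are hypothetical helper names.
write_ocr_config() {
    # $1: path of the config file to create
    # Each line is "variable value", separated by whitespace.
    cat > "$1" <<'EOF'
tessedit_pageseg_mode 6
preserve_interword_spaces 1
EOF
}

ocr_with_config() {
    # $1: input image, $2: output base, $3: config file
    # Config file names go after the output base on the command line.
    tesseract "$1" "$2" "$3"
}
```

Typical use would be write_ocr_config ./table.cfg followed by ocr_with_config page.png page ./table.cfg, with all three file names being placeholders.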

HISTORY

Tesseract was originally developed by Hewlett-Packard between 1985 and 1995 as a proprietary OCR engine. It was open-sourced under the Apache License 2.0 in 2005, and Google took over its sponsorship and active development in 2006. Under Google the engine improved significantly, leading to Tesseract 3.0 and later Tesseract 4.0, which introduced a new, more accurate LSTM-based neural network engine. It remains one of the most widely used open-source OCR engines.

SEE ALSO

convert(1), gs(1), ocrmypdf(1)
