tesseract
Perform optical character recognition (OCR)
TLDR
Recognize text in an image and save it to output.txt (the .txt extension is added automatically)
Specify a custom language (default is English) with an ISO 639-2 code (e.g. deu = Deutsch = German)
List the ISO 639-2 codes of available languages
Specify a custom page segmentation mode (default is 3)
List page segmentation modes and their descriptions
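The five tasks above correspond to the following invocations, sketched here as Python argument lists. The file names `image.png` and `output`, the language `deu`, and the mode `6` are placeholders, and the `run` helper is a hypothetical convenience that assumes `tesseract` is on the PATH.

```python
import shutil
import subprocess

# Placeholder file names; substitute your own image and output base.
IMAGE = "image.png"
OUTPUT_BASE = "output"

# One argument list per TLDR entry, in the same order as the list above.
TLDR_COMMANDS = {
    "recognize": ["tesseract", IMAGE, OUTPUT_BASE],
    "custom_language": ["tesseract", IMAGE, OUTPUT_BASE, "-l", "deu"],
    "list_languages": ["tesseract", "--list-langs"],
    "custom_psm": ["tesseract", IMAGE, OUTPUT_BASE, "--psm", "6"],
    "list_psm": ["tesseract", "--help-psm"],
}


def run(name):
    """Run one command by name.

    Returns the exit code, or None when tesseract is not installed.
    """
    if shutil.which("tesseract") is None:
        return None
    return subprocess.run(TLDR_COMMANDS[name], check=False).returncode
```

For example, `run("list_languages")` prints the installed language codes to the terminal when Tesseract is available.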
SYNOPSIS
tesseract imagename outputbase [options...] [configfile...]
tesseract --help | --version
imagename: The path to the input image file (e.g., JPEG, PNG, TIFF).
outputbase: The base name for the output file(s) (without extension).
PARAMETERS
-l LANG
Specify the language(s) to use for OCR (e.g., eng for English, or eng+fra to combine English and French).
--oem OEM_MODE
Set the OCR Engine Mode (0=Legacy only, 1=LSTM only, 2=Legacy+LSTM, 3=Default, based on what is available).
--psm PSM_MODE
Set the Page Segmentation Mode, which tells Tesseract what page layout to assume (e.g., 3=fully automatic segmentation without orientation and script detection, 6=single uniform block of text, 7=single text line).
-c VAR=VALUE
Set a Tesseract configuration variable for fine-tuning.
--dpi DPI_VALUE
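Specify the resolution of the input image in DPI. Without this option the resolution is read from the image metadata; if the metadata does not contain it, Tesseract tries to guess it.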
--tessdata-dir DIR
Specify the directory where Tesseract language data files are located.
--help
Show a help message and exit.
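As a rough illustration of how the options above combine on one command line, the sketch below assembles an argument list from keyword arguments. The function name `build_command` and its parameter names are hypothetical; only the flags themselves come from the list above.

```python
def build_command(image, outputbase, lang=None, oem=None, psm=None,
                  dpi=None, tessdata_dir=None, variables=None):
    """Assemble a tesseract argument list from optional settings."""
    cmd = ["tesseract", image, outputbase]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    if lang is not None:
        # Several languages are joined with "+", e.g. eng+fra.
        langs = [lang] if isinstance(lang, str) else list(lang)
        cmd += ["-l", "+".join(langs)]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    # Each configuration variable gets its own -c VAR=VALUE pair.
    for name, value in (variables or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd
```

The resulting list can be passed to `subprocess.run` unchanged; building it as a list rather than a string avoids shell quoting problems with file names that contain spaces.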
DESCRIPTION
Tesseract is a free, open-source optical character recognition (OCR) engine that converts images containing text into machine-readable text. Originally developed by Hewlett-Packard in the 1980s, it was open-sourced in 2005, and Google sponsored its development from 2006 to 2018.
It reads common image formats, including TIFF, JPEG, and PNG. Tesseract can recognize text in many languages, provided the corresponding language data files are installed. Users specify the input image, a base name for the output file, and options that control the recognition process.
Output can be written as plain text, searchable PDF, hOCR (HTML with embedded OCR data), TSV, or ALTO XML. Tesseract is widely used for digitizing historical documents, making scanned documents searchable, automating data entry, and improving accessibility by converting image-based content into editable text. Its command-line interface makes it suitable for scripting and integration into larger workflows.
CAVEATS
Tesseract's accuracy depends heavily on the quality of the input image: low resolution, skew, and complex backgrounds all degrade results, and it performs best on clean, well-scanned documents. Language data files are not bundled with every installation and may need to be installed separately for each language. Handwriting is recognized less accurately than printed text. Tesseract is a command-line tool and has no built-in graphical user interface.
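Because a missing language pack is a common cause of failure, it can help to check the installed languages before running OCR. The sketch below parses the output of `tesseract --list-langs`; it assumes the output is a header line starting with "List of available languages" followed by one language code per line, which may differ between versions. The function names are hypothetical.

```python
import shutil
import subprocess


def parse_list_langs(text):
    """Extract language codes from --list-langs output (assumed layout)."""
    langs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the assumed header line.
        if not line or line.startswith("List of available languages"):
            continue
        langs.append(line)
    return langs


def installed_languages():
    """Return installed language codes, or [] if tesseract is not found."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True, check=False)
    # Assumption: some versions write the list to stderr instead of stdout.
    return parse_list_langs(result.stdout or result.stderr)


def missing_languages(wanted, available):
    """Return the codes in a 'eng+fra' style string that are not available."""
    return [code for code in wanted.split("+") if code not in available]
```

A caller might run `missing_languages("eng+deu", installed_languages())` and install whatever is reported before invoking Tesseract with `-l eng+deu`.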
OUTPUT FORMATS
Besides plain text, Tesseract can write searchable PDF, hOCR (HTML with embedded OCR data), TSV (tab-separated values), and ALTO XML. A format is selected by naming the corresponding config file (for example pdf, hocr, tsv, or alto) after the output base, and several formats can be requested in one run.
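A minimal sketch of requesting specific output formats follows. It assumes the standard config files named `txt`, `pdf`, `hocr`, `tsv`, and `alto` are present in the tessdata configs directory of the installation (ALTO support requires a sufficiently recent version); the function name and file names are placeholders.

```python
# Config names assumed to ship with a standard Tesseract installation.
KNOWN_FORMATS = ("txt", "pdf", "hocr", "tsv", "alto")


def build_format_command(image, outputbase, formats):
    """Build a command that writes one output file per requested format."""
    unknown = [fmt for fmt in formats if fmt not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unknown output format(s): {', '.join(unknown)}")
    # Config names are listed after the output base.
    return ["tesseract", image, outputbase, *formats]
```

For instance, `build_format_command("scan.png", "out", ["pdf", "hocr"])` yields a command that should produce both a searchable PDF and an hOCR file from a single recognition pass.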
CUSTOM CONFIGURATIONS
Users can write custom configuration files to tune Tesseract for specific tasks or document types. A configuration file is a plain-text file with one variable name and its value per line, and it is applied by naming it after the output base on the command line. The same variables can also be set individually with -c VAR=VALUE.
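The following sketch shows one way to generate such a configuration file and reference it on the command line. The variables `tessedit_char_whitelist` and `preserve_interword_spaces` are used only as examples, and whether a given variable takes effect depends on the Tesseract version and engine mode. The helper names and the path `digits.conf` are hypothetical.

```python
from pathlib import Path


def render_config(variables):
    """Render variables as 'name value' lines, one per line."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())


def write_config(path, variables):
    """Write the rendered configuration to a file and return its path."""
    Path(path).write_text(render_config(variables), encoding="utf-8")
    return str(path)


def build_command_with_config(image, outputbase, config_path):
    """Build a command that applies a config file after the output base."""
    return ["tesseract", image, outputbase, str(config_path)]


# Example settings: restrict output to digits and keep spacing between words.
EXAMPLE_VARIABLES = {
    "tessedit_char_whitelist": "0123456789",
    "preserve_interword_spaces": "1",
}
```

Calling `write_config("digits.conf", EXAMPLE_VARIABLES)` and passing the result to `build_command_with_config` gives a command that can be reused across a batch of similar documents.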
HISTORY
Tesseract was originally developed by Hewlett-Packard between 1985 and 1994 as a proprietary OCR engine. It was open-sourced under the Apache License 2.0 in 2005, and Google sponsored and led its development from 2006 until 2018. During that period the engine improved substantially, with Tesseract 3.0 and later Tesseract 4.0, which introduced a new, more accurate LSTM-based neural network engine. It remains one of the most widely used open-source OCR engines.