tesseract

Perform optical character recognition (OCR)

TLDR

Recognize text in an image and save it to output.txt (the .txt extension is added automatically)

$ tesseract [image.png] [output]

Specify a custom language (default is English) with an ISO 639-2 code (e.g. deu = Deutsch = German)
$ tesseract -l deu [image.png] [output]

List the ISO 639-2 codes of available languages
$ tesseract --list-langs

Specify a custom page segmentation mode (default is 3)
$ tesseract --psm [0_to_13] [image.png] [output]

List page segmentation modes and their descriptions
$ tesseract --help-psm

SYNOPSIS

tesseract imagename outputbase [options...] [configfiles...]

PARAMETERS

imagename
    Path to the input image file. Supported formats include PNG, JPEG, and TIFF. PDF is not accepted as input.

outputbase
    The base name for the output file. Tesseract appends a '.txt' extension by default. Use stdout or - to write to standard output instead.

[configfiles...]
    Optional configuration files that set Tesseract parameters to customize the OCR process. The stock config files pdf, hocr, and tsv select the corresponding output format instead of plain text.

-l lang
    Specify language(s) to use for OCR. Can specify multiple languages by joining them with '+'.

--psm mode
    Set page segmentation mode. Common modes include: 0 (orientation and script detection only), 3 (fully automatic page segmentation), 6 (assume a single uniform block of text), and 7 (treat the image as a single text line).

--oem mode
    Set OCR engine mode. 0 (Legacy engine only), 1 (Neural nets LSTM engine only), 2 (Legacy + LSTM engines), 3 (Default, based on what is available).

configfile
    Specify a configuration file to use. Multiple configuration files can be specified.
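
As a sketch of how these parameters combine, the functions below join language codes with '+' and pass explicit --oem and --psm values ahead of a trailing config file name. The names join_langs and ocr_to_pdf are hypothetical helpers, not part of Tesseract, and the example call assumes that the deu and eng language data and the stock pdf config file are installed.

```shell
# join_langs (hypothetical helper): join language codes with '+',
# the separator that -l expects for multiple languages.
join_langs() (
  IFS='+'
  printf '%s\n' "$*"
)

# ocr_to_pdf (hypothetical helper): run OCR with the LSTM engine (--oem 1),
# treating the page as one uniform block of text (--psm 6), and load the
# "pdf" config file so that outputbase.pdf is written instead of a .txt file.
# Usage: ocr_to_pdf image outputbase lang [lang...]
ocr_to_pdf() (
  image=$1
  outputbase=$2
  shift 2
  tesseract "$image" "$outputbase" -l "$(join_langs "$@")" --oem 1 --psm 6 pdf
)

# Example call (assumes scan.png exists):
#   ocr_to_pdf scan.png scan deu eng
```

The functions use subshell bodies so that IFS and the temporary variables do not leak into the calling shell.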

DESCRIPTION

Tesseract is an open-source Optical Character Recognition (OCR) engine that converts images containing text into editable, searchable text. It is primarily used from the command line, which suits batch processing, automation, and integration into larger workflows.

The tesseract command takes an input image and writes the recognized text to a file, or to standard output when outputbase is stdout or -. It reads common image formats such as PNG, JPEG, and TIFF; PDF is an output format, not an input format. Tesseract supports many languages and performs page layout analysis, orientation and script detection, and character segmentation. The same engine is available to other programs through the libtesseract C and C++ API.
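
Because the recognized text can go to standard output, tesseract combines easily with other tools. The following is a minimal sketch of two such uses: searching an image for a pattern and converting a directory of PNG files. The function names ocr_grep and ocr_dir are hypothetical.

```shell
# ocr_grep (hypothetical helper): print the lines of recognized text
# that match a pattern. Diagnostics from tesseract are discarded.
# Usage: ocr_grep pattern image
ocr_grep() {
  tesseract "$2" stdout 2>/dev/null | grep -- "$1"
}

# ocr_dir (hypothetical helper): OCR every PNG in a directory,
# writing NAME.txt next to each NAME.png.
# Usage: ocr_dir directory
ocr_dir() {
  for image in "$1"/*.png; do
    [ -e "$image" ] || continue   # skip when the glob matches nothing
    tesseract "$image" "${image%.png}"
  done
}

# Example calls (file and directory names are placeholders):
#   ocr_grep 'Invoice' scan.png
#   ocr_dir ./scans
```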

Good results often require pre-processing the input image to improve clarity and contrast. Common pre-processing steps include thresholding, noise reduction, and skew correction.
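
The pre-processing steps named above can be sketched with ImageMagick's convert (invoked as magick in ImageMagick 7). The function name preprocess is hypothetical, and the 40% and 50% values are assumed starting points to tune per image, not settings taken from the Tesseract documentation.

```shell
# preprocess (hypothetical helper): convert to grayscale, reduce noise,
# correct skew, and binarize an image before handing it to tesseract.
# Usage: preprocess input output
preprocess() {
  convert "$1" -colorspace Gray -despeckle -deskew 40% -threshold 50% "$2"
}

# Example call (file names are placeholders):
#   preprocess scan.png clean.png && tesseract clean.png output
```

Thresholding is applied last so that despeckling and deskewing still operate on grayscale data.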

CAVEATS

OCR accuracy depends on the quality of the input image, the font used, and the complexity of the layout. Pre-processing usually helps, but Tesseract may still struggle with complex layouts or unusual fonts.

PAGE SEGMENTATION MODES (--psm)

Choosing the page segmentation mode that matches the layout of the image is crucial for obtaining accurate OCR results.
The available modes are:
0: Orientation and script detection only.
1: Automatic page segmentation with OSD.
2: Automatic page segmentation, but no OSD, or OCR. (not implemented)
3: Fully automatic page segmentation, but no OSD. (Default)
4: Assume a single column of text of variable sizes.
5: Assume a single uniform block of vertically aligned text.
6: Assume a single uniform block of text.
7: Treat the image as a single text line.
8: Treat the image as a single word.
9: Treat the image as a single word in a circle.
10: Treat the image as a single character.
11: Sparse text. Find as much text as possible in no particular order.
12: Sparse text with OSD.
13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
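
When it is unclear which mode suits an image, one option is to run several and compare the results. The following is a minimal sketch under that assumption; the function name compare_psm is hypothetical.

```shell
# compare_psm (hypothetical helper): run the same image through several
# page segmentation modes and print each result under a labeled header.
# Usage: compare_psm image mode [mode...]
compare_psm() (
  image=$1
  shift
  for mode in "$@"; do
    printf '=== --psm %s ===\n' "$mode"
    tesseract "$image" stdout --psm "$mode" 2>/dev/null
  done
)

# Example call (label.png is a placeholder):
#   compare_psm label.png 3 6 7 11
```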

HISTORY

Tesseract was originally developed at Hewlett-Packard between 1985 and 1994. It was open-sourced by HP and UNLV in 2005, and Google sponsored its development from 2006 until 2018. Tesseract remains one of the most widely used OCR engines, with applications in archiving, data extraction, and accessibility.

SEE ALSO

convert(1) - ImageMagick tool for image manipulation, pdftotext(1) - PDF to text converter
