pdftotext
converts Portable Document Format files to plain text
TLDR
Extract text from a PDF to stdout
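A minimal sketch of this usage: giving `-` in place of the text-file argument sends the extracted text to standard output. The helper name `pdf_to_stdout` and the file name `document.pdf` are placeholders for illustration, not part of pdftotext.

```shell
# pdf_to_stdout is a hypothetical helper name used for illustration.
pdf_to_stdout() {
    # "-" as the text-file argument writes the text to stdout
    # instead of to a file.
    pdftotext "$1" -
}

# Example call (document.pdf is a placeholder):
#   pdf_to_stdout document.pdf
```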
SYNOPSIS
pdftotext [options] PDF-file [text-file]
DESCRIPTION
pdftotext converts Portable Document Format (PDF) files to plain text. It extracts the text content from PDF documents while optionally attempting to preserve the visual layout of the original document.
The program ships with the poppler-utils package (or xpdf-utils on older systems). It can process encrypted PDFs when given the appropriate password (-opw or -upw) and can write its output in several encodings, selected with -enc.
Common use cases include making PDF content searchable, extracting text for further processing, creating accessible versions of documents, and feeding PDF content into text analysis pipelines.
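As one way to make PDF content searchable, the extracted text can be piped into standard tools such as grep. The sketch below assumes a POSIX shell; `pdf_grep` is a hypothetical helper name, and the pattern and file name in the example call are placeholders.

```shell
# pdf_grep is a hypothetical helper: search the text of one PDF.
pdf_grep() {
    pattern=$1
    file=$2
    # -q suppresses pdftotext messages; "-" sends the text to stdout.
    # grep -i matches case-insensitively and -n prefixes each match
    # with its line number in the extracted text.
    pdftotext -q "$file" - | grep -in -- "$pattern"
}

# Example call (placeholders):
#   pdf_grep "invoice" report.pdf
```

The line numbers refer to the extracted text, not to page numbers in the PDF.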
PARAMETERS
-f number
First page to convert (default: 1)
-l number
Last page to convert (default: last page)
-layout
Maintain original physical layout of the text
-simple
Simple one-column page layout
-table
Table mode, similar to layout but optimized for tables
-lineprinter
Line printer mode with fixed-pitch font metrics
-raw
Keep strings in content stream order
-fixed number
Assume fixed-pitch font with specified character width
-enc encoding
Output text encoding (Latin1, UTF-8, etc.)
-nopgbrk
Don't insert page breaks between pages
-opw password
Owner password for encrypted PDF
-upw password
User password for encrypted PDF
-q
Quiet mode, suppress messages and errors
-v
Print version information
-h
Print usage information
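The options above can be combined on one command line. The following sketch wraps two common combinations in shell functions; the function names, file names, and password are hypothetical placeholders, and only flags listed above are used.

```shell
# extract_pages is a hypothetical helper: convert a page range to a text
# file while keeping the physical layout.
extract_pages() {
    first=$1
    last=$2
    pdf=$3
    txt=$4
    # -f/-l bound the page range, -layout keeps the physical layout,
    # -enc UTF-8 selects the output encoding.
    pdftotext -f "$first" -l "$last" -layout -enc UTF-8 "$pdf" "$txt"
}

# extract_protected is likewise hypothetical: supply the user password
# for an encrypted PDF.
extract_protected() {
    password=$1
    pdf=$2
    txt=$3
    pdftotext -upw "$password" "$pdf" "$txt"
}

# Example calls (placeholders):
#   extract_pages 2 5 report.pdf pages-2-5.txt
#   extract_protected "secret" locked.pdf locked.txt
```

A password passed as a command-line argument may be visible to other users in the process list, so this form is best kept to trusted machines.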
CAVEATS
pdftotext cannot extract text from scanned documents or image-based PDFs (use an OCR tool such as tesseract for those). Layout preservation may be imperfect for complex multi-column documents. Text set in embedded fonts that lack Unicode mappings may not extract correctly. Ligatures and special characters may not be representable in every output encoding.
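A rough heuristic for spotting scanned or image-only PDFs is to check whether pdftotext produces any non-whitespace output at all. The sketch below assumes a POSIX shell and grep; `has_text_layer` is a hypothetical helper name, and an empty result only suggests, rather than proves, that OCR is needed.

```shell
# has_text_layer is a hypothetical helper: it succeeds (exit status 0)
# if pdftotext emits at least one non-whitespace character.
has_text_layer() {
    # Form feeds inserted as page breaks count as whitespace,
    # so a PDF that yields only page breaks is treated as empty.
    pdftotext -q "$1" - | grep -q '[^[:space:]]'
}

# Example call (scan.pdf is a placeholder):
#   if has_text_layer scan.pdf; then
#       echo "text found"
#   else
#       echo "no text found; consider OCR"
#   fi
```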
HISTORY
pdftotext was originally developed as part of the Xpdf project by Derek Noonburg in the late 1990s. The tool has since been incorporated into the Poppler library, a fork of Xpdf that has become the standard PDF rendering library on many Linux distributions. Both versions continue to be maintained, with Poppler receiving more active development and becoming the default on most modern systems.
