tabula
Extract tables from PDF documents
TLDR
Extract tables from PDF
SYNOPSIS
tabula [-p pages] [-o file] [-f format] [options] pdf
DESCRIPTION
tabula extracts tabular data from PDF documents and converts it into structured formats such as CSV, JSON, or TSV. It is designed for liberating data trapped in PDFs, where tables are visually rendered but not stored as actual data structures.
The tool offers two extraction modes. Lattice mode detects tables by looking for ruling lines between cells, while stream mode uses whitespace and text alignment to identify column boundaries. By default tabula chooses between the two automatically, but selecting a mode manually often improves accuracy for specific document layouts, for example forcing lattice mode with -l when a table is drawn with a full grid. The area option (-a) restricts extraction to a specific page region when only part of a page contains the desired table.
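The mode choice described above can be sketched as a small Python helper that assembles the command line. This is only an illustration under stated assumptions: the function name build_tabula_args is hypothetical, the executable name tabula follows the synopsis, and -t (force stream mode) is the tabula-java counterpart of -l, which this page does not list.

```python
from typing import List, Optional


def build_tabula_args(pdf: str, mode: Optional[str] = None,
                      area: Optional[str] = None) -> List[str]:
    """Assemble a tabula command line (hypothetical helper, not part of tabula)."""
    args = ["tabula"]  # assumption: the executable is on PATH under this name
    if mode == "lattice":
        args.append("-l")  # tables drawn with ruling lines between cells
    elif mode == "stream":
        args.append("-t")  # assumption: tabula-java's flag for stream mode
    elif mode is not None:
        raise ValueError(f"unknown mode: {mode!r}")
    # mode is None: no flag is added, so tabula picks the mode itself
    if area is not None:
        args += ["-a", area]  # limit extraction to one page region
    args.append(pdf)
    return args


print(build_tabula_args("report.pdf", mode="lattice"))
```

Returning a list rather than a single string keeps each argument separate, so file names containing spaces do not need shell quoting when the list is later passed to a process launcher.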
Tabula runs as a Java application and can process specific pages or entire documents. It was originally created as a web application for journalists needing to extract data from government reports and financial disclosures, and the command-line version provides the same extraction engine for scripting and automation workflows.
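For the scripting use case mentioned above, a minimal Python sketch might run the command and parse its CSV output. The helper names extract_rows and parse_csv_output are hypothetical, and the sketch assumes that tabula is on PATH and writes to standard output when -o is not given. Only the parsing step is exercised in the example, using made-up sample text rather than real extraction results.

```python
import csv
import io
import subprocess
from typing import List


def parse_csv_output(text: str) -> List[List[str]]:
    """Turn CSV text into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def extract_rows(pdf: str, pages: str = "1") -> List[List[str]]:
    """Run tabula on `pdf` and return the extracted rows.

    Assumption: without -o, the CSV is written to standard output.
    """
    result = subprocess.run(
        ["tabula", "-p", pages, "-f", "CSV", pdf],
        capture_output=True, text=True, check=True,
    )
    return parse_csv_output(result.stdout)


# Illustrative input only; not output from a real PDF.
sample = 'Year,Revenue\n2023,"1,200"\n\n2024,"1,450"\n'
print(parse_csv_output(sample))
```

Using check=True makes a failed extraction raise an exception instead of silently returning an empty table, which is usually the safer default in an automated pipeline.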
PARAMETERS
-p PAGES
Page numbers.
-o FILE
Output file.
-f FORMAT
Output format (CSV, JSON, TSV).
-a AREA
Extraction area.
-g
Guess table areas.
-l
Force lattice mode.
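The values passed to -p and -a can be generated rather than typed by hand. The sketch below assumes the conventions documented by tabula-java, which this page does not spell out: -p accepts comma-separated pages and ranges such as 1-3,5, and -a takes top,left,bottom,right in PDF points (72 per inch) measured from the top-left corner of the page. The helper names pages_arg and area_arg are hypothetical.

```python
from typing import Iterable, List

POINTS_PER_INCH = 72  # PDF user-space units per inch


def pages_arg(pages: Iterable[int]) -> str:
    """Collapse page numbers into a compact spec such as '1-3,5'."""
    nums = sorted(set(pages))
    if not nums:
        raise ValueError("no pages given")
    parts: List[str] = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n == prev + 1:  # still inside a consecutive run
            prev = n
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = n
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def area_arg(top: float, left: float, bottom: float, right: float) -> str:
    """Format a region given in inches as top,left,bottom,right in points."""
    if bottom <= top or right <= left:
        raise ValueError("area must have positive height and width")
    return ",".join(f"{v * POINTS_PER_INCH:g}" for v in (top, left, bottom, right))


print(pages_arg([1, 2, 3, 5]))
print(area_arg(1, 0.5, 10, 8))
```

The resulting strings can be passed directly as the arguments to -p and -a.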
CAVEATS
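Requires a Java runtime. Detection accuracy varies with document layout, so extracted data should be checked against the source PDF. Complex tables, such as those with merged cells or inconsistent column alignment, may be extracted incorrectly or not at all.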
HISTORY
Tabula was created by journalists at ProPublica and The New York Times for extracting data from PDF documents.
