LinuxCommandLibrary

tabula

Extract tables from PDF documents

TLDR

Extract tables from the first page of a PDF and print them as CSV

$ tabula [document.pdf]
Write the output to a file (CSV by default)
$ tabula -o [output.csv] [document.pdf]
Specific pages
$ tabula -p [1,2,3] [document.pdf]
JSON output
$ tabula -f JSON [document.pdf]
All pages
$ tabula -p all [document.pdf]
Restrict extraction to an area (top,left,bottom,right in points)
$ tabula -a [0,0,100,100] [document.pdf]

SYNOPSIS

tabula [-p pages] [-o file] [-f format] [options] pdf

DESCRIPTION

tabula extracts tabular data from PDF documents and converts it into structured formats such as CSV, JSON, or TSV. It is designed for liberating data trapped in PDFs, where tables are visually rendered but not stored as actual data structures.
The tool offers two extraction modes: lattice mode detects tables by looking for ruling lines between cells, while stream mode uses whitespace and text alignment to identify column boundaries. Automatic detection chooses the best approach, but manual mode selection often improves accuracy for specific document layouts. An area option allows targeting specific page regions when only part of a page contains the desired table.
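When it is unclear which mode suits a document, one approach is to run both and compare the results. The sketch below assumes that -l and -t are the lattice and stream switches of the underlying tabula-java engine; extract_both is a hypothetical helper name, and the output file names are illustrative only.

```shell
#!/bin/sh
# Sketch: run both extraction modes on one PDF so the results can be compared.
# extract_both is a hypothetical helper, not part of tabula.
extract_both() {
    pdf=$1
    base=${pdf%.pdf}
    # Lattice mode: relies on ruling lines between cells.
    tabula -l -p all -o "${base}.lattice.csv" "$pdf" || return 1
    # Stream mode: relies on whitespace and text alignment.
    tabula -t -p all -o "${base}.stream.csv" "$pdf" || return 1
    # Line counts give a rough hint of which mode found more rows.
    wc -l "${base}.lattice.csv" "${base}.stream.csv"
}
```

Calling extract_both report.pdf would write report.lattice.csv and report.stream.csv next to the input. A higher line count does not guarantee a better extraction, so the two files are still worth inspecting by eye.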
Tabula runs as a Java application and can process specific pages or entire documents. It was originally created as a web application for journalists needing to extract data from government reports and financial disclosures, and the command-line version provides the same extraction engine for scripting and automation workflows.
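For automation, a small wrapper can convert every PDF in a directory. This is a minimal sketch using only the -p, -f, and -o options documented here; convert_dir is a hypothetical helper name, and the directory layout is an assumption.

```shell
#!/bin/sh
# Sketch: convert every PDF in a source directory to CSV in a destination directory.
# convert_dir is a hypothetical helper, not part of tabula.
convert_dir() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for pdf in "$src"/*.pdf; do
        # Skip the unexpanded pattern when the directory holds no PDFs.
        [ -e "$pdf" ] || continue
        name=$(basename "$pdf" .pdf)
        # Report failures but keep going with the remaining files.
        if ! tabula -p all -f CSV -o "$dest/$name.csv" "$pdf"; then
            echo "failed: $pdf" >&2
        fi
    done
}
```

For example, convert_dir reports csv would produce one CSV file per PDF found in reports. Each invocation starts a new Java process, so large batches may be slow.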

PARAMETERS

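-p PAGES
Pages to process: comma-separated page numbers or ranges, or all. Defaults to page 1.
-o FILE
Write output to FILE instead of standard output.
-f FORMAT
Output format (CSV, JSON, TSV). Defaults to CSV.
-a AREA
Page region to analyze, given as top,left,bottom,right in points.
-g
Guess the portion of each page that contains a table.
-l
Force lattice mode.
-t
Force stream mode.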
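The options can be combined in a single call. The sketch below pulls one region of one page as JSON; extract_region is a hypothetical helper name, and the area is assumed to follow the top,left,bottom,right order in points.

```shell
#!/bin/sh
# Sketch: extract a single page region as JSON on standard output.
# extract_region is a hypothetical helper, not part of tabula.
extract_region() {
    pdf=$1
    page=$2
    area=$3   # top,left,bottom,right, assumed to be in points
    tabula -p "$page" -a "$area" -f JSON "$pdf"
}
```

For example, extract_region report.pdf 2 100,50,400,550 limits extraction to that rectangle on page 2. The coordinates shown are placeholders and would need to be measured for a real document.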

CAVEATS

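Requires a Java runtime. Detection accuracy varies with document layout, so results should be checked against the source PDF. Complex tables, such as those with merged cells or no clear ruling lines, may be extracted incorrectly or not at all. Only text-based PDFs are supported; scanned pages need OCR first.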

HISTORY

Tabula was created by journalists at ProPublica and The New York Times for extracting data from PDF documents.
