camelot

TLDR

Extract tables from a PDF using lattice mode (for bordered tables)
$ camelot -p [1] -f [format] -o [output_file] lattice [document.pdf]
Extract tables and save as CSV (one file per table)
$ camelot -p [1] -f csv -o [output.csv] lattice [document.pdf]
Extract tables from multiple pages
$ camelot -p [1,2,3] -f [csv] -o [output.csv] lattice [document.pdf]
Extract using stream mode (for borderless tables)
$ camelot -p [1] -f [csv] -o [output.csv] stream [document.pdf]
Extract with table area specification
$ camelot -p [1] -f [csv] -o [output.csv] lattice -T [50,700,500,100] [document.pdf]
Show a debugging plot of the detected text (requires matplotlib)
$ camelot -p [1] lattice -plot text [document.pdf]
Export to another format, such as JSON
$ camelot -p [1] -f [json] -o [output.json] lattice [document.pdf]
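
The page specifications passed to -p can be pictured with a small sketch. The expand_pages helper below is hypothetical and is not Camelot's own parser. It assumes comma-separated numbers and inclusive ranges, plus the "end" and "all" keywords described in Camelot's documentation, and it needs the page count supplied as last_page.

```python
def expand_pages(spec, last_page):
    """Expand a page spec such as "1,3-5" into a sorted list of page numbers.

    Illustrative only: this mirrors the syntax accepted by -p,
    it is not Camelot's implementation.
    """
    if spec.strip() == "all":
        return list(range(1, last_page + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # Inclusive range; "end" stands for the last page of the document.
            start, end = (piece.strip() for piece in part.split("-", 1))
            stop = last_page if end == "end" else int(end)
            pages.update(range(int(start), stop + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

For a six-page document, expand_pages("1,4-end", 6) selects pages 1, 4, 5 and 6.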

SYNOPSIS

camelot [options] command [command-options] pdffile
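
When driving the CLI from a script, argument order matters. The sketch below assumes the layout used by the camelot-py command line, where general options such as --pages, --format and --output come before the lattice or stream command and command-specific options follow it. The build_camelot_command helper and its defaults are hypothetical names chosen for illustration.

```python
import shlex

FLAVORS = ("lattice", "stream")


def build_camelot_command(pdf, flavor="lattice", pages="1", fmt="csv",
                          output="tables.csv", flavor_options=()):
    """Return an argv list for the camelot CLI (assumed option layout)."""
    if flavor not in FLAVORS:
        raise ValueError(f"flavor must be one of {FLAVORS}, got {flavor!r}")
    # General options go before the command name.
    argv = ["camelot", "--pages", pages, "--format", fmt, "--output", output]
    argv.append(flavor)
    # Command-specific options (for example -T or -C) go after it.
    argv.extend(flavor_options)
    argv.append(pdf)
    return argv


command = build_camelot_command("report.pdf", flavor="stream", pages="1,3")
print(shlex.join(command))
# prints: camelot --pages 1,3 --format csv --output tables.csv stream report.pdf
```

The resulting list can be handed to subprocess.run(command, check=True) on a machine where Camelot is installed.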

DESCRIPTION

Camelot is a Python library and CLI tool for extracting tabular data from PDF files. It reads the text layer of each page with PDFMiner and, in lattice mode, uses OpenCV image processing to locate the ruling lines of a table, then exports the contents in structured formats such as CSV or JSON.
Two extraction methods are available: lattice mode detects tables with visible borders by looking for intersecting lines, while stream mode finds tables based on whitespace patterns, suitable for borderless tables.
The tool handles multi-page extraction, merged cells, and various output formats. Visual debugging helps understand how tables are detected and tune extraction parameters for difficult PDFs.
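
The same two modes are available from Python through camelot.read_pdf and its flavor argument. The sketch below tries lattice first and falls back to stream when no bordered table is found. The extract_with_fallback function and its injectable read_pdf parameter are hypothetical conveniences for illustration, not part of Camelot, and the fallback rule is a simple heuristic, not a recommendation from the project.

```python
def extract_with_fallback(path, pages="1", read_pdf=None):
    """Try lattice mode first, then stream mode if no table was detected.

    Returns a (flavor, tables) pair. `read_pdf` defaults to camelot.read_pdf;
    passing another callable makes the function easy to test without a PDF.
    """
    if read_pdf is None:
        import camelot  # third-party package: pip install camelot-py
        read_pdf = camelot.read_pdf

    tables = read_pdf(path, pages=pages, flavor="lattice")
    if tables.n > 0:  # .n is the number of tables found
        return "lattice", tables
    return "stream", read_pdf(path, pages=pages, flavor="stream")


# Example usage (needs Camelot and a real, text-based PDF):
#   flavor, tables = extract_with_fallback("document.pdf", pages="1-3")
#   print(flavor, tables.n)
#   print(tables[0].df.head())          # each table exposes a pandas DataFrame
#   tables.export("output.csv", f="csv")
```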

PARAMETERS

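general options (placed before the command)

-p, --pages pages
Page numbers to process (e.g., "1", "1-5", "1,3,5", "1,4-end", "all"). Defaults to page 1.
-o, --output file
Output file path. Required unless -plot is used.
-f, --format format
Output format: csv, excel, html, json, markdown, sqlite. Required unless -plot is used.
-z, --zip
Bundle the exported files into a ZIP archive.
-split, --split_text
Split text that spans multiple cells.

commands

lattice
Extract tables with visible borders by detecting the lines between cells.
stream
Extract borderless tables from the whitespace between text.

command options (placed after lattice or stream)

-T, --table_areas coords
Table boundaries as x1,y1,x2,y2, where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinate space.
-C, --columns coords
Comma-separated x-coordinates of column separators (stream only).
-plot, --plot_type type
Show a debug plot instead of exporting: text, grid, contour (both commands), joint, line (lattice only), textedge (stream only).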
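
Table area coordinates are a common stumbling block, because PDF coordinate space puts the origin at the bottom-left of the page while many image viewers measure from the top-left. The sketch below converts a box measured from the top-left into the x1,y1,x2,y2 string expected by -T. The to_table_area helper is a hypothetical name, and the page height of 792 points assumes a US Letter page.

```python
def to_table_area(left, top, right, bottom, page_height):
    """Convert a top-left-origin box (in points) to a -T string.

    `top` and `bottom` are distances measured downward from the top edge
    of the page; the result uses PDF space, where y grows upward.
    """
    y1 = page_height - top      # top edge of the table in PDF space
    y2 = page_height - bottom   # bottom edge of the table in PDF space
    return f"{left:g},{y1:g},{right:g},{y2:g}"


# A table whose top edge is 92 pt and bottom edge 692 pt below the top
# of a 792 pt tall page:
area = to_table_area(50, 92, 500, 692, page_height=792)
print(area)  # prints: 50,700,500,100
```

Because y grows upward in PDF space, the first y value in the result is larger than the second.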

CAVEATS

Camelot only works with text-based PDFs; scanned documents require OCR first. Complex table layouts with nested tables or irregular structures may require manual parameter tuning. Stream mode accuracy depends heavily on consistent spacing. Large PDFs may consume significant memory.

HISTORY

Camelot was created by Vinayak Mehta at SocialCops and released in 2018 as an open-source alternative to commercial PDF table extraction tools. Its name recalls the Camelot Project, the Adobe effort from which the PDF format emerged; Excalibur is the name of its companion web interface. The project gained popularity for making table extraction accessible and programmable, filling a gap in the Python data science ecosystem.
