camelot
TLDR
Extract tables from a PDF
SYNOPSIS
camelot [options] command [command-options] pdffile
DESCRIPTION
Camelot is a Python library and CLI tool for extracting tabular data from PDF files. It locates tables either by detecting ruling lines with image processing or by analyzing the positions of text on the page, and returns their contents in structured formats.
Two extraction methods are available: lattice mode detects tables with visible borders by looking for intersecting lines, while stream mode finds tables based on whitespace patterns, suitable for borderless tables.
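The whitespace idea behind stream mode can be illustrated with a toy sketch. This is not Camelot's implementation (Camelot works on text coordinates taken from the PDF, not on fixed-width strings); the function names and the min_gap threshold are assumptions made for illustration only.

```python
def find_column_gaps(lines, min_gap=2):
    """Return (start, end) character ranges that are blank on every line."""
    if not lines:
        return []
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A character column counts as blank only if every row has a space there.
    blank = [all(row[i] == " " for row in padded) for i in range(width)]
    gaps, start = [], None
    # The trailing False closes a gap that runs to the end of the line.
    for i, is_blank in enumerate(blank + [False]):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            # Keep interior gaps only; leading/trailing padding is ignored.
            if i - start >= min_gap and start > 0 and i < width:
                gaps.append((start, i))
            start = None
    return gaps


def split_rows(lines, min_gap=2):
    """Split fixed-width text rows into cells at the shared blank gaps."""
    if not lines:
        return []
    width = max(len(line) for line in lines)
    gaps = find_column_gaps(lines, min_gap)
    edges = [0] + [pos for gap in gaps for pos in gap] + [width]
    spans = list(zip(edges[0::2], edges[1::2]))
    return [[line.ljust(width)[a:b].strip() for a, b in spans] for line in lines]


if __name__ == "__main__":
    sample = [
        "Name     Qty   Price",
        "Widget   10    2.50",
        "Gadget   3     19.99",
    ]
    for row in split_rows(sample):
        print(row)
```

The sketch also shows why stream mode depends on consistent spacing: a single row whose text spills into a gap removes that gap for the whole table.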
The tool handles multi-page extraction, merged cells, and various output formats. Visual debugging helps understand how tables are detected and tune extraction parameters for difficult PDFs.
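From Python, a debugging pass might look like the sketch below. It assumes camelot and matplotlib are installed and uses camelot.read_pdf, camelot.plot, and each table's parsing_report dictionary; the helper names and the 90.0 accuracy threshold are hypothetical choices, not Camelot defaults.

```python
def weak_tables(reports, min_accuracy=90.0):
    """Return the parsing reports whose accuracy is below min_accuracy."""
    return [r for r in reports if r["accuracy"] < min_accuracy]


def inspect_pdf(path, pages="1", flavor="lattice"):
    """Extract tables, plot the first one, and return low-accuracy reports."""
    # Imported lazily so weak_tables() can be used without camelot installed.
    import camelot
    import matplotlib.pyplot as plt

    tables = camelot.read_pdf(path, pages=pages, flavor=flavor)
    if tables.n:
        # "grid" draws the detected table structure over the page.
        camelot.plot(tables[0], kind="grid")
        plt.show()
    return weak_tables([t.parsing_report for t in tables])
```

Tables returned by inspect_pdf are candidates for retrying with the other flavor or with explicit table areas.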
PARAMETERS
lattice
Extract tables with visible borders (ruling lines).
stream
Extract borderless tables based on whitespace. Command-specific options (-T, -C, -plot) follow the command; all other options precede it.
-p, --pages pages
Page numbers to process (e.g., "1", "1-5", "1,3,5", "all").
-o, --output file
Output file path.
-f, --format format
Output format: csv, excel, html, json, markdown, sqlite.
-T, --table_areas coords
Table boundaries as x1,y1,x2,y2. May be given more than once.
-C, --columns coords
Column separators (x-coordinates) for stream mode.
-plot, --plot_type type
Generate debug plot: text, grid, contour, joint, line.
-z, --zip
Compress output into a ZIP archive.
-split, --split_text
Split text that spans multiple cells.
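As a sketch of how these options combine, the helper below assembles an argument list for the CLI. It assumes the layout reported by camelot --help, where the extraction mode is given as a command (lattice or stream) after the general options and --table_areas follows the command; the function name, defaults, and file names are hypothetical.

```python
def camelot_argv(pdf, flavor="lattice", pages="1", fmt="csv",
                 output="tables.csv", table_areas=None):
    """Build an argument list for the camelot CLI (assumed option layout)."""
    if flavor not in ("lattice", "stream"):
        raise ValueError("flavor must be 'lattice' or 'stream'")
    # General options come before the command.
    argv = ["camelot", "--pages", pages, "--format", fmt,
            "--output", output, flavor]
    # Command options come after it; one --table_areas per region.
    for area in table_areas or []:
        argv += ["--table_areas", ",".join(str(v) for v in area)]
    argv.append(pdf)
    return argv


if __name__ == "__main__":
    print(" ".join(camelot_argv("report.pdf", flavor="stream", pages="1-3")))
```

The returned list can be passed to subprocess.run once camelot is installed and the PDF exists.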
CAVEATS
Camelot only works with text-based PDFs; scanned documents require OCR first. Complex table layouts with nested tables or irregular structures may require manual parameter tuning. Stream mode accuracy depends heavily on consistent spacing. Large PDFs may consume significant memory.
HISTORY
Camelot was created by Vinayak Mehta and released in 2018 as an open-source alternative to commercial PDF table extraction tools. Named after the legendary castle, it was designed to be "excalibur for PDF table extraction." The project gained popularity for making table extraction accessible and programmable, filling a gap in the Python data science ecosystem.
SEE ALSO
tabula(1), pdftotext(1), pdfplumber(1)


