LinuxCommandLibrary

pdf2json

TLDR

Convert PDF to JSON

$ pdf2json [input.pdf] [output.json]
Convert first page only
$ pdf2json -f [1] -l [1] [input.pdf] [output.json]
Include form fields
$ pdf2json -form [input.pdf] [output.json]
Split pages to separate files
$ pdf2json -split [input.pdf] [output_prefix]

SYNOPSIS

pdf2json [options] input.pdf [output.json]

DESCRIPTION

pdf2json extracts PDF content to JSON format. It captures text, positions, fonts, and form fields, enabling programmatic access to PDF data.
Useful for data extraction, indexing, and PDF analysis.

PARAMETERS

-f num
First page to convert.
-l num
Last page to convert.
-form
Include form field data in the output.
-split
Write one output file per page.
-enc encoding
Text encoding for the output.
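
When pdf2json is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch in Python that uses only the options listed above; the function name build_pdf2json_command and its keyword arguments are hypothetical, and the sketch only builds the command without running it.

```python
import shlex


def build_pdf2json_command(input_pdf, output_json=None, first=None, last=None,
                           form=False, split=False, encoding=None):
    """Assemble a pdf2json argument list (hypothetical helper, not part of pdf2json)."""
    cmd = ["pdf2json"]
    if first is not None:
        cmd += ["-f", str(first)]      # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]       # last page to convert
    if form:
        cmd.append("-form")            # include form data
    if split:
        cmd.append("-split")           # one file per page
    if encoding is not None:
        cmd += ["-enc", encoding]      # text encoding
    cmd.append(input_pdf)
    if output_json is not None:
        cmd.append(output_json)
    return cmd


# Equivalent to the "first page only" example in the TLDR section.
cmd = build_pdf2json_command("input.pdf", "output.json", first=1, last=1)
print(shlex.join(cmd))  # prints: pdf2json -f 1 -l 1 input.pdf output.json
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a system where pdf2json is installed.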

OUTPUT STRUCTURE

{
  "pages": [
    {
      "width": 612,
      "height": 792,
      "texts": [
        {"x": 72, "y": 720, "text": "Hello"}
      ]
    }
  ]
}
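
As an illustration of reading this structure, here is a small Python sketch that walks the pages and yields each text item with its position. It assumes the output matches the shape shown above exactly (a top-level "pages" list whose entries hold a "texts" list); real output may contain additional or differently named fields. The helper name iter_texts is hypothetical.

```python
import json

# Sample data copied from the structure shown above.
SAMPLE = """
{
  "pages": [
    {
      "width": 612,
      "height": 792,
      "texts": [
        {"x": 72, "y": 720, "text": "Hello"}
      ]
    }
  ]
}
"""


def iter_texts(document):
    """Yield (page_number, x, y, text) for every text item; pages count from 1."""
    for page_number, page in enumerate(document.get("pages", []), start=1):
        for item in page.get("texts", []):
            yield page_number, item["x"], item["y"], item["text"]


document = json.loads(SAMPLE)
for page_number, x, y, text in iter_texts(document):
    print(page_number, x, y, text)  # prints: 1 72 720 Hello
```

For a file written by pdf2json, replace json.loads(SAMPLE) with json.load() on the opened output file.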

CAVEATS

Text extraction quality varies from document to document. Complex layouts may not preserve their structure in the output, and images are not extracted.
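
Because layout structure is not guaranteed, a consumer often has to rebuild reading order from the coordinates. The sketch below groups text items into lines by their y value and orders each line by x. It assumes the item shape from OUTPUT STRUCTURE and, by default, that y grows upward as in PDF user space; both are assumptions to check against real output. The function name group_into_lines and the tolerance value are hypothetical.

```python
def group_into_lines(texts, tolerance=2.0, y_grows_upward=True):
    """Group text items whose y values lie within `tolerance` of a line's first item.

    Returns one string per line, ordered top to bottom, with the items
    of each line joined left to right by a single space.
    """
    # Visit items from the top of the page to the bottom.
    ordered = sorted(texts, key=lambda t: t["y"], reverse=y_grows_upward)
    lines = []
    for item in ordered:
        # Compare against the first item of the current line to avoid drift.
        if lines and abs(lines[-1][0]["y"] - item["y"]) <= tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    return [
        " ".join(t["text"] for t in sorted(line, key=lambda t: t["x"]))
        for line in lines
    ]


texts = [
    {"x": 120, "y": 720, "text": "world"},
    {"x": 72, "y": 721, "text": "Hello"},
    {"x": 72, "y": 700, "text": "Second"},
]
print(group_into_lines(texts))  # prints: ['Hello world', 'Second']
```

This simple grouping will still misorder multi-column pages and tables, which need column detection before lines are assembled.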

HISTORY

pdf2json is based on PDF.js or similar libraries, providing JSON export for PDF processing pipelines.

SEE ALSO
