pdf2json

Extract PDF content to JSON format

Convert PDF to JSON

$ pdf2json [input.pdf] [output.json]

Convert first page only

$ pdf2json -f [1] -l [1] [input.pdf] [output.json]

Include form fields

$ pdf2json -form [input.pdf] [output.json]

Split pages to separate files

$ pdf2json -split [input.pdf] [output_prefix]

pdf2json [options] input.pdf [output.json]

pdf2json extracts PDF content to JSON format. It captures text, positions, fonts, and form fields, enabling programmatic access to PDF data.

-f num

First page.

-l num

Last page.

-form

Include form data.

-split

One file per page.

-enc encoding

Text encoding.
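The options above can be assembled into a command line from a script. The following is a minimal sketch in Python, assuming the flags behave as listed on this page; the helper name `build_command` and its parameters are hypothetical, and the function only builds the argument list without running anything.

```python
def build_command(input_pdf, output_json, first=None, last=None,
                  form=False, encoding=None):
    """Build a pdf2json argument list from the options listed on this page.

    build_command is a hypothetical helper, not part of pdf2json itself.
    """
    cmd = ["pdf2json"]
    if first is not None:
        cmd += ["-f", str(first)]      # first page
    if last is not None:
        cmd += ["-l", str(last)]       # last page
    if form:
        cmd.append("-form")            # include form data
    if encoding is not None:
        cmd += ["-enc", encoding]      # text encoding
    cmd += [input_pdf, output_json]
    return cmd


if __name__ == "__main__":
    # Equivalent to: pdf2json -f 1 -l 1 input.pdf output.json
    print(build_command("input.pdf", "output.json", first=1, last=1))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` if pdf2json is installed and on the PATH.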

{
  "pages": [
    {
      "width": 612,
      "height": 792,
      "texts": [
        {"x": 72, "y": 720, "text": "Hello"}
      ]
    }
  ]
}
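Output of this shape can be read with any JSON parser. The sketch below, in Python, flattens the sample above into one tuple per text item. It assumes the key names shown in the sample (`pages`, `texts`, `x`, `y`, `text`); real output may use different or additional keys, and `extract_texts` is a hypothetical helper name.

```python
import json

# Sample copied from this page; actual pdf2json output may differ.
SAMPLE = """
{
  "pages": [
    {
      "width": 612,
      "height": 792,
      "texts": [
        {"x": 72, "y": 720, "text": "Hello"}
      ]
    }
  ]
}
"""


def extract_texts(doc):
    """Flatten a parsed document into (page_number, x, y, text) tuples."""
    rows = []
    # Pages are numbered from 1, in the order they appear in the file.
    for page_number, page in enumerate(doc.get("pages", []), start=1):
        for item in page.get("texts", []):
            rows.append((page_number, item["x"], item["y"], item["text"]))
    return rows


if __name__ == "__main__":
    for row in extract_texts(json.loads(SAMPLE)):
        print(row)
```

To read a file written by pdf2json instead of the inline sample, replace `json.loads(SAMPLE)` with `json.load(open("output.json"))`.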

Text extraction quality varies from document to document. Complex layouts may lose their structure in the output. Images are not extracted.

pdf2json is based on PDF.js or similar libraries, providing JSON export for PDF processing pipelines.