LinuxCommandLibrary

pdf-parser

Analyze and extract data from PDF files

TLDR

Display statistics for a PDF file

$ pdf-parser [[-a|--stats]] [path/to/file.pdf]

Display objects of a specific type (/Font, /URI, ...) in a PDF file

$ pdf-parser [[-t|--type]] [/object_type] [path/to/file.pdf]

Search for strings in indirect objects

$ pdf-parser [[-s|--search]] [search_string] [path/to/file.pdf]

SYNOPSIS

pdf-parser.py [options] <pdf-file>

PARAMETERS

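-a, --stats
    display statistics for the PDF document

-c, --content
    display the content of objects without streams, or with streams that have no filters

-d <file>, --dump=<file>
    file to dump stream content to

-D, --debug
    display debug information

-e <elements>, --elements=<elements>
    type of elements to select (cxtsi)

-f, --filter
    pass stream objects through filters (FlateDecode, ASCIIHexDecode, ASCII85Decode, LZWDecode and RunLengthDecode only)

-g, --generate
    generate a Python program that creates the parsed PDF file

-H, --hash
    display the hash of objects

-k <key>, --key=<key>
    key to search for in dictionaries

-m, --man
    print the manual

-n, --nocanonicalizedoutput
    do not canonicalize the output

-o <ids>, --object=<ids>
    id(s) of the indirect object(s) to select, separated by commas

-O, --objstm
    parse the streams of /ObjStm objects

-r <id>, --reference=<id>
    select the objects that reference the indirect object with this id

-s <string>, --search=<string>
    string to search for in indirect objects (streams excluded)

-t <type>, --type=<type>
    type of indirect object to select

-v, --verbose
    display malformed PDF elements

-w, --raw
    raw output for data and filters

-x <file>, --extract=<file>
    file to extract malformed content to

-y <rule>, --yara=<rule>
    YARA rule (or directory) to check streams against

-h, --help
    show help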

DESCRIPTION

pdf-parser is a specialized Python-based command-line tool developed by Didier Stevens for dissecting PDF files at a low level. It parses the internal structure of PDF documents, extracting and displaying objects, streams, cross-reference tables, and trailers without rendering the file. This makes it invaluable for digital forensics, malware analysis, and security research, particularly when investigating malicious PDFs that exploit vulnerabilities.

Key capabilities include searching for strings in indirect objects, selecting objects by id or type, finding the objects that reference a given object, and passing streams through their decoding filters. Unlike viewers and text extractors such as evince or pdftotext, pdf-parser works directly on the file's raw structure, revealing hidden or obfuscated content such as JavaScript, embedded files, or anomalous structures.
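To make "working on the raw structure" concrete, the following is a minimal Python sketch of scanning PDF bytes for indirect objects and searching their bodies. It is not pdf-parser's own code: the function names and the sample bytes are made up for illustration, and the regular expression deliberately ignores object streams, encryption, and malformed files.

```python
import re

# Matches "<number> <generation> obj ... endobj" in raw PDF bytes.
# Simplified on purpose: no object streams, no recovery from malformed input.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)


def list_indirect_objects(data: bytes):
    """Return (object number, generation, body) for each indirect object."""
    return [
        (int(m.group(1)), int(m.group(2)), m.group(3).strip())
        for m in OBJ_RE.finditer(data)
    ]


def objects_containing(data: bytes, needle: bytes):
    """Return the numbers of the objects whose body contains `needle`."""
    return [num for num, _gen, body in list_indirect_objects(data) if needle in body]


# A tiny hand-written fragment (hypothetical, not a complete or valid PDF).
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert\\(1\\)) >>\nendobj\n"
)

if __name__ == "__main__":
    for num, gen, body in list_indirect_objects(SAMPLE):
        print(f"obj {num} {gen}: {body.decode('latin-1')}")
    print(objects_containing(SAMPLE, b"/JavaScript"))
```

A real parser has to handle far more cases than this, which is the reason to reach for pdf-parser instead of ad hoc pattern matching.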

It is lightweight and scriptable, and it prints objects as human-readable text, which aids reverse engineering. It is commonly used in incident response to identify PDF-based attacks.

CAVEATS

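pdf-parser is a Python script (it runs under Python 2 or 3). It is generally not available in standard distribution repositories, so it is usually downloaded from Didier Stevens' site. Large PDFs may consume a lot of memory. It does not render PDFs; it performs purely structural analysis.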

INSTALLATION

Download pdf-parser.py from Didier Stevens' blog and run it with python pdf-parser.py. There is no package manager install.
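Since the tool is a plain script, it can also be driven from another Python program. The sketch below assumes pdf-parser.py sits in the current directory; the helper names build_command and run_pdf_parser are hypothetical, and only options shown in the TLDR section (--stats, --type, --search) are used.

```python
import subprocess
import sys


def build_command(pdf_path, *options, script="pdf-parser.py"):
    """Build the argument list that runs the script with the current interpreter."""
    return [sys.executable, script, *options, pdf_path]


def run_pdf_parser(pdf_path, *options, script="pdf-parser.py"):
    """Run pdf-parser.py and return its standard output as text.

    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    result = subprocess.run(
        build_command(pdf_path, *options, script=script),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Assumes a local copy of pdf-parser.py; the PDF path is taken from argv.
    if len(sys.argv) > 1:
        print(run_pdf_parser(sys.argv[1], "--stats"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file names, which matters when handling untrusted samples.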

OUTPUT FORMAT

Each indirect object is printed with a header such as 'obj 1 0', followed by its type, the objects it references, and its dictionary. Use -w for raw output of data and filters.
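Because each object begins with a header line of the form 'obj 1 0', saved output can be indexed with a few lines of Python. This is a rough sketch: the sample report is hand-written to resemble the tool's output, the indented "Type:" line is an assumption about the layout, and summarize is a hypothetical helper name.

```python
import re

HEADER_RE = re.compile(r"^obj (\d+) (\d+)\s*$")
# Assumed layout: an indented "Type: /Name" line after the header.
TYPE_RE = re.compile(r"^\s*Type:\s*(\S*)")


def summarize(report: str):
    """Return a list of (object number, generation, type or None)."""
    entries = []
    for line in report.splitlines():
        header = HEADER_RE.match(line)
        if header:
            entries.append([int(header.group(1)), int(header.group(2)), None])
            continue
        kind = TYPE_RE.match(line)
        if kind and entries and entries[-1][2] is None:
            entries[-1][2] = kind.group(1) or None
    return [tuple(entry) for entry in entries]


# Hand-written sample that imitates the report layout (an assumption).
SAMPLE = """\
obj 1 0
 Type: /Catalog
 Referencing: 2 0 R

  <<
    /Type /Catalog
    /OpenAction 2 0 R
  >>

obj 2 0
 Type: /Action
 Referencing:

  <<
    /S /JavaScript
  >>
"""

if __name__ == "__main__":
    for number, generation, kind in summarize(SAMPLE):
        print(number, generation, kind)
```

If the real layout differs from the assumed one, only the two regular expressions need adjusting.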

HISTORY

Created by Didier Stevens in 2008 amid rising PDF malware threats; evolved through versions for better stream handling and encryption support. Actively maintained on blog.didierstevens.com.

SEE ALSO

pdfid(1), qpdf(1), mutool(1), pdftk(1)
