pdf-parser

Analyze and extract data from PDF files

TLDR

Display statistics for a PDF file

$ pdf-parser [[-a|--stats]] [path/to/file.pdf]

Display objects of a specific type (/Font, /URI, ...) in a PDF file

$ pdf-parser [[-t|--type]] [/object_type] [path/to/file.pdf]

Search for strings in indirect objects

$ pdf-parser [[-s|--search]] [search_string] [path/to/file.pdf]

--object
    Displays the content of the specified object ID(s). Multiple IDs can be provided comma-separated (e.g., 1,2,5).

--filter
    Passes the selected stream objects through their filters (such as FlateDecode) and shows the decoded content. Typically combined with --object.

--search
    Searches indirect objects for the specified string. Stream contents are not searched; use --searchstream for those. Useful for finding keywords or indicators of compromise.

--raw
    Produces raw output for data and filters instead of the default formatted output.

--dump
    Dumps the content of a selected object or stream to a specified file, allowing external analysis of the extracted data.

--stats
    Provides statistics about the PDF document, such as object types, counts, and potential anomalies in its structure.

--type
    Filters and displays only objects of a specific type (e.g., /Page, /XRef, /Annot, /JavaScript).

--hash
    Displays hashes of the selected objects, which helps match content against known malicious samples.

--debug
    Displays debug information about the parsing process, which is useful for troubleshooting.

DESCRIPTION

pdf-parser is a Python tool written by Didier Stevens for analyzing Portable Document Format (PDF) files. It parses the internal structure of a PDF and lets users examine its objects, streams, and cross-reference table. It is aimed at security researchers, digital forensics investigators, and anyone who needs to understand the low-level composition of a PDF.

It can extract raw data, identify compressed streams, and help uncover malicious content by breaking the document into its constituent parts. Users can select objects by ID, filter by type, and decompress streams for further inspection.
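
To make "decompress streams" concrete: many compressed PDF streams use the FlateDecode filter, which is zlib (deflate) compression. The sketch below is not pdf-parser's own code. It only illustrates, with Python's standard zlib module, what decoding such a stream amounts to. The function name and the sample payload are made up for illustration.

```python
import zlib


def decode_flate_stream(stream_bytes: bytes) -> bytes:
    """Decompress the body of a stream that uses the FlateDecode filter.

    Illustrative helper (hypothetical name, not part of pdf-parser).
    FlateDecode data is zlib-compressed, so zlib.decompress reverses it.
    """
    return zlib.decompress(stream_bytes)


# Simulate a compressed stream body, such as one holding embedded JavaScript.
original = b"app.alert('hello');"
compressed = zlib.compress(original)

decoded = decode_flate_stream(compressed)
print(decoded.decode("ascii"))
```

Real documents may chain several filters or use malformed streams, so a round trip this clean should not be expected in every case.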

CAVEATS

pdf-parser is a Python script and requires a working Python environment to run.
Effective use often requires a basic understanding of PDF document structure due to its low-level nature.
It focuses on structural and content analysis and does not render the PDF's visual appearance.
Processing very large, complex, or heavily obfuscated PDFs can be resource-intensive or slow.

PURPOSE IN SECURITY ANALYSIS

pdf-parser is used in digital forensics and incident response to dissect suspicious PDF files. It helps identify obfuscated content, extract embedded objects (such as JavaScript or executables), and analyze exploits or malware hidden in the document structure.
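
Once content has been extracted from a suspicious file, a common next step is to hash it and compare the digests against known indicators. The following is a minimal sketch using Python's hashlib. The function name and the sample bytes are assumptions for illustration, and the code is independent of pdf-parser's own hash output.

```python
import hashlib


def hash_content(data: bytes) -> dict:
    """Return MD5, SHA-1 and SHA-256 hex digests for extracted content."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Stand-in bytes; in practice these would be read from a dumped stream file.
sample = b"hello"
for name, digest in hash_content(sample).items():
    print(f"{name}: {digest}")
```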

EXTENSIBILITY

Because pdf-parser is a command-line utility implemented in Python, experienced users can inspect or modify its source code to tailor it to specific analysis tasks, or call it from larger automated security analysis workflows.
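
As one possible way to fold pdf-parser into an automated workflow, the sketch below builds a command line and runs it with a timeout. It assumes the tool is installed and reachable as pdf-parser on the PATH (some installations expose it as pdf-parser.py). The function names, the timeout value, and the sample path are hypothetical.

```python
import subprocess


def build_command(pdf_path, object_ids=None, search=None, stats=False):
    """Build an argument list for pdf-parser from a few common options."""
    cmd = ["pdf-parser"]  # assumed executable name; adjust for your install
    if stats:
        cmd.append("--stats")
    if object_ids:
        # Multiple IDs are passed comma-separated, e.g. 1,2,5
        cmd += ["--object", ",".join(str(i) for i in object_ids)]
    if search:
        cmd += ["--search", search]
    cmd.append(pdf_path)
    return cmd


def run_pdf_parser(cmd, timeout=60):
    """Run the command and return its text output.

    The timeout guards against very large or heavily obfuscated files.
    A non-zero exit status raises subprocess.CalledProcessError.
    """
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout, check=True
    )
    return result.stdout


if __name__ == "__main__":
    command = build_command("path/to/file.pdf", object_ids=[1, 2, 5])
    print(" ".join(command))
    # print(run_pdf_parser(command))  # uncomment where pdf-parser is installed
```

Passing the arguments as a list rather than a shell string avoids quoting problems when file names from untrusted sources contain spaces or shell metacharacters.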

HISTORY

pdf-parser was developed by Didier Stevens, a security researcher, as part of his suite of open-source security tools. Written in Python, it has been maintained over time and widely adopted by the security community for deep PDF analysis because of its flexibility and its low-level insight into document internals.