pdfdetach

Extract embedded files from a PDF

TLDR

List all attachments in a file with a specific text encoding

$ pdfdetach -list -enc [UTF-8] [path/to/input.pdf]

Save a specific embedded file by specifying its number

$ pdfdetach -save [number] [path/to/input.pdf]

Save a specific embedded file by specifying its name

$ pdfdetach -savefile [name] [path/to/input.pdf]

Save the embedded file with a custom output filename

$ pdfdetach -save [number] -o [path/to/output] [path/to/input.pdf]

Save the attachment from a file secured by owner/user password

$ pdfdetach -save [number] [-opw|-upw] [password] [path/to/input.pdf]

SYNOPSIS

pdfdetach [options] <PDF-file>
Common options include listing attachments, saving all, or saving a specific attachment.

-list
    Lists all embedded files in the specified PDF document, one per line, each with its number and name. File names are converted to the text encoding selected with -enc.

-saveall
    Saves all embedded files found in the PDF to the directory given with -o, or to the current working directory if -o is omitted.

-save file_number
    Saves the embedded file with the given file_number (as shown in the -list output). By default the file is written to the current directory under its embedded name; use -o to choose a different output file name.

-o path
    Sets the output file name when used with -save, or the output directory when used with -saveall. If omitted, files are saved in the current directory under their embedded names.

-savefile name
    Saves the embedded file whose name matches the given name (as shown in the -list output).

-enc encoding
    Sets the text encoding used for file names in the -list output, for example UTF-8.

-opw password
    Specifies the owner password for a protected PDF file.

-upw password
    Specifies the user password for a protected PDF file.

PDF-file
    The path to the input PDF document from which attachments are to be extracted.

-v
    Prints the copyright and version information for pdfdetach.

-h
    Prints a concise usage summary and available options.
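
The -save and -o options above can be combined in a small wrapper. The following is a minimal sketch, not part of pdfdetach itself: the function name save_attachment and the return value 64 for a bad argument are hypothetical choices made for illustration. It rejects anything that is not a positive integer before calling pdfdetach.

```sh
# save_attachment PDF NUMBER OUTFILE
# Hypothetical helper: validate NUMBER, then save that attachment to OUTFILE.
save_attachment() {
    pdf=$1
    number=$2
    out=$3

    # Reject empty strings and anything containing a non-digit.
    case "$number" in
        ''|*[!0-9]*)
            echo "attachment number must be a positive integer" >&2
            return 64   # arbitrary "usage error" value chosen for this sketch
            ;;
    esac

    # Attachment numbers start at 1.
    if [ "$number" -lt 1 ]; then
        echo "attachment number must be a positive integer" >&2
        return 64
    fi

    # With -save, -o names the output file.
    pdfdetach -save "$number" -o "$out" "$pdf"
}
```

A call such as save_attachment input.pdf 2 second.bin would then run pdfdetach -save 2 -o second.bin input.pdf.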

DESCRIPTION

pdfdetach is a command-line utility that ships with the Poppler PDF rendering library. It extracts embedded file attachments from Portable Document Format (PDF) files. An attachment can be any type of file (e.g., documents, spreadsheets, images, archives) that the PDF's creator embedded in the document. Users can list the embedded files with their numbers and names, then extract one attachment or all of them to the local filesystem. This makes the tool useful for automated processing of PDFs that serve as containers for other data, since the attachments can be reached without a graphical PDF viewer. Options control the output location and select an attachment by number or by name.
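
The list-then-extract workflow can be scripted by parsing the -list output. The sketch below assumes that output consists of a summary line followed by lines of the form "<number>: <name>"; check this against your installed version before relying on it. The function name list_attachment_names is hypothetical.

```sh
# list_attachment_names
# Hypothetical helper: read `pdfdetach -list` output on stdin and print one
# attachment name per line.
# Assumption: entries look like "<number>: <name>"; other lines are ignored.
list_attachment_names() {
    sed -n 's/^[0-9][0-9]*: //p'
}

# Example use (requires pdfdetach and a real PDF):
#   pdfdetach -list -enc UTF-8 input.pdf | list_attachment_names
```

Because only lines that start with a number and a colon are printed, the summary line is dropped without special handling.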

CAVEATS

pdfdetach targets embedded file attachments only. It does not extract other content such as images placed on pages, text, or multimedia streams that are not file attachments. Extracted filenames come from the PDF's internal metadata, so they may be generic or need sanitizing before use on some filesystems. Selecting an attachment with -save requires its number from a prior -list run; -savefile takes the attachment's name instead.
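
One way to handle untrusted embedded names is to sanitize them and pass the result to -o. This is a minimal sketch under stated assumptions: the function name sanitize_name, the allowed character set, and the fallback name "attachment" are all illustrative choices, not pdfdetach behavior.

```sh
# sanitize_name NAME
# Hypothetical helper: print a conservative, filesystem-safe version of NAME.
sanitize_name() {
    # Drop any directory components so the name cannot escape the target dir.
    base=${1##*/}

    # Replace every byte outside A-Z, a-z, 0-9, '.', '_' and '-' with '_'.
    # Non-ASCII characters are replaced byte by byte.
    clean=$(printf '%s' "$base" | tr -c 'A-Za-z0-9._-' '_')

    # Fall back to a fixed name when nothing usable is left.
    case "$clean" in
        ''|.|..) clean=attachment ;;
    esac

    printf '%s\n' "$clean"
}

# Example use (requires pdfdetach and a real PDF):
#   pdfdetach -save 1 -o "out/$(sanitize_name "$name")" input.pdf
```

The character set is deliberately narrow; widen it if the target filesystem is known to accept more.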

OUTPUT FILENAME GENERATION

When extracting files, pdfdetach attempts to use the original filename embedded within the PDF. If an original filename is not available or suitable, it may generate a generic filename, often in the format "attNNNNN.dat", where NNNNN is the attachment's index. With -save, the -o option replaces the embedded name with one of your choosing. Users should inspect the output directory to confirm the generated names.
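
To confirm which names were actually written, one option is to extract into a dedicated directory and list it afterwards. The sketch below uses a hypothetical function name, extract_all, and assumes -o is treated as a directory when combined with -saveall.

```sh
# extract_all PDF DIR
# Hypothetical helper: save every attachment of PDF into DIR, then print the
# names of the files found there, one per line.
extract_all() {
    pdf=$1
    dir=$2

    # Create the target directory if needed; stop on failure.
    mkdir -p "$dir" || return

    # Extract everything; propagate pdfdetach's exit status on failure.
    pdfdetach -saveall -o "$dir" "$pdf" || return

    # Show what ended up on disk.
    ls -1 "$dir"
}
```

Using an empty directory for each run keeps the listing limited to files produced by that extraction.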

EXIT STATUS

pdfdetach returns an exit status of 0 on success and a non-zero status on error, so shell scripts can test the result. The Poppler man page lists these codes:

0   No error.
1   Error opening a PDF file.
2   Error opening an output file.
3   Error related to PDF permissions.
99  Other error.
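
In a script, the exit status can be turned into a readable message. The following sketch assumes the exit codes documented in the Poppler man page (0, 1, 2, 3, 99); other builds may differ. The function name describe_pdfdetach_status is hypothetical.

```sh
# describe_pdfdetach_status STATUS
# Hypothetical helper: map a pdfdetach exit status to a short description.
# Assumption: codes follow the Poppler man page; anything else is reported
# as unexpected.
describe_pdfdetach_status() {
    case "$1" in
        0)  echo "no error" ;;
        1)  echo "error opening the PDF file" ;;
        2)  echo "error opening an output file" ;;
        3)  echo "error related to PDF permissions" ;;
        99) echo "other error" ;;
        *)  echo "unexpected exit status $1" ;;
    esac
}

# Example use (requires pdfdetach and a real PDF):
#   pdfdetach -saveall -o out input.pdf
#   status=$?
#   [ "$status" -eq 0 ] ||
#       echo "pdfdetach failed: $(describe_pdfdetach_status "$status")" >&2
```

Capturing $? immediately after the pdfdetach call matters, since any later command overwrites it.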

HISTORY

pdfdetach is part of the Poppler utilities suite. Poppler began in 2005 as a free software fork of the Xpdf PDF viewer and toolkit, with the aim of providing better support for free software environments and continuous development. Within the suite, pdfdetach provides command-line access to PDF file attachments for automated document processing workflows.