pdfimages

Extract images from PDF documents

TLDR

Extract all images from a PDF file and save them as PNGs
$ pdfimages -png [path/to/file.pdf] [filename_prefix]

Extract images from pages 3 to 5
$ pdfimages -f [3] -l [5] [path/to/file.pdf] [filename_prefix]

Extract images from a PDF file and include the page number in the output filenames
$ pdfimages -p [path/to/file.pdf] [filename_prefix]

List information about all the images in a PDF file
$ pdfimages -list [path/to/file.pdf]

SYNOPSIS

pdfimages [options] <PDF-file> <image-root-name>

PARAMETERS

-f <page>
    Specifies the first page to scan.

-l <page>
    Specifies the last page to scan.

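-j
    Write images stored as JPEG to .jpg files, identical to the data in the PDF, instead of converting them to the default format.

-jp2
    Write images stored as JPX (JPEG2000) to .jp2 files, identical to the data in the PDF.

-jbig2
    Write images stored as JBIG2 as is, without decoding them.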

-all
    Write JPEG, JPEG2000, JBIG2, and CCITT images in their native format, CMYK images as TIFF, and all other images as PNG (equivalent to -png -tiff -j -jp2 -jbig2 -ccitt).

-png
    Change the default output format to PNG.

-tiff
    Change the default output format to TIFF.

-list
    List images and their properties without extracting them.

-opw <password>
    Specify the owner password for encrypted files.

-upw <password>
    Specify the user password for encrypted files.

-ccitt
    Write CCITT fax images as is; the decoding parameters are saved in a separate .params file.

-p
    Include the page number in output filenames.

-q
    Suppress messages and errors.
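
The -list option is convenient for a quick inventory before extracting anything. The following is a minimal sketch, assuming the listing starts with a column-header line and a dashed separator line and that the first column is the page number; images_per_page is a hypothetical helper name, not part of poppler-utils.

```shell
# Count images per page from `pdfimages -list` output.
# images_per_page is a hypothetical helper, not a poppler-utils command.
images_per_page() {
    # Skip the two header lines, tally rows by page number (column 1),
    # then print "page count" pairs in ascending page order.
    pdfimages -list "$1" |
        awk 'NR > 2 { n[$1]++ } END { for (p in n) print p, n[p] }' |
        sort -n
}
```

Calling images_per_page [path/to/file.pdf] prints one line per page that contains images. Every row of the listing is counted, so mask entries, if the listing reports them as separate rows, are included in the totals.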

DESCRIPTION

pdfimages is a command-line utility for extracting raster images (bitmaps) from Portable Document Format (PDF) files. It is part of the poppler-utils package, which provides a set of tools built on the Poppler PDF rendering library. This command is particularly useful for designers, developers, or anyone needing to reuse graphical content embedded within a PDF document without resorting to screen captures or opening the document in a specific editor.

PDF files can store images in several encodings, including JPEG, JPEG2000, JBIG2, CCITT, and raw bitmap data. By default, pdfimages decodes every image and writes it in a Netpbm format: PPM for non-monochrome images and PBM for monochrome (1-bit) ones. The -png and -tiff options change that default output format. To keep the stored data untouched, pass -j, -jp2, -jbig2, or -all, which write images in those encodings to files in their native format without re-encoding them.

Options control the page range, the output image type, and the passwords used to open encrypted PDF files. Each extracted image is named from a user-supplied root name, followed by a sequence number and the appropriate file extension.

CAVEATS

pdfimages primarily extracts raster images. It cannot extract vector graphics (like those created in Adobe Illustrator or CAD programs) or text objects directly as editable formats. Such elements, if not explicitly rasterized within the PDF, would typically require a different tool (e.g., pdftocairo for SVG output, or pdftotext for text).

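Images that are not written in their native encoding (JPEG, JPX, JBIG2) are decoded and saved as PBM/PPM, PNG, or TIFF. These formats are lossless, so the conversion itself does not degrade the image, but the files can be much larger than the compressed data stored in the PDF. pdfimages does not resample: each image is written at its stored pixel dimensions, regardless of the size, rotation, or clipping applied to it on the page. Content that a viewer shows as a single picture is not always a single image object; it may be vector artwork or several image tiles, and pdfimages extracts only what is stored as an image.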
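
For pages whose content is vector artwork or text, the alternatives mentioned above can be scripted in the same way. This is a sketch under stated assumptions: page_to_svg and page_to_text are hypothetical helper names, and each call is restricted to a single page with -f and -l.

```shell
# Hypothetical helpers; the names are illustrative, not part of poppler-utils.

# Render one page as SVG so that vector artwork stays vector.
page_to_svg() {
    pdf=$1 page=$2 out=$3
    pdftocairo -svg -f "$page" -l "$page" "$pdf" "$out"
}

# Dump the text of one page, keeping the physical layout.
page_to_text() {
    pdf=$1 page=$2 out=$3
    pdftotext -layout -f "$page" -l "$page" "$pdf" "$out"
}
```

For example, page_to_svg [path/to/file.pdf] 2 page2.svg writes the second page as an SVG file, which can then be opened in a vector editor.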

OUTPUT NAMING CONVENTION

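pdfimages names each extracted file <image-root-name>-NNN.<ext>, where NNN is a zero-padded three-digit image number starting at 000 and <ext> is the appropriate file extension (e.g., root-name-000.jpg, root-name-001.png). With -p, the page number is inserted before the image number: <image-root-name>-PPP-NNN.<ext>.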
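
Because every output file shares the root name, a glob on that root is enough to collect the results of a run. The sketch below tallies the extracted files by extension; count_by_ext is a hypothetical helper name, and it assumes only that the files are named root, hyphen, number, dot, extension.

```shell
# Tally extracted files by extension for a given image root.
# count_by_ext is a hypothetical helper, not a poppler-utils command.
count_by_ext() {
    root=$1
    for f in "$root"-*.*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        printf '%s\n' "${f##*.}"     # keep only the text after the last dot
    done | sort | uniq -c
}
```

After running pdfimages -all [path/to/file.pdf] out/img, calling count_by_ext out/img prints one line per extension with the number of files found.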

DEFAULT OUTPUT FORMATS

If no output format option (-png, -tiff, -j, -jp2, -jbig2, or -all) is given, pdfimages writes every image as PPM (non-monochrome) or PBM (monochrome), including images that are stored as JPEG, JPX, or JBIG2. Native extraction happens only when the matching option is given; -png and -tiff replace the PPM/PBM default for the images that are still converted.
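
The choice of output format comes down to which flag, if any, is passed. A minimal sketch of a wrapper that makes the choice explicit follows; extract_images and its mode names (native, png, tiff, netpbm) are invented for this example and are not pdfimages options.

```shell
# Map a friendly mode name to the corresponding pdfimages format flag.
# extract_images and the mode names are hypothetical.
extract_images() {
    mode=$1 pdf=$2 root=$3
    case $mode in
        native) flag=-all  ;;   # keep natively extractable data as stored
        png)    flag=-png  ;;
        tiff)   flag=-tiff ;;
        netpbm) flag=      ;;   # no format flag: default Netpbm output
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    # $flag is left unquoted on purpose so an empty value adds no argument.
    pdfimages $flag "$pdf" "$root"
}
```

For example, extract_images native [path/to/file.pdf] [filename_prefix] runs pdfimages with -all, while the netpbm mode runs it with no format option at all.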

HISTORY

pdfimages is part of the Poppler utilities, which are derived from Xpdf, a PDF viewer and toolkit originally developed by Derek Noonburg. Poppler was forked from the Xpdf code base in 2005 and is hosted on freedesktop.org, where it remains under active development. The fork set out to make Xpdf's rendering code available as a shared library that other applications could reuse. pdfimages has shipped with both Xpdf and Poppler since their early versions.

SEE ALSO

pdftotext(1), pdftocairo(1), pdffonts(1), pdfinfo(1), pdftoppm(1), pdftopng(1)
