LinuxCommandLibrary

pup

Extract data from HTML at the command line

TLDR

Transform a raw HTML file into a cleaned, indented, and colored format
$ cat [index.html] | pup --color

Filter HTML by element tag name
$ cat [index.html] | pup '[tag]'

Filter HTML by ID
$ cat [index.html] | pup '[div#id]'

Filter HTML by attribute value
$ cat [index.html] | pup '[input[type="text"]]'

Print all text from the filtered HTML elements and their children
$ cat [index.html] | pup '[div] text{}'

Print HTML as JSON
$ cat [index.html] | pup '[div] json{}'

SYNOPSIS

pup [flags] [selectors] [display function]
pup [flags] --file filename [selectors] [display function]

PARAMETERS

-i, --indent
    Number of spaces to use for indentation, or a character to indent with.

-n, --number
    Print the number of elements selected instead of the elements themselves.

-c, --color
    Enable syntax highlighting for HTML output, useful for viewing on terminals.

-p, --plain
    Do not escape HTML in the output.

-l, --limit
    Restrict the number of levels printed.

--charset
    Specify the character set for pup to detect.

text{}
    Display function, written after the selectors. Prints the text of the selected elements and their children.

attr{name}
    Display function, written after the selectors. Prints the value of the named attribute (e.g., attr{href}, attr{src}) for each selected element.

json{}
    Display function, written after the selectors. Prints the selected elements as a JSON array, with each element represented as an object.

--pre
    Preserve preformatted text, such as the contents of <pre> elements, in the output.

-f, --file
    Read HTML from the specified file instead of standard input.

--version
    Print the version of pup.

-h, --help
    Display the help message and exit.
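
The flags can be combined with any selector. The following is a minimal sketch, assuming the flag set printed by pup --help in pup 0.4.0 (-n, -i, -l, -f); the sample HTML and the temporary file are hypothetical and exist only for illustration.

```shell
# Hypothetical sample input used only for illustration.
html='<ul><li>one</li><li>two</li><li>three</li></ul>'

# -n / --number: print how many elements matched instead of the elements.
printf '%s' "$html" | pup -n 'li'

# -i / --indent: indent nested output with 4 spaces per level.
printf '%s' "$html" | pup -i 4 'ul'

# -l / --limit: restrict how many levels of nesting are printed.
printf '%s' "$html" | pup -l 1 'ul'

# -f / --file: read from a file instead of standard input.
tmp=$(mktemp)
printf '%s' "$html" > "$tmp"
pup -f "$tmp" 'li text{}'
rm -f "$tmp"
```

If a flag behaves differently on your system, check pup --help for the installed version.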

DESCRIPTION

pup is a command-line tool for processing HTML, analogous to what jq does for JSON. It reads HTML from standard input or from a file, selects elements with CSS selectors, and prints the matched elements as cleaned, indented HTML. Display functions written after the selectors change what is printed: text{} prints text content, attr{name} prints attribute values, and json{} prints a JSON representation. Typical uses are lightweight web scraping and pulling specific values out of HTML files or fetched pages. pup does not edit documents in place; it writes to standard output, so its results can be redirected to a file or piped to other command-line utilities.
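
The three display functions can be compared on the same input. This is a small sketch rather than a reference: the HTML fragment in the html variable is made up for illustration, and the json{} output is left unannotated because its exact layout depends on the pup version.

```shell
# Hypothetical sample document used only for illustration.
html='<div id="main">
  <a class="ext" href="https://example.com/a">First</a>
  <a href="/b">Second</a>
</div>'

# Text of every link inside div#main, one line per link.
printf '%s\n' "$html" | pup 'div#main a text{}'

# Value of the href attribute of every link inside div#main.
printf '%s\n' "$html" | pup 'div#main a attr{href}'

# The same links as a JSON array, one object per element.
printf '%s\n' "$html" | pup 'div#main a json{}'
```

Without a display function, the same selector prints the matched <a> elements as indented HTML.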

CAVEATS

Parsing Malformed HTML: While pup is robust, extremely malformed or non-standard HTML might not parse as expected, leading to incomplete or incorrect extractions.
Resource Usage: For very large HTML documents, pup might consume significant memory as the entire document is loaded for parsing.
Dynamic Content: pup processes static HTML content. It cannot execute JavaScript or interact with dynamic content loaded after the initial page load, unlike headless browser solutions.

COMMON USAGE PATTERN

pup is frequently combined with curl or wget: fetch a page, then pipe its HTML into pup for extraction. For example:
curl -s https://example.com | pup 'div.content a attr{href}'
This command fetches example.com, selects all <a> elements inside elements with class content, and prints their href attributes. The display function attr{href} follows the selectors.
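
Because pup writes plain lines or JSON to standard output, the result can be post-processed with ordinary tools. The sketch below uses an inline, made-up HTML fragment in place of a network fetch so that it runs offline; the second pipeline assumes jq is installed and that json{} exposes each attribute as a key on the element's object.

```shell
# Hypothetical page content; in practice this would come from: curl -s "$url"
html='<div class="content"><a href="/docs">Docs</a><a href="/blog">Blog</a><a href="/docs">Docs again</a></div>'

# Extract href values, then de-duplicate them with sort.
printf '%s' "$html" | pup 'div.content a attr{href}' | sort -u

# Emit JSON and let jq pick the fields of interest (tab-separated text and href).
printf '%s' "$html" | pup 'div.content a json{}' | jq -r '.[] | "\(.text)\t\(.href)"'
```

The JSON route is the more robust choice when several attributes of the same element are needed together, since line-based output loses the pairing between them.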

SELECTOR SPECIFICITY

pup supports a wide range of CSS selectors, from simple tag names and classes to more complex attribute selectors, pseudo-classes (like :first-child, :nth-of-type), and combinators (e.g., > for direct children, + for adjacent siblings), allowing for precise targeting of HTML elements within a document.
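
A few of these selector forms, sketched against a made-up fragment. The HTML is deliberately written without whitespace between elements so that sibling and child positions are unambiguous; the element names, ids, and classes are all hypothetical.

```shell
# Hypothetical sample markup, assembled without whitespace between elements.
html='<ul id="nav">'
html="$html"'<li><a href="/one">One</a></li>'
html="$html"'<li class="active"><a href="/two">Two</a></li>'
html="$html"'<li><a href="/three">Three</a></li>'
html="$html"'</ul><p>Intro</p><p>Details</p>'

# Class selector: the link inside the list item with class "active".
printf '%s' "$html" | pup 'li.active a text{}'

# Child combinator plus pseudo-class: the link in the first <li> directly under ul#nav.
printf '%s' "$html" | pup 'ul#nav > li:first-child a text{}'

# Positional pseudo-class: the href of the link in the second <li>.
printf '%s' "$html" | pup 'li:nth-of-type(2) a attr{href}'

# Attribute selector: the link whose href is exactly "/three".
printf '%s' "$html" | pup 'a[href="/three"] text{}'

# Adjacent sibling combinator: a <p> that immediately follows another <p>.
printf '%s' "$html" | pup 'p + p text{}'
```

The combinators > and + are written with spaces on both sides, as separate tokens between the selectors they join.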

HISTORY

pup was created by Eric Chiang and is written in Go. It was inspired by jq and aims to be a lightweight, self-contained tool for HTML processing that fits into command-line workflows. Because Go programs compile to a single binary per platform, pup is easy to install and run without a scripting-language runtime or a browser environment.

SEE ALSO

jq(1), xmlstarlet(1), grep(1), awk(1), sed(1), curl(1)
