pup
Extract data from HTML at the command line
TLDR
Transform a raw HTML file into a cleaned, indented, and colored format
Filter HTML by element tag name
Filter HTML by ID
Filter HTML by attribute value
Print all text from the filtered HTML elements and their children
Print HTML as JSON
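The tasks listed above can be sketched as the following invocations. The file sample.html and its contents are hypothetical and exist only for illustration; text{} and json{} are pup display functions written at the end of the selector string.

```shell
# Create a small, hypothetical HTML file to experiment with.
cat > sample.html <<'EOF'
<html><body>
<div id="main" class="content">
<h1>Title</h1>
<input type="text" name="q">
<a href="/one">One</a>
</div>
</body></html>
EOF

# Clean up, indent, and colorize the raw HTML.
pup --color < sample.html

# Filter by element tag name.
pup 'a' < sample.html

# Filter by ID.
pup 'div#main' < sample.html

# Filter by attribute value.
pup 'input[type="text"]' < sample.html

# Print all text from the selected elements and their children.
pup 'div text{}' < sample.html

# Print the selected elements as JSON.
pup 'div json{}' < sample.html
```

Quoting the selector in single quotes keeps the shell from interpreting characters such as #, [, ], { and }.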
SYNOPSIS
pup [flags] [selectors] [display function]
pup [flags] --file filename [selectors] [display function]
PARAMETERS
-c, --color
Print the result with syntax highlighting, useful when viewing output on a terminal.
-f, --file
Read HTML from the named file. If omitted, pup reads from standard input.
-h, --help
Display the help message and exit.
-i, --indent
Set the number of spaces, or the character, used to indent the output.
-n, --number
Print the number of elements selected instead of the elements themselves.
-l, --limit
Restrict the number of nesting levels printed.
-p, --plain
Do not escape HTML in the output.
--pre
Preserve preformatted text.
--charset
Specify the character set pup should use when reading the input.
--version
Print the version of pup and exit.
text{}
Display function. Print only the text of the selected elements and their children, without HTML tags.
attr{key}
Display function. Print the value of the named attribute (e.g., href, src) for each selected element.
json{}
Display function. Print the selected elements as a JSON array, with each element represented as an object.
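A few of these options in combination, as a minimal sketch. The file sample.html is hypothetical, and the flag names assume pup 0.4.0; `pup --help` is the authority for the build you have installed.

```shell
# Create a small, hypothetical HTML file to experiment with.
cat > sample.html <<'EOF'
<html><body>
<h1>Title</h1>
<a href="/one">One</a>
<a href="/two">Two</a>
</body></html>
EOF

# Read from a file rather than standard input.
pup --file sample.html 'a'

# Count the matching elements instead of printing them.
pup --file sample.html --number 'a'

# Display functions are written at the end of the selector string.
pup --file sample.html 'a attr{href}'
pup --file sample.html 'h1 text{}'

# List the flags supported by the installed version.
pup --help
```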
DESCRIPTION
pup is a command-line tool for processing HTML, analogous to what jq does for JSON. It reads HTML from standard input or a file, selects elements with CSS selectors, and prints the matches to standard output, either as re-indented HTML or, through a display function, as text (text{}), attribute values (attr{key}), or JSON (json{}). Typical uses are lightweight web scraping and pulling specific values out of HTML files or fetched pages. Because the output is plain text, it can be piped to other command-line utilities for further processing. The CSS selector syntax is already familiar to most developers and system administrators, which keeps common extraction tasks short.
CAVEATS
Parsing Malformed HTML: While pup is robust, extremely malformed or non-standard HTML might not parse as expected, leading to incomplete or incorrect extractions.
Resource Usage: For very large HTML documents, pup might consume significant memory as the entire document is loaded for parsing.
Dynamic Content: pup processes static HTML content. It cannot execute JavaScript or interact with dynamic content loaded after the initial page load, unlike headless browser solutions.
COMMON USAGE PATTERN
pup is frequently used with curl or wget: fetch a page, then pipe its HTML straight into pup for extraction. For example:
curl -s https://example.com | pup 'div.content a attr{href}'
This command fetches example.com, pipes the HTML to pup, selects all <a> tags inside <div> elements with class content, and prints their href attributes.
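The same pattern extends naturally with ordinary text tools. The sketch below uses a hypothetical local file, page.html, in place of a fetched page so it runs without network access; in real use the first stage of the pipeline would be a curl or wget call.

```shell
# A hypothetical page standing in for the output of: curl -s https://example.com
cat > page.html <<'EOF'
<html><body>
<div class="content">
<a href="/docs">Docs</a>
<a href="/blog">Blog</a>
<a href="/docs">Docs again</a>
</div>
<a href="/footer">Footer</a>
</body></html>
EOF

# Extract the href of every link inside div.content, then
# sort the values and drop duplicates.
cat page.html | pup 'div.content a attr{href}' | sort -u
```

Since attr{href} prints one value per line, line-oriented tools such as sort, grep, and wc can consume the output directly.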
SELECTOR SPECIFICITY
pup supports a wide range of CSS selectors, from simple tag names and classes to attribute selectors, pseudo-classes (such as :first-child and :nth-of-type), and combinators (such as > for direct children and + for adjacent siblings), allowing precise targeting of HTML elements within a document.
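The selector kinds mentioned above can be tried out as follows. This is a sketch against a hypothetical file, list.html, whose structure is invented for illustration.

```shell
# Create a small, hypothetical HTML file to experiment with.
cat > list.html <<'EOF'
<html><body>
<ul id="menu"><li class="item">First</li><li class="item">Second</li><li class="item">Third</li></ul>
<div><h2>Heading</h2><p>Adjacent paragraph</p><p>Later paragraph</p></div>
</body></html>
EOF

# Pseudo-class: an <li> that is the first child of its parent.
pup 'li:first-child text{}' < list.html

# Pseudo-class: the second <li> among siblings of the same type.
pup 'li:nth-of-type(2) text{}' < list.html

# Child combinator: <li> elements that are direct children of #menu.
pup '#menu > li text{}' < list.html

# Adjacent sibling combinator: a <p> immediately following an <h2>.
pup 'h2 + p text{}' < list.html

# Attribute selector: elements whose class attribute equals "item".
pup '[class="item"] text{}' < list.html
```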
HISTORY
pup was created by Eric Chiang and is written in Go. It was designed as a lightweight, self-contained tool for HTML processing that fits into command-line workflows, taking jq for JSON as its model. Go makes cross-platform compilation and deployment straightforward, so pup is distributed as a single binary and can parse HTML without the overhead of a scripting-language runtime or a full browser environment.