pup
parse and query HTML at the command line using CSS selectors
TLDR
SYNOPSIS
pup [options] ['selectors [display-function]']
DESCRIPTION
pup is the HTML counterpart to jq: it reads an HTML document from stdin (or a file via `-f`), applies a CSS-style selector to filter elements, and optionally runs a display function (`text{}`, `attr{…}`, `json{}`, `slice{…}`) to project matches into the form you want. It is a single static Go binary with no runtime dependencies, which makes it well suited to scraping pipelines and Makefiles.
Because it understands most of CSS3 (including common pseudo-classes), many scraping problems reduce to a single pipe: `curl | pup 'selector json{}' | jq`.
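The stdin, selector, display-function flow described above can be sketched as follows. The sample HTML, the variable names, and the `command -v` guard are illustrative assumptions, not part of pup itself; any HTML on stdin is handled the same way.

```shell
# Hypothetical sample document used only for illustration.
html='<html><body><h1 class="title">Hello</h1><a href="/docs">Docs</a></body></html>'

if command -v pup >/dev/null 2>&1; then
  # Selector plus text{}: text content of the matching <h1>.
  heading=$(printf '%s' "$html" | pup 'h1.title text{}')
  # Selector plus attr{href}: href value of every matching <a>.
  link=$(printf '%s' "$html" | pup 'a attr{href}')
  printf 'heading: %s\nlink: %s\n' "$heading" "$link"
else
  echo "pup is not installed; skipping example" >&2
fi
```

In a real pipeline the `printf` would typically be replaced by `curl -s URL`, with the result piped on to further tools.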
PARAMETERS
-f, --file FILE
Read HTML from FILE instead of stdin.
-c, --color
Colorize output.
-p, --plain
Do not HTML-escape the output.
--pre
Preserve whitespace (useful inside `<pre>`/`<code>`).
-i, --indent N|CHAR
Indent by N spaces (or by the given character).
-l, --limit N
Limit output nesting depth to N levels.
-n, --number
Print the number of matching elements instead of the elements themselves.
--charset ENCODING
Force input character encoding (default: auto-detect).
-h, --help
Show help.
--version
Show version.
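As a rough illustration of two of the options above, the sketch below combines `-f` (read from a file) with `-n` (print a count instead of the elements). The temporary file and its contents are made up for the example, and the guard simply skips the run when pup is not on the PATH.

```shell
# Hypothetical input file created just for this example.
tmp=$(mktemp)
printf '%s\n' '<ul><li>one</li><li>two</li><li>three</li></ul>' > "$tmp"

if command -v pup >/dev/null 2>&1; then
  # -f reads the file instead of stdin; -n prints how many elements matched.
  count=$(pup -f "$tmp" -n 'li')
  echo "matching <li> elements: $count"
else
  echo "pup is not installed; skipping example" >&2
fi

rm -f "$tmp"
```

Counting with `-n` is a cheap way to check that a selector matches what you expect before adding a display function.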
SELECTORS AND DISPLAY FUNCTIONS
CSS selectors
Standard CSS syntax: `div.class`, `#id`, `a[href^="http"]`, `ul > li:first-child`, `tr:nth-child(even)`, etc. Selectors separated by spaces act as descendant selectors, so they can be chained to walk into nested structures.
text{}
Emit the text content (depth-first) of each matching element.
attr{NAME}
Emit the value of the NAME attribute of each matching element.
json{}
Emit matching elements as a JSON array of objects with `tag`, `text`, and `children` keys, plus one key per HTML attribute.
slice{N} / slice{N:M}
Return only the Nth (or the Nth through (M-1)th) of the matching elements.
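A minimal sketch of chaining selectors and ending with a display function, assuming a small made-up navigation snippet. The child combinator `>` and the descendant space narrow the match to links inside the list, so the stray link outside it is not reported.

```shell
# Hypothetical markup: two links inside ul.nav and one outside it.
html='<ul class="nav"><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul><a href="/c">C</a>'

if command -v pup >/dev/null 2>&1; then
  # ul.nav > li a: links nested under direct <li> children of ul.nav.
  # attr{href} prints one attribute value per matching element.
  links=$(printf '%s' "$html" | pup 'ul.nav > li a attr{href}')
  printf '%s\n' "$links"

  # The same selection as structured data, ready for jq or similar tools.
  printf '%s' "$html" | pup 'ul.nav > li a json{}'
else
  echo "pup is not installed; skipping example" >&2
fi
```

Swapping `attr{href}` for `text{}` on the same selector would yield the link labels rather than the targets.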
CAVEATS
pup reads the whole input into memory, so it is not suitable for multi-gigabyte HTML. The last upstream release was in 2016; the `eiriklv/pup` fork and several alternative tools (`htmlq`, `xq`, `xidel`) provide newer features such as XPath or CSS4 selectors.
HISTORY
pup was written by Eric Chiang in Go. Its syntax is explicitly modelled after jq: the same "query string + optional display function" mental model, applied to HTML.
