
xidel

Extract or transform data from HTML/XML/JSON using CSS selectors, XPath/XQuery, JSONiq, or pattern matching

TLDR

Print all URLs found by a Google search

$ xidel [https://www.google.com/search?q=test] [[-e|--extract]] "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"

Print the title of all pages found by a Google search and download them
$ xidel [https://www.google.com/search?q=test] [[-f|--follow]] "[//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']]" [[-e|--extract]] [//title] --download ['{$host}/']

Follow all links on a page and print the titles, with XPath
$ xidel [https://example.org] [[-f|--follow]] [//a] [[-e|--extract]] [//title]

Follow all links on a page and print the titles, with CSS selectors
$ xidel [https://example.org] [[-f|--follow]] "[css('a')]" --css [title]

Follow all links on a page and print the titles, with pattern matching
$ xidel [https://example.org] [[-f|--follow]] "[<a>{.}</a>*]" [[-e|--extract]] "[<title>{.}</title>]"

Read the pattern from example.xml (this also checks that an element containing "ood" exists, and fails otherwise)
$ xidel [path/to/example.xml] [[-e|--extract]] "[<x><foo>ood</foo><bar>{.}</bar></x>]"

Print the newest Stack Overflow questions (title and URL) using pattern matching on their RSS feed
$ xidel [http://stackoverflow.com/feeds] [[-e|--extract]] "[<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+]"

Check for unread Reddit mail: web scraping that combines CSS, XPath, JSONiq, and automatic form evaluation
$ xidel [https://reddit.com] [[-f|--follow]] "[form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})]" [[-e|--extract]] "[css('#mail')/@title]"

SYNOPSIS

xidel [OPTIONS] [URL | FILE...] [EXPRESSION...]


Common usage examples:

xidel URL -e EXPR
xidel FILE -s
xidel -e EXPR INPUT...
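
xidel can also read from standard input; a minimal sketch, assuming this build accepts - for stdin (verify with xidel --help):

$ curl -s https://example.org | xidel -s - -e "//title"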

PARAMETERS

-e, --extract expression
    Specifies the XPath, XQuery, or JSONiq expression to evaluate and extract data.

-s, --silent
    Suppresses progress and status messages, useful for scripting.

-f, --follow
    Follows links recursively; the recursion depth can be controlled from within the query. Use with caution to avoid excessive requests.

-o, --output file
    Writes the output to the specified file instead of standard output.

-H, --header 'Header: Value'
    Adds a custom HTTP header to the request.

-u, --user-agent string
    Sets the User-Agent HTTP header sent with the request.

-d, --data data
    Sends the specified data in a POST request body. Can be used multiple times.

--method method
    Specifies the HTTP method (e.g., GET, POST, PUT) for the request.

-v, --variable name=value
    Defines an external variable accessible within the XQuery or JSONiq expression.

--output-format format
    Specifies the output format of the extracted data (e.g., adhoc, xml, html, json-wrapped, bash); see the combined example at the end of this section.

--xpath
    Forces interpretation of the expression as XPath.

--xquery
    Forces interpretation of the expression as XQuery.

--jsoniq
    Forces interpretation of the expression as JSONiq.

--download file
    Saves the retrieved document to the given file name (a name ending in /, such as '{$host}/', saves into that directory), similar to wget.

--connect-timeout seconds
    Sets a maximum time (in seconds) for the connection to be established.
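
A sketch combining several of the options above; example.org is a placeholder and the header and User-Agent values are purely illustrative:

$ xidel [https://example.org] --silent --user-agent 'Mozilla/5.0 (compatible; xidel)' --header 'Accept-Language: en' --extract '//h1/normalize-space()' --output-format json-wrapped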

DESCRIPTION

xidel is a versatile command-line tool designed for parsing, transforming, and extracting data from HTML, XML, and JSON documents. It uniquely combines the capabilities of a web downloader (like wget or curl) with powerful querying languages: XPath 3.1, XQuery 3.1, and JSONiq. This allows users to download web pages and then immediately apply sophisticated queries to extract specific data, such as links, text content, or structured information from tables.
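
As a sketch of this fetch-and-query workflow (the URL is a placeholder; the XPath is standard), the following prints the text and target of every link on a page in one step:

$ xidel -s [https://example.org] -e '//a/concat(normalize-space(.), " -> ", @href)'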

Beyond simple data extraction, xidel supports complex operations including form submission and session handling across multiple requests. Its ability to process multiple inputs, execute external XQuery/JSONiq modules, and format output in various ways makes it an invaluable asset for web scraping, data automation, and general document processing, bridging the gap between raw data fetching and structured data manipulation.
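
Form submission uses the form() function shown in the TLDR section above; a minimal sketch with a placeholder URL and hypothetical field names:

$ xidel [https://example.org/login] -f "form(//form[1], {'user': 'alice', 'passwd': 'secret'})" -e '//h1'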

CAVEATS

xidel does not execute client-side JavaScript, meaning it cannot interact with or render dynamic content that relies heavily on JavaScript execution. For such dynamic web pages, alternative tools like headless browsers (e.g., Puppeteer, Selenium) are required. Its powerful query languages, XPath, XQuery, and JSONiq, can have a steep learning curve for new users. Additionally, recursive fetching with -f should be used with caution as it can quickly consume bandwidth and system resources, potentially leading to rate limiting or even IP blocking.

LANGUAGE SUPPORT

xidel provides comprehensive support for XPath 3.1, XQuery 3.1, and JSONiq, enabling highly precise and powerful data extraction and manipulation capabilities. This makes it suitable for complex transformations beyond simple text matching or regular expressions.
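
To illustrate, the same style of extraction can be phrased in each language. A hedged sketch: xidel exposes parsed JSON input as $json, which the XPath 3.1 lookup operator (?) can navigate; the sample data is inline:

$ echo '{"users": [{"name": "ann"}, {"name": "bob"}]}' | xidel -s - -e '$json?users?*?name'

The same kind of link listing as an XQuery 3.1 FLWOR expression:

$ xidel -s [https://example.org] --xquery 'for $a in //a return concat($a, " -> ", $a/@href)'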

HANDLING MULTIPLE INPUTS

xidel can process multiple URLs or local files sequentially or concurrently, allowing for batch processing of documents or websites. This is often combined with recursive fetching or iterative scripting, making it efficient for large-scale data collection.
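
A minimal sketch, assuming the expression given with -e is applied to each input in turn (file names are placeholders):

$ xidel -s [page1.html] [page2.html] [page3.html] -e '//title'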

HISTORY

xidel was originally developed by Benito van der Zander and first released in the early 2010s. It was designed to fill a niche by combining robust HTML/XML/JSON parsing with the expressive power of W3C standard query languages like XPath and XQuery directly from the command line. Unlike other tools that might require chaining curl or wget with xmllint or custom scripts, xidel aimed to provide an all-in-one solution for sophisticated web data extraction and transformation. Its development has continued, adding support for newer standards such as JSONiq and improving HTML5 parsing, keeping it relevant for modern web scraping and data processing tasks.

SEE ALSO

curl(1), wget(1), xmllint(1), jq(1), pup(1)
