xidel
Extract data from HTML/XML using CSS/XPath
TLDR
Print all URLs found by a Google search
Print the title of all pages found by a Google search and download them
Follow all links on a page and print the titles, with XPath
Follow all links on a page and print the titles, with CSS selectors
Follow all links on a page and print the titles, with pattern matching
Read the pattern from example.xml (which will also check if the element containing "ood" is there, and fail otherwise)
Print all newest Stack Overflow questions with title and URL using pattern matching on their RSS feed
Check for unread Reddit mail: web scraping that combines CSS, XPath, JSONiq, and automatic form evaluation
SYNOPSIS
xidel [OPTIONS] [URL | FILE...] [EXPRESSION...]
Common usage examples:
xidel URL -e EXPR
xidel FILE -s
xidel -e EXPR INPUT...
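The usage forms above can be tried offline against a small local file. The following is a minimal sketch, not a reference invocation: `sample.html` and its contents are made up for illustration, and the script skips the xidel calls when the binary is not on the PATH.

```shell
#!/bin/sh
# Create a tiny sample page (hypothetical content, for illustration only).
workdir=$(mktemp -d)
cat > "$workdir/sample.html" <<'EOF'
<html><head><title>Sample page</title></head>
<body><h1>Hello</h1><a href="a.html">A</a></body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # Input first, then the expression (xidel FILE -e EXPR).
  xidel "$workdir/sample.html" -s -e '//title'
  # Expression first, then the input (xidel -e EXPR INPUT).
  xidel -s -e '//h1' "$workdir/sample.html"
else
  echo "xidel not found; skipping the example calls" >&2
fi
```

Both calls pass `-s` so that only the extracted values reach standard output, which is the form most convenient for scripts.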
PARAMETERS
-e, --extract expression
Specifies the XPath, XQuery, or JSONiq expression to evaluate and extract data.
-s, --silent
Suppresses progress and status messages, useful for scripting.
-f, --follow expression
Follows the links selected by the given expression (XPath, CSS, or pattern) and applies the subsequent extractions to the pages they point to. Use with caution to avoid excessive requests.
-o, --output file
Writes the output to the specified file instead of standard output.
-H, --header 'Header: Value'
Adds a custom HTTP header to the request.
-u, --user-agent string
Sets the User-Agent HTTP string for the request.
-d, --data data
Sends specified data in a POST request body. Can be used multiple times.
-X, --request method
Specifies the HTTP method (e.g., GET, POST, PUT) for the request.
-v, --variable name=value
Defines an external variable accessible within the XQuery or JSONiq expression.
--output-format format
Specifies the output format of the extracted data (e.g., adhoc, xml, html, xml-wrapped, json-wrapped, bash, cmd).
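--xpath expression
Evaluates the given expression as XPath; a shorthand for --extract with the expression kind fixed to XPath.
--xquery expression
Evaluates the given expression as XQuery; a shorthand for --extract with the expression kind fixed to XQuery.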
--jsoniq
Forces interpretation of the expression as JSONiq.
--download file
Saves the retrieved data to the given file name, similar to wget.
--connect-timeout seconds
Sets a maximum time (in seconds) for the connection to be established.
DESCRIPTION
xidel is a command-line tool for parsing, transforming, and extracting data from HTML, XML, and JSON documents. It combines the capabilities of a web downloader (like wget or curl) with several query languages: XPath 3.1, XQuery 3.1, JSONiq, CSS selectors, and pattern-matching templates. This allows users to download web pages and immediately apply queries to extract specific data, such as links, text content, or structured information from tables.
Beyond simple data extraction, xidel supports more complex operations, including form submission and session management. Its ability to process multiple inputs, load external XQuery/JSONiq modules, and format output in various ways makes it useful for web scraping, data automation, and general document processing, bridging the gap between raw data fetching and structured data manipulation.
CAVEATS
xidel does not execute client-side JavaScript, meaning it cannot interact with or render dynamic content that relies heavily on JavaScript execution. For such dynamic web pages, alternative tools like headless browsers (e.g., Puppeteer, Selenium) are required. Its powerful query languages, XPath, XQuery, and JSONiq, can have a steep learning curve for new users. Additionally, recursive fetching with -f should be used with caution as it can quickly consume bandwidth and system resources, potentially leading to rate limiting or even IP blocking.
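One cautious habit before any recursive fetch is to list the link targets first and follow them only after reviewing the list. The sketch below assumes a made-up local file `page.html`, uses only `-s` and `-e` from the parameter list above, and skips the xidel call when the binary is not installed.

```shell
#!/bin/sh
# Hypothetical page with three links; two of them share the same target.
linkdir=$(mktemp -d)
cat > "$linkdir/page.html" <<'EOF'
<html><body>
<a href="/docs">Docs</a>
<a href="/docs">Docs again</a>
<a href="/blog">Blog</a>
</body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # Print the distinct link targets that a later --follow run would request.
  xidel "$linkdir/page.html" -s -e 'distinct-values(//a/@href)'
else
  echo "xidel not found; skipping the example call" >&2
fi
```

Reviewing this list first gives an idea of how many requests a follow run would issue against the site.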
LANGUAGE SUPPORT
xidel provides comprehensive support for XPath 3.1, XQuery 3.1, and JSONiq, enabling highly precise and powerful data extraction and manipulation capabilities. This makes it suitable for complex transformations beyond simple text matching or regular expressions.
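As an illustration of how the query styles relate, the sketch below selects the same heading three ways: with XPath, with the `css()` function, and with a pattern. The file `doc.html` is hypothetical, and the exact output layout may differ between xidel versions, so no expected output is shown.

```shell
#!/bin/sh
# Hypothetical document used only for this comparison.
langdir=$(mktemp -d)
cat > "$langdir/doc.html" <<'EOF'
<html><head><title>Demo</title></head>
<body><h1>Main heading</h1><p>Some text.</p></body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # XPath: select the h1 element by path.
  xidel "$langdir/doc.html" -s -e '//h1'
  # CSS: the css() function takes a CSS selector inside an XPath expression.
  xidel "$langdir/doc.html" -s -e 'css("h1")'
  # Pattern: an HTML-like template in which {.} captures the element's content.
  xidel "$langdir/doc.html" -s -e '<h1>{.}</h1>'
else
  echo "xidel not found; skipping the example calls" >&2
fi
```

XPath and CSS suit direct selection, while a pattern is often easier to write when the expected page structure can be copied from the page source.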
HANDLING MULTIPLE INPUTS
xidel can process multiple URLs or local files in a single invocation, handling them one after another, which allows batch processing of documents or websites. This is often combined with recursive fetching or iterative scripting for larger data collection jobs.
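Batch processing through iterative scripting can be sketched with a plain shell loop that calls xidel once per file. The file names and contents below are invented for the example, and the loop is skipped when xidel is not installed.

```shell
#!/bin/sh
# Generate three hypothetical input files.
batchdir=$(mktemp -d)
for n in 1 2 3; do
  printf '<html><head><title>Page %s</title></head><body></body></html>\n' "$n" \
    > "$batchdir/page$n.html"
done

if command -v xidel >/dev/null 2>&1; then
  for f in "$batchdir"/*.html; do
    # Prefix each extracted title with the name of the file it came from.
    printf '%s\t' "$(basename "$f")"
    xidel "$f" -s -e '//title'
  done
else
  echo "xidel not found; skipping the example loop" >&2
fi
```

One call per file keeps each result clearly attributable to its input, at the cost of starting a new process for every document.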
HISTORY
xidel was originally developed by Benito van der Zander and first released around 2011. It was designed to combine robust HTML/XML/JSON parsing with the expressive power of W3C standard query languages like XPath and XQuery, directly from the command line. Unlike approaches that chain curl or wget with xmllint or custom scripts, xidel aims to provide an all-in-one solution for web data extraction and transformation. Later development added support for JSONiq and improved HTML5 parsing, keeping the tool relevant for web scraping and data processing tasks.