xidel is a command-line tool to download and extract data from HTML/XML pages.


Print all URLs found by a Google search

$ xidel [] --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
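The extract() call above applies a regular expression to each @href and keeps capture group 1, the target of Google's /url?q=... redirect. The same extraction can be sketched stand-alone with sed; the hrefs below are hypothetical samples standing in for the //a/@href values of a real results page:

```shell
# Hypothetical sample hrefs, standing in for //a/@href values from a results page
hrefs='/url?q=https://example.com/page&sa=U
/url?q=https://another.example/&sa=U'

# Equivalent of extract(@href, 'url[?]q=([^&]+)&', 1): print only capture group 1
urls=$(printf '%s\n' "$hrefs" | sed -n 's/.*url?q=\([^&]*\)&.*/\1/p')
printf '%s\n' "$urls"
```

In xidel's regex the ? is written as [?] because a bare ? would be a quantifier; in the sed basic regular expression above, ? is already literal.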

Print the title of all pages found by a Google search and download them
$ xidel [] --follow "[//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']]" --extract [//title] --download ['{$host}/']

Follow all links on a page and print the titles, with XPath
$ xidel [] --follow [//a] --extract [//title]

Follow all links on a page and print the titles, with CSS selectors
$ xidel [] --follow "[css('a')]" --css [title]

Follow all links on a page and print the titles, with pattern matching
$ xidel [] --follow "[<a>{.}</a>*]" --extract "[<title>{.}</title>]"

Match a pattern against example.xml (this will also check that the element containing "ood" is there, and fail otherwise)
$ xidel [path/to/example.xml] --extract "[<x><foo>ood</foo><bar>{.}</bar></x>]"

Print all newest Stack Overflow questions with title and URL using pattern matching on their RSS feed
$ xidel [] --extract "[<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+]"

Check for unread Reddit mail (web scraping combining CSS selectors, XPath, JSONiq, and automatic form evaluation)
$ xidel [] --follow "[form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})]" --extract "[css('#mail')/@title]"


The trivial usage is to extract an expression from a webpage like:

xidel --extract //title

The next important option is --follow to follow links on a page. The following example will print the titles of all pages that are linked from the given page:

xidel --follow //a --extract //title


Xidel supports:

Extract expressions: there are a few different kinds of extract expressions:

o CSS 3 Selectors : to extract simple elements
o XPath 3 : to extract values and calculate things with them
o XQuery 3 : to create new documents from the extracted values
o JSONiq : to work with JSON apis
o Templates : to extract several expressions in an easy way, using an annotated version of the page for pattern matching
o Multipage templates : i.e. a file that contains templates for several pages
o XPath 2 / XQuery 1 : for legacy queries


Following:

o HTTP Codes : redirections like 30x are automatically followed, while keeping things like cookies
o Links : it can follow all links on a page as well as some extracted values

Output formats:

o Adhoc : just prints the data in a human-readable format
o XML : encodes the data in XML
o HTML : encodes the data in HTML
o JSON : encodes the data as JSON
o bash/cmd : exports the data as shell variables

Connections: HTTP / HTTPS as well as local files or stdin
Systems: Windows (using wininet), Linux (using synapse+openssl), Mac (with newest synapse)


Call xidel --help to see a list of supported command-line options.

Call xidel --usage for a full reference.


  1. Print all URLs found by a Google search:

     xidel --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"

  2. Print the titles of all pages found by a Google search and download them:

     xidel --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'

Generally, to follow all links on a page and print the titles of the linked pages:

With XPath : xidel -f //a -e //title

With CSS : xidel -f "css('a')" --css title

With Templates: xidel -f "<a>{.}</a>*" -e "<title>{.}</title>"

  1. If you have an example.xml file like "<x><foo>ood</foo><bar>IMPORTANT!</bar></x>", you can read the important part like:

xidel example.xml -e "<x><foo>ood</foo><bar>{.}</bar></x>"

(this will also check that the element containing "ood" is there, and fail otherwise)

  1. Calculate something with XPath:

     xidel -e "(1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008"
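The point of this example is the magnitude: at around 6e12, IEEE double precision can no longer represent the 0.000000008 fraction, so tools that compute with doubles drop it, while xidel's XPath decimal arithmetic should keep it. The awk run below (awk is just the comparison here, not part of xidel) shows the double-precision behaviour:

```shell
# Same expression in awk, which uses IEEE doubles: near 6e12 the spacing between
# adjacent doubles is roughly 0.001, so the 0.000000008 fraction is rounded away
result=$(awk 'BEGIN { printf "%.9f\n", (1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008 }')
printf '%s\n' "$result"
```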

  2. Print title and URL of the newest Stack Overflow questions:

     xidel -e "<A class='question-hyperlink'>{title:=text(), url:=@href}</A>*"

  3. Use a t:loop template to extract repeated blocks and follow the "next" link:

     xidel "" --extract "<t:loop><div class='usertext-body'><div>{outer-xml(.)}</div></div><ul class='flat-list buttons'><a><t:s>link:=@href</t:s>permalink</a></ul></div></div></t:loop>" --follow "<a rel='nofollow next'>{.}</a>?"

  4. Web scraping, combining CSS, XPath, JSONiq, and automatic form evaluation:

xidel -f "form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})" -e "css('#mail')/@title"

Using the Reddit API:

xidel -d "user=$your_username&passwd=$your_password&api_type=json" --method GET '' -e '($json).data.has_mail'

  1. Generate a large table with XQuery and output it as XML:

     xidel --xquery '<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then "even" else "odd"}</td></tr>}</table>' --output-format xml
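For readers less used to XQuery, the same table can be sketched as a plain shell loop; the XQuery version above is what xidel actually runs:

```shell
# Build <table> rows 1..1000, labelling each number even or odd
rows=""
i=1
while [ "$i" -le 1000 ]; do
  if [ $((i % 2)) -eq 0 ]; then parity=even; else parity=odd; fi
  rows="$rows<tr><td>$i</td><td>$parity</td></tr>"
  i=$((i + 1))
done
table="<table>$rows</table>"
```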

  2. eval "$(xidel http://site -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"

This sets the bash variable $title to the title of the page, and $links becomes an array of all links on it.
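The mechanism can be tried without a live page: --output-format bash makes xidel print plain variable assignments, which eval then loads into the current shell. The assignment text below is a hand-written stand-in for that output (the exact quoting xidel emits may differ):

```shell
# Hand-written stand-in for the output of `xidel ... --output-format bash`
xidel_output="title='Example Page'"
eval "$xidel_output"
printf '%s\n' "$title"
```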


Use XIDEL_OPTIONS to set global command line options.
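For example, to apply an output format to every invocation in a session (the flag comes from the output-format list above; which flags you put here is up to your workflow):

```shell
# Every subsequent xidel call in this shell will see these options
export XIDEL_OPTIONS="--output-format bash"
```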


See also: wget(1), curl(1)


Author: Benito van der Zander, <>,

Download link:

You can test it online, or directly by sending a request to the CGI service, like <html><title>foobar</title></html>&extract=//title&raw=true
