xidel

is a command-line tool to download and extract data from HTML/XML pages.

TLDR

Print all URLs found by a Google search

$ xidel [https://www.google.com/search?q=test] --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"


Print the titles of all pages found by a Google search and download them
$ xidel [https://www.google.com/search?q=test] --follow "[//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']]" --extract [//title] --download ['{$host}/']


Follow all links on a page and print the titles, with XPath
$ xidel [https://example.org] --follow [//a] --extract [//title]


Follow all links on a page and print the titles, with CSS selectors
$ xidel [https://example.org] --follow "[css('a')]" --css [title]


Follow all links on a page and print the titles, with pattern matching
$ xidel [https://example.org] --follow "[<a>{.}</a>*]" --extract "[<title>{.}</title>]"


Match a pattern against example.xml (this will also check that the element containing "ood" is present, and fail otherwise)
$ xidel [path/to/example.xml] --extract "[<x><foo>ood</foo><bar>{.}</bar></x>]"


Print the newest Stack Overflow questions with their titles and URLs, using pattern matching on the RSS feed
$ xidel [http://stackoverflow.com/feeds] --extract "[<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+]"


Check for unread Reddit mail by web scraping, combining CSS selectors, XPath, JSONiq, and automatic form evaluation
$ xidel [https://reddit.com] --follow "[form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})]" --extract "[css('#mail')/@title]"

SYNOPSIS

The trivial usage is to extract an expression from a webpage like:

xidel http://www.example.org --extract //title

The next important option is --follow to follow links on a page. The following example will print the titles of all pages that are linked from http://www.example.org.

xidel http://www.example.org --follow //a --extract //title

DESCRIPTION

Xidel supports:

Extract expressions: there are a few different kinds of extract expressions (a short sketch follows this list)

o CSS 3 Selectors : to extract simple elements
o XPath 3 : to extract values and calculate things with them
o XQuery 3 : to create new documents from the extracted values
o JSONiq : to work with JSON apis
o Templates : to extract several expressions in an easy way, using an annotated version of the page for pattern-matching
o Multipage templates : i.e. a file that contains templates for several pages
o XPath 2 / XQuery 1 : for legacy queries
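
As a sketch, the same kind of extraction can be phrased in most of these styles (the --xpath shorthand flag and the JSON URL are assumptions here; check --usage for the exact flag names of your version):

With CSS : xidel https://example.org --css title
With XPath : xidel https://example.org --xpath //title
With XQuery : xidel https://example.org --xquery "<t>{//title/text()}</t>"
With JSONiq : xidel https://api.example.org/user.json -e "($json).name"
With a template : xidel https://example.org -e "<title>{.}</title>"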

Following:

o HTTP Codes : Redirections like 30x are automatically followed, while keeping things like cookies
o Links : It can follow all links on a page as well as some extracted values
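
For instance, instead of following every link, you can follow just an extracted value, such as a pagination link (the class name here is a made-up example):

xidel https://example.org -f '//a[@class="next"]/@href' -e //title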

Output formats:

o Adhoc : just prints the data in a human-readable format
o XML : encodes the data in XML
o HTML : encodes the data in HTML
o JSON : encodes the data in JSON
o bash/cmd : exports the data as shell variables
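
A sketch of the same query in two of these formats (xml and bash also appear in the examples below; the identifier for the JSON format may differ between versions, so check --usage):

xidel https://example.org -e //title --output-format xml
xidel https://example.org -e //title --output-format bash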

Connections: HTTP / HTTPS as well as local files or stdin
Systems: Windows (using wininet), Linux (using synapse+openssl), Mac (with newest synapse)
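
A minimal sketch of the non-HTTP inputs (assuming - denotes stdin, as is the usual convention):

xidel path/to/local.html -e //title
echo "<html><title>foobar</title></html>" | xidel - -e //title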

OPTIONS

Call xidel --help to see a list of supported command-line options.

Call xidel --usage for a full reference.

EXAMPLES - Basics

  1. xidel http://www.google.de/search?q=test --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"

  2. xidel http://www.google.de/search?q=test --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'

Generally, to follow all links on a page and print the titles of the linked pages:

With XPath : xidel http://example.org -f //a -e //title

With CSS : xidel http://example.org -f "css('a')" --css title

With Templates: xidel http://example.org -f "<a>{.}</a>*" -e "<title>{.}</title>"

  3. If you have an example.xml file like "<x><foo>ood</foo><bar>IMPORTANT!</bar></x>", you can read the important part like:

xidel example.xml -e "<x><foo>ood</foo><bar>{.}</bar></x>"

(this will also check that the part containing "ood" is there, and fail otherwise)

  4. xidel -e "(1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008"
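
(This showcases the arbitrary-precision decimal arithmetic; it should print 6000000000022.000000008 rather than a rounded float.)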

  5. xidel http://stackoverflow.com -e "<A class='question-hyperlink'>{title:=text(), url:=@href}</A>*"

  6. xidel "http://www.reddit.com/user/username/" --extract "<t:loop><div class='usertext-body'><div>{outer-xml(.)}</div></div><ul class='flat-list buttons'><a><t:s>link:=@href</t:s>permalink</a></ul></t:loop>" --follow "<a rel='nofollow next'>{.}</a>?"

  7. Web scraping, combining CSS, XPath, JSONiq, and automatic form evaluation:

xidel http://reddit.com -f "form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})" -e "css('#mail')/@title"

Using the Reddit API:

xidel -d "user=$your_username&passwd=$your_password&api_type=json" https://ssl.reddit.com/api/login --method GET 'http://www.reddit.com/api/me.json' -e '($json).data.has_mail'

  8. xidel --xquery '<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then "even" else "odd"}</td></tr>}</table>' --output-format xml

  9. eval "$(xidel http://site -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"

This sets the bash variable $title to the title of the page and makes $links an array of all the links on it.
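
The variables can then be used like any other bash variables:

echo "$title"
echo "${links[0]}"
printf '%s\n' "${links[@]}"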

ENVIRONMENT

Use XIDEL_OPTIONS to set global command line options.
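
For example, to apply an option to every invocation (the --silent flag is an assumption; see xidel --help for the options your build supports):

export XIDEL_OPTIONS="--silent"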

SEE ALSO

wget(1), curl(1)

AUTHOR

Benito van der Zander, <benito_NOSPAM_benibela.de>, http://www.benibela.de

Download link: http://sourceforge.net/projects/videlibri/files/Xidel/

You can test it online at http://www.videlibri.de/cgi-bin/xidelcgi, or directly by sending a request to the CGI service, like http://www.videlibri.de/cgi-bin/xidelcgi?data=<html><title>foobar</title></html>&extract=//title&raw=true
