LinuxCommandLibrary

nokogiri

Parse HTML and XML documents

TLDR

Parse the contents of a URL or file

$ nokogiri [url|path/to/file]

Parse as a specific type
$ nokogiri [url|path/to/file] --type [xml|html]

Load a specific initialization file before parsing
$ nokogiri [url|path/to/file] -C [path/to/config_file]

Parse using a specific encoding
$ nokogiri [url|path/to/file] [-E|--encoding] [encoding]

Validate using a RELAX NG file
$ nokogiri [url|path/to/file] --rng [url|path/to/file]

SYNOPSIS

nokogiri [OPTIONS] [url|path/to/file]

PARAMETERS

--type TYPE
    Parses the input as `xml` or `html`. By default the type is detected automatically.

-C FILE
    Loads the given initialization file before parsing (default: `~/.nokogirirc`).

-E, --encoding ENCODING
    Reads the input using the given encoding.

-e COMMAND
    Runs a Ruby script given on the command line against the parsed document instead of opening an interactive session.

--rng URL|FILE
    Validates the document against the given RELAX NG schema and prints any validation errors.

-?, --help
    Displays a help message and exits.

-v, --version
    Prints version information and exits.

url|path/to/file
    The URL or file to parse. When omitted, the document is read from standard input.

DESCRIPTION

Nokogiri is primarily a Ruby library for parsing and manipulating HTML and XML documents. The `nokogiri` gem also installs a small executable of the same name. It parses a document from a URL, a file, or standard input, and then does one of three things: opens an interactive Ruby (IRB) session with the parsed document available as `@doc`, runs a one-line script supplied with `-e`, or validates the document against a RELAX NG schema supplied with `--rng`.

The command is useful for quickly inspecting markup, trying out CSS or XPath queries, and checking documents in development or testing workflows without writing a separate script.

Document comparison is not part of this executable. `nokogiri-diff` is a separate gem that adds node-level diffing to the Nokogiri library.
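As a small illustration of non-interactive use, the sketch below runs a one-line Ruby script with `-e`, where `$_` refers to the parsed document. The file name `sample.html` and its contents are made up for this example.

```shell
# Hypothetical sample document; the file name is for illustration only.
cat > sample.html <<'EOF'
<html><body><h1>First</h1><h1>Second</h1></body></html>
EOF

# Count the <h1> elements; inside -e scripts, $_ holds the parsed document.
count=$(nokogiri sample.html --type html -e 'puts $_.css("h1").length')
echo "h1 elements: $count"
```

Passing `--type html` avoids relying on automatic type detection for such a short input.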

CAVEATS

The `nokogiri` command is a thin wrapper around the library, not a full-featured XML/HTML processing tool. Without `-e` or `--rng` it opens an interactive session, so scripts and pipelines should pass one of those options. For more complex operations, such as extracting data at scale or modifying document structures, use the Nokogiri library directly in a Ruby script. The command is only available when the `nokogiri` gem is installed, and depending on how Ruby was installed, the gem's executable directory may not be on the default PATH.

INSTALLATION

To use `nokogiri`, install Ruby first and then the `nokogiri` gem, typically with `gem install nokogiri` or by adding `gem 'nokogiri'` to a Gemfile and running `bundle install`. Nokogiri relies on libxml2 and libxslt for its C extensions. Recent releases ship precompiled native gems for common platforms that bundle these libraries; when the gem is compiled from source, make sure the libraries or a working build toolchain are available.
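After installation, a quick check along these lines can show whether the executable is reachable. It is only a sketch: `Gem.bindir` reports the default RubyGems executable directory, which may differ for per-user installs or Ruby version managers.

```shell
# Look for the executable on the current PATH.
if command -v nokogiri >/dev/null 2>&1; then
  status="found: $(command -v nokogiri)"
else
  # Gem.bindir is where RubyGems installs executables by default;
  # the lookup is left empty if Ruby itself is missing.
  bindir=$(ruby -e 'puts Gem.bindir' 2>/dev/null || true)
  status="not on PATH (RubyGems executable directory: ${bindir:-unknown})"
fi
echo "nokogiri $status"
```

If the command is reported as missing, adding the printed directory to PATH is usually enough.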

COMPARISON LOGIC

The `nokogiri` command does not compare documents. The separate `nokogiri-diff` gem extends the library with structural comparison: rather than comparing text lines, it works on the document tree and reports added and removed nodes (elements, attributes, and text), which makes it better suited to structured data than a line-by-line diff.
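One rough way to approximate a structure-aware comparison from the shell is to list each element's XPath with `nokogiri -e` and diff the two lists. This is an illustrative sketch under that assumption, not the algorithm used by nokogiri-diff; the file names and the `element_paths` helper are hypothetical.

```shell
# Hypothetical sample files, created only for illustration.
printf '<root><a/><b/></root>\n' > old.xml
printf '<root>\n  <a/>\n  <b/>\n  <c/>\n</root>\n' > new.xml

# Print the XPath of every element in a document, one per line.
element_paths() {
  nokogiri "$1" --type xml -e '$_.traverse { |n| puts n.path if n.element? }'
}

element_paths old.xml > old.paths
element_paths new.xml > new.paths

# Whitespace-only formatting changes do not show up; the added <c/> does.
# diff exits non-zero when the lists differ, hence the "|| true".
diff old.paths new.paths || true
```

Because only element paths are compared, changes to attributes or text content would go unnoticed; the script passed to `-e` would need to print those as well to catch them.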

HISTORY

Nokogiri was first released in 2008, developed by Aaron Patterson and others. It quickly became the de facto standard for XML and HTML parsing in the Ruby ecosystem thanks to its speed (it is built on the C libraries libxml2 and libxslt) and its expressive API. The `nokogiri` executable has been bundled with the gem for many years as a convenient way to load a document and explore it from the command line.

SEE ALSO

diff(1): Compares two files line by line.
xmllint(1): A command-line XML tool from libxml2, often used for parsing, validating, and formatting XML.
tidy(1): A command-line tool for cleaning up and validating HTML.
gem(1): The RubyGems package manager command, used for installing and managing Ruby gems like `nokogiri`.
bundle(1): Bundler, a dependency manager for Ruby, often used to install gems including Nokogiri.
