tidy
Correct and clean up HTML, XML, and XHTML
TLDR
Pretty print an HTML file
Enable indentation, wrap lines at 100 columns, and save the result to output.html
Modify an HTML file in-place using a configuration file
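The command lines for the three tasks above are not shown. The following is a minimal sketch built only from options listed under PARAMETERS. The file names page.html, output.html, and tidy.conf are placeholders, and the first two commands only create sample input so that the rest can be tried as written.

```shell
# Sample input and configuration (placeholder names, not part of tidy itself)
printf '<title>Demo</title><p>Hello, world\n' > page.html
printf 'indent: auto\n' > tidy.conf

# Pretty print an HTML file to standard output
tidy -i page.html

# Enable indentation, wrap lines at 100 columns, save to output.html
tidy -i --wrap 100 -o output.html page.html

# Modify the file in place using a configuration file
tidy -config tidy.conf -m page.html
```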
SYNOPSIS
tidy [options] [input_file]
tidy [options] -o output_file [input_file]
cat input_file | tidy [options] > output_file
The tidy command processes an HTML, XHTML, or XML file (or standard input) and writes the cleaned, corrected, or converted output to standard output or to a specified file. The options control its behavior and output formatting, and should be given before the input file.
PARAMETERS
-o
Specifies the output file. If omitted, output goes to standard output.
-m, -modify
Modifies the input file(s) in place, overwriting the original with the tidied version. Use with caution.
-i, -indent
Enables intelligent indentation of elements to improve readability. Can be further configured with --indent-spaces.
-c, -clean
Replaces presentational tags such as FONT, NOBR, and CENTER with CSS style rules, producing cleaner, more standards-compliant markup.
-xml
Treats the input as generic XML, ensuring well-formedness and pretty-printing XML structures.
-asxml, -asxhtml
Converts HTML to well-formed XHTML, enforcing XHTML syntax rules. The two spellings are synonyms.
-ashtml
Forces XHTML input to be written out as well-formed HTML.
-q, -quiet
Suppresses nonessential output such as the closing summary, which is useful in automated scripts. Warnings and errors are still reported.
-e, -errors
Shows only errors and warnings and suppresses the tidied markup. Useful for checking a document without rewriting it.
-config
Specifies an alternative configuration file to load settings from, overriding default or environmental configurations.
--doctype
Sets the DOCTYPE declaration. Common values include 'auto', 'omit', 'strict', 'transitional', 'html5'.
--wrap
Specifies the maximum column width for wrapping text and tags. A value of 0 disables wrapping.
-h, --help
Displays a help message listing the command-line options. Configuration settings are listed separately by -help-config.
-v, --version
Prints the version information of the tidy utility.
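As a hedged illustration of the conversion options above, the sketch below converts a small HTML fragment to XHTML and pretty prints a generic XML file. The file names are placeholders, and the sample inputs are created only so that the commands can be run as written.

```shell
# Sample inputs (placeholder names)
printf '<title>Demo</title><p>Hello<br>world\n' > page.html
printf '<root><item>1</item><item>2</item></root>\n' > data.xml

# Convert HTML to well-formed XHTML, suppressing nonessential output
tidy -q -asxhtml -o page.xhtml page.html

# Treat the input as generic XML and indent it
tidy -q -xml -i -o data-pretty.xml data.xml
```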
DESCRIPTION
The tidy command is a versatile command-line utility and library designed to clean up, correct, and format HTML, XHTML, and XML documents. Developed by Dave Raggett of the W3C, it serves as an invaluable tool for web developers, content creators, and automated systems. tidy can fix common markup errors, ensure compliance with web standards, and pretty-print documents for better readability.
Its core functions include error detection and correction (e.g., missing closing tags, unquoted attributes), conversion between markup standards (such as HTML to XHTML), and checking documents against accessibility guidelines. Users can configure tidy extensively to control output format, tag casing, indentation, and much more, either via command-line options or dedicated configuration files. It is frequently used in automated workflows for validating web content, preparing documents for publication, or simply improving the quality and consistency of markup.
CAVEATS
tidy is a powerful tool, but users should be aware of certain limitations and behaviors:
Over-correction: While designed to fix errors, in some rare cases, tidy might interpret unconventional but valid markup as an error and alter it in an unintended way.
Strictness: For very malformed or proprietary markup, tidy might struggle to produce perfectly clean output or might report numerous errors that require manual intervention.
XML Validity vs. Well-formedness: When treating input as XML or outputting as XML, tidy primarily ensures XML well-formedness (correct syntax) but does not validate against a DTD or XML Schema (which requires a dedicated XML validator like xmllint).
Configuration Complexity: With a large number of configuration options, mastering tidy's full capabilities and fine-tuning its behavior for specific needs can require significant effort.
CONFIGURATION FILES
tidy can load configuration options from a file, typically named .tidyrc or tidy.conf. This allows users to define a set of default or project-specific settings, avoiding the need to specify numerous command-line options every time. The -config option specifies the path of a custom configuration file.
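A minimal sketch of a project configuration file and its use follows. The file names and the particular settings chosen are illustrative assumptions. Each line of the file holds one "option: value" pair, using the same option names that the command line accepts with a leading "--".

```shell
# Write a small configuration file, one "option: value" pair per line
cat > tidy.conf <<'EOF'
indent: auto
wrap: 100
output-xhtml: yes
quiet: yes
EOF

# Sample input (placeholder name)
printf '<title>Demo</title><p>Hello\n' > page.html

# Apply the settings from tidy.conf and write the result to page.xhtml
tidy -config tidy.conf -o page.xhtml page.html
```

Keeping the settings in a file means every invocation in a project formats documents the same way.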
INTEGRATION INTO WORKFLOWS
Due to its command-line interface and robust capabilities, tidy is frequently integrated into automated build systems, continuous integration pipelines, and text editor plugins. It's used for pre-commit checks, ensuring code quality, converting legacy documents, and preparing content for web publication.
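One way such a check might be wired into a hook or pipeline step is sketched below. The function name check_html is hypothetical. The sketch relies on tidy's exit status, which is 0 for clean input, 1 when warnings were reported, and 2 when errors were reported, and it treats only errors as a failure.

```shell
# check_html is a hypothetical helper name, not part of tidy.
# It returns 1 if tidy reports errors (exit status 2 or higher) for any file.
check_html() {
    failed=0
    for file in "$@"; do
        rc=0
        # -q: suppress nonessential output; -e: print diagnostics only
        tidy -q -e "$file" || rc=$?
        if [ "$rc" -ge 2 ]; then
            echo "tidy reported errors in $file" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

A hook could then call, for example, check_html index.html about.html and abort the commit when the function returns a non-zero status. A missing tidy binary also counts as a failure here, because the shell reports status 127 for it.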
ERROR AND WARNING REPORTING
Beyond simply cleaning markup, tidy provides detailed reports of errors and warnings encountered during processing. This diagnostic output is invaluable for debugging HTML issues, understanding standard violations, and improving overall document quality. The amount of output can be controlled with options such as -q, which suppresses nonessential output, and -e, which prints only the diagnostics.
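The sketch below shows one way to keep the diagnostics as a report file. It assumes that tidy writes its errors and warnings to standard error, so an ordinary shell redirection captures them. The file names are placeholders.

```shell
# Sample input (placeholder name); the missing DOCTYPE should draw a warning
printf '<title>Demo</title><p>Hello\n' > page.html

# -e prints only the diagnostics, which go to standard error, so redirect
# that stream to a file. tidy exits non-zero when it finds problems, hence
# the "|| true" to keep a calling script running.
tidy -q -e page.html 2> tidy-report.txt || true

# Show what was reported
cat tidy-report.txt
```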
HISTORY
HTML Tidy was originally developed by Dave Raggett, a leading figure at the World Wide Web Consortium (W3C), with its first public release around 1998. Raggett created tidy to help web authors produce cleaner, more compliant HTML markup, addressing the common issues of poorly structured or non-standard code prevalent in the early days of the web. It quickly became an indispensable tool for validating web content and ensuring adherence to HTML and XHTML standards.
While initially developed and maintained at the W3C, its development has since transitioned to an open-source community effort. Maintenance moved first to a project on SourceForge and later to the HTML Tidy Advocacy Community Group (HTACG) on GitHub, which added support for newer standards such as HTML5. Its continued use reflects its utility in automated web development workflows and quality assurance processes.