LinuxCommandLibrary

monolith

Save web page as single HTML file

TLDR

Save a webpage as a single HTML file

$ monolith [url]

Save a webpage as a single HTML file, excluding audio
$ monolith [url] [[-a|--no-audio]]

Save a webpage as a single HTML file, excluding CSS
$ monolith [url] [[-c|--no-css]]

Save a webpage as a single HTML file, excluding images
$ monolith [url] [[-i|--no-images]]

Save a webpage as a single HTML file, excluding videos
$ monolith [url] [[-v|--no-video]]

Save a webpage as a single HTML file, excluding JavaScript
$ monolith [url] [[-j|--no-js]]

Save a webpage as a single HTML file, accepting invalid TLS certificates
$ monolith [url] [[-k|--insecure]]

Save a webpage as a single HTML file, writing to a specific output file
$ monolith [url] [[-o|--output]] [path/to/file.html]

SYNOPSIS

monolith [OPTIONS] URL...
monolith [OPTIONS] - (to read HTML from standard input)

PARAMETERS

-a, --no-audio
    Remove audio sources.

-b, --base-url
    Set a custom base URL for resolving relative links.

-B, --blacklist-domains
    Treat the domains given with -d as a blacklist instead of an allowlist.

-c, --no-css
    Remove CSS.

-C, --cookies
    Read cookies from the specified file.

-d, --domain
    Restrict asset retrieval to the specified domain(s); can be specified multiple times.

-e, --ignore-errors
    Ignore network errors.

-E, --encoding
    Save the document using a custom character encoding.

-f, --no-frames
    Remove frames and iframes.

-F, --no-fonts
    Remove web fonts.

-i, --no-images
    Remove images.

-I, --isolate
    Cut the saved document off from the Internet.

-j, --no-js
    Remove JavaScript.

-k, --insecure
    Accept invalid X.509 (TLS) certificates.

-M, --no-metadata
    Do not add timestamp and source URL information to the output.

-n, --unwrap-noscript
    Replace NOSCRIPT elements with their contents.

-o, --output
    Write the output to a file (defaults to standard output; - also selects standard output).

-q, --quiet
    Suppress informational output.

-t, --timeout
    Set the network request timeout in seconds.

-u, --user-agent
    Set a custom User-Agent string for requests.

-v, --no-video
    Remove video sources.

-h, --help
    Show the help message and exit.

-V, --version
    Show the program's version and exit.

Option names have changed between releases; monolith --help lists the options supported by the installed version.

DESCRIPTION

monolith is a command-line utility that saves a web page as a single, self-contained HTML file. It fetches the page's external assets, such as images, CSS stylesheets, JavaScript, and fonts, and embeds them in the document as data URLs. The saved page can then be opened offline without depending on the original servers for those resources.

Typical uses include offline reading, long-term archival, and sharing a page as one portable file. Options control which asset types are embedded, whether the saved document may reach the network, and where the output is written. monolith does not execute JavaScript while saving, so it captures the page as served rather than as rendered by a browser.
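
To make the embedding idea concrete, the following is a minimal sketch of how a file can be turned into a base64 data URL, the general mechanism for inlining an asset into HTML. The helper name to_data_url is hypothetical and is not part of monolith; it only illustrates the technique and assumes the base64 and tr utilities are available.

```shell
#!/bin/sh
# Hypothetical helper (not part of monolith): print a data URL for a file,
# given its MIME type and path.
to_data_url() {
    mime=$1
    file=$2
    # Strip newlines because some base64 implementations wrap long output.
    printf 'data:%s;base64,%s\n' "$mime" "$(base64 < "$file" | tr -d '\n')"
}

# Demo: encode a two-byte text file.
demo=$(mktemp)
printf 'hi' > "$demo"
to_data_url text/plain "$demo"   # prints data:text/plain;base64,aGk=
rm -f "$demo"
```

A URL of this form can replace the src or href of an asset reference, which is why the saved document no longer needs the original server for that asset.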

CAVEATS

monolith is not preinstalled on most Linux distributions; it has to be installed separately (e.g., via cargo or a package manager).
It is a single compiled Rust binary and does not execute JavaScript or drive a browser itself.
Pages that build their content with JavaScript or load data from external APIs at runtime may therefore be saved incomplete.
Archiving very large or asset-heavy pages can use significant memory, and the output grows because base64-encoded assets are roughly a third larger than the originals.
monolith downloads and processes remote content, so treat untrusted URLs with the usual care.
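
For JavaScript-heavy pages, one possible workaround is to let a headless browser render the page first and pipe the resulting DOM into monolith through standard input. The sketch below assumes the browser binary is named chromium on this system (it may be chromium-browser or google-chrome elsewhere); the function name archive_rendered is hypothetical.

```shell
#!/bin/sh
# Sketch: render with headless Chromium, then let monolith embed the assets.
# Assumption: the browser binary is called "chromium" on this system.
archive_rendered() {
    url=$1
    out=$2
    # --dump-dom prints the rendered DOM; monolith reads it via the "-" target.
    # --base-url lets monolith resolve relative asset paths against the origin.
    chromium --headless --incognito --dump-dom "$url" \
        | monolith - --isolate --base-url "$url" --output "$out"
}

# Example call (needs network access, Chromium, and monolith):
# archive_rendered https://example.com example.html
```

The browser does the rendering and monolith only does the embedding, so the result reflects the DOM at the moment Chromium dumped it.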

INSTALLATION

monolith is written in Rust and can be installed with Cargo:
cargo install monolith

It is also packaged for several package managers, for example Homebrew and Snap:
brew install monolith
snap install monolith

USAGE EXAMPLES

Archive a web page and save it to a file:
monolith https://example.com -o example.html

Archive a page without embedding images or JavaScript:
monolith -i -j https://example.com -o no_media.html

Archive a page and isolate the saved document from the network:
monolith --isolate https://example.com -o isolated.html

Archive HTML read from standard input (note the - target):
cat my_local_page.html | monolith - -o archived_local.html
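
The single-page commands above can be combined into a small batch script. The following is a sketch only: the helper names url_to_filename and archive_list are hypothetical, and the input is assumed to be a text file with one URL per line, where blank lines and lines starting with # are skipped.

```shell
#!/bin/sh
# Hypothetical helper: turn a URL into a safe output file name.
url_to_filename() {
    name=$(printf '%s' "$1" \
        | sed -e 's|^[A-Za-z][A-Za-z0-9+.-]*://||' -e 's|[^A-Za-z0-9._-]|_|g')
    printf '%s.html\n' "$name"
}

# Hypothetical helper: archive every URL listed in a file, one per line.
archive_list() {
    list=$1
    while IFS= read -r url; do
        # Skip blank lines and comment lines.
        case $url in ''|'#'*) continue ;; esac
        # Redirect stdin so monolith cannot consume the rest of the list.
        monolith "$url" --output "$(url_to_filename "$url")" < /dev/null \
            || printf 'failed: %s\n' "$url" >&2
    done < "$list"
}

# Example call (needs network access and monolith):
# archive_list urls.txt
```

Deriving the file name from the URL keeps outputs from overwriting each other; failures are reported on standard error and the loop moves on to the next URL.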

HISTORY

monolith is written in Rust and is developed in the Y2Z/monolith repository on GitHub. It was created to provide a simple way to save a web page as a single, portable file. A browser's 'save as' feature typically produces an HTML file plus a folder of assets, whereas monolith embeds the assets directly in the HTML.

It builds on Rust libraries such as html5ever for HTML parsing and reqwest for HTTP requests, and it ships as a single binary. Over time it has gained options for excluding individual asset types, restricting which domains assets are fetched from, and isolating the saved document from the network.

SEE ALSO

wget(1): Command-line utility for retrieving files using HTTP, HTTPS, and FTP. Primarily for downloading, not single-file archiving.
curl(1): A versatile command-line tool for transferring data with URLs. Similar to wget, but more focused on data transfer.
httrack(1): A free, open-source web crawler that downloads a website to a local directory, recursively recreating its directories and fetching HTML, images, and other files.
ArchiveBox (tool): A more comprehensive self-hosted web archiving solution, often used for personal archives.
SingleFile (browser extension): A browser-based tool that saves a complete web page as a single HTML file, similar to monolith but integrated within a web browser.
