monolith
Save web page as single HTML file
TLDR
Save a webpage as a single HTML file
Save a webpage as a single HTML file, excluding audio
Save a webpage as a single HTML file, excluding CSS
Save a webpage as a single HTML file, excluding images
Save a webpage as a single HTML file, excluding videos
Save a webpage as a single HTML file, excluding JavaScript
Save a webpage as a single HTML file, accepting invalid TLS certificates
Save a webpage as a single HTML file, specifying a specific output file
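The exclusion variants listed above differ only in which `--no-*` flag is passed. A minimal sketch, assuming a POSIX shell; `save_page` is a hypothetical helper name, not part of monolith:

```shell
# Hypothetical helper: save_page URL OUTPUT [audio|css|images|video|js ...]
# Maps each asset type to the matching --no-* flag and runs monolith once.
save_page() {
  url=$1
  out=$2
  shift 2
  flags=""
  for kind in "$@"; do
    case $kind in
      audio)  flags="$flags --no-audio" ;;
      css)    flags="$flags --no-css" ;;
      images) flags="$flags --no-images" ;;
      video)  flags="$flags --no-video" ;;
      js)     flags="$flags --no-js" ;;
      *) echo "unknown asset type: $kind" >&2; return 2 ;;
    esac
  done
  # $flags is left unquoted on purpose so each flag becomes its own argument.
  monolith $flags "$url" --output "$out"
}
```

For example, `save_page https://example.com page.html images js` would run monolith with `--no-images --no-js` and write `page.html`.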
SYNOPSIS
monolith [OPTIONS] URL...
monolith [OPTIONS] - (to read HTML from standard input)
PARAMETERS
-V, --version
Show the program's version and exit.
-o, --output
Specify the output file name (defaults to standard output).
-a, --no-audio
Do not embed audio sources.
-v, --no-video
Do not embed video sources.
-i, --no-images
Do not embed images.
-F, --no-fonts
Do not embed web fonts.
-j, --no-js
Do not embed JavaScript.
-c, --no-css
Do not embed CSS stylesheets.
-f, --no-frames
Do not embed frame or iframe elements.
--no-metadata
Do not add the timestamp and source URL to the saved document.
-t, --timeout
Set the network request timeout in seconds.
-u, --user-agent
Set a custom User-Agent string for requests.
-b, --base-url
Set the base URL for resolving relative paths.
-s, --silent
Suppress log output.
-I, --isolate
Cut the saved document off from the Internet so that it cannot load remote resources.
-k, --insecure
Accept invalid X.509 (TLS) certificates.
--encoding
Enforce a custom character encoding for the saved document.
-e, --ignore-errors
Continue processing even if network errors occur while fetching resources.
-h, --help
Show the help message and exit.
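Several options are often combined when archiving more than one page. The following is a sketch only, assuming a POSIX shell and a text file with one URL per line; `archive_list`, the 30-second timeout, and the `page-N.html` naming scheme are illustrative choices, not monolith features:

```shell
# Hypothetical helper: archive_list FILE OUTDIR
# Saves every URL listed in FILE (one per line) into OUTDIR.
archive_list() {
  list=$1
  outdir=$2
  mkdir -p "$outdir" || return 1
  n=0
  while IFS= read -r url; do
    # Skip blank lines and lines starting with '#'.
    case $url in ''|'#'*) continue ;; esac
    n=$((n + 1))
    # stdin is redirected so monolith cannot consume the URL list.
    monolith --silent --timeout 30 "$url" --output "$outdir/page-$n.html" \
      < /dev/null || echo "failed: $url" >&2
  done < "$list"
}
```

A failed page is reported on standard error and the loop moves on to the next URL.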
DESCRIPTION
monolith is a command-line utility that archives a web page into a single, self-contained HTML file. It embeds external assets such as images, CSS stylesheets, JavaScript, and fonts directly into the HTML document, so the saved page can be viewed offline without fetching linked resources from external servers.
It is useful for offline reading, long-term archival, content sharing, and creating portable web documents. monolith provides options to control which asset types are embedded, how network requests are made, and where the output is written.
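One rough way to check that a saved page no longer depends on external servers is to count the `src` attributes that still point at http(s) URLs. This is a heuristic sketch, assuming a POSIX shell with grep; `count_remote_srcs` is a hypothetical helper, and it ignores other kinds of references such as `href` attributes or URLs inside CSS:

```shell
# Hypothetical helper: count_remote_srcs FILE
# Prints how many src="http://..." or src="https://..." attributes remain in FILE.
count_remote_srcs() {
  # grep exits non-zero when nothing matches; '|| true' keeps the pipeline usable.
  { grep -o -i -E 'src="https?://[^"]*"' "$1" || true; } | wc -l | tr -d ' '
}
```

A result of 0 suggests that every `src` attribute was embedded or removed; a non-zero count points at resources that were left as remote references.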
CAVEATS
monolith is not a standard component of most Linux distributions; it typically requires separate installation (e.g., via cargo or a package manager such as Homebrew).
It is written in Rust and does not include a JavaScript engine or a headless browser.
Pages that fetch and display content after the initial load might therefore be saved incompletely unless the rendered DOM is produced by another tool first and passed to monolith on standard input.
Archiving very large or complex web pages can consume significant memory and CPU resources.
Security implications exist if used to process untrusted URLs, as it downloads and processes external content.
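For pages whose content is produced by JavaScript, one workaround is to let a headless browser render the page and pipe the resulting DOM into monolith on standard input. A minimal sketch, assuming a Chromium binary that supports `--headless` and `--dump-dom`; `render_and_save` and the `CHROMIUM` variable are hypothetical names introduced here:

```shell
# Hypothetical helper: render_and_save URL OUTPUT
# Renders URL in headless Chromium, then lets monolith embed the assets.
render_and_save() {
  url=$1
  out=$2
  # CHROMIUM may be overridden, e.g. CHROMIUM=google-chrome.
  "${CHROMIUM:-chromium}" --headless --dump-dom "$url" \
    | monolith - --base-url "$url" --output "$out"
}
```

The `--base-url` option is passed because HTML read from standard input has no URL of its own against which relative paths could be resolved.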
INSTALLATION
monolith is written in Rust and can be installed with Cargo:
cargo install monolith
It is also available through several package managers, for example Homebrew:
brew install monolith
USAGE EXAMPLES
Archive a web page and save it to a file:
monolith https://example.com -o example.html
Archive a page without embedding images or JavaScript:
monolith -i -j https://example.com -o no_media.html
Archive a page and isolate the saved document from the Internet:
monolith -I https://example.com -o isolated_example.html
Archive HTML read from standard input:
cat my_local_page.html | monolith - -o archived_local.html
HISTORY
The monolith command-line tool is an open-source project written in Rust and developed in the Y2Z/monolith repository on GitHub. It emerged to address the need for a simple and robust way to archive web pages into a single, portable file. Traditional browser 'save as' features often create a folder of assets alongside the HTML file, while monolith streamlines this by embedding all necessary assets directly into the HTML.
Its development focused on producing output that has no external dependencies. It gained traction among users requiring offline access, long-term preservation, and simplified sharing of web content, and it has grown to support additional asset types and options over time.
SEE ALSO
wget(1): Command-line utility for retrieving files using HTTP, HTTPS, and FTP. Primarily for downloading, not single-file archiving.
curl(1): A versatile command-line tool for transferring data with URLs. Similar to wget, but more focused on data transfer.
httrack(1): A free, open-source web crawler that downloads a website from the Internet to a local directory, recursively building all directories and fetching HTML, images, and other files from the server.
ArchiveBox (tool): A more comprehensive self-hosted web archiving solution, often used for personal archives.
SingleFile (browser extension): A browser-based tool that saves a complete web page as a single HTML file, similar to monolith but integrated within a web browser.