httrack
Download websites for offline viewing
SYNOPSIS
httrack <URLs> [options] [+<URL filter>] [-<URL filter>]
PARAMETERS
-O <path>
Specifies the output directory where the mirrored website will be saved. For example, -O /var/www/mirror.
-r <N>
Sets the mirror (recursion) depth, i.e., how many link levels HTTrack follows from the starting URL. The default is effectively unlimited (r9999), so start with a small value when testing.
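For instance, a shallow test mirror limited to two link levels might look like the following sketch (the URL and output path are placeholders):
    httrack "https://example.com/" -O ./example-mirror -r2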
-%L <file>
Adds every URL listed in the given text file (one URL per line) to the mirror; the long form is --list <file>.
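As a sketch, assuming a local file more-urls.txt containing one URL per line in addition to a placeholder starting URL:
    httrack "https://example.com/" -%L more-urls.txt -O ./mirror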
-%P
Enables extended parsing, which attempts to extract links even from unknown tags and JavaScript. This is the default behaviour; -%P0 disables it.
-F <User-Agent>
Sets the User-Agent header sent in HTTP requests. Can be used to mimic different browsers or to get past simple bot checks.
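For example, to present a common browser identity (the URL, output path, and User-Agent string below are only illustrative):
    httrack "https://example.com/" -O ./mirror -F "Mozilla/5.0 (X11; Linux x86_64)"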
-s0
Disables robots.txt and meta robots handling for this mirror. The -sN option controls robots compliance (0=never, 1=sometimes, 2=always, with 2 the default); -s0 is equivalent to --robots=0.
-v
Activates verbose output, showing more detailed information about the download process and encountered issues.
-q
Enables quiet mode, suppressing interactive questions so the mirror can run unattended; useful for scripting.
-P <proxy>
Configures HTTrack to use a proxy server (e.g., -P proxy.example.com:8080 or -P user:pass@proxy.example.com:8080).
--update
Updates an existing mirror, downloading only new or changed files instead of refetching the entire site.
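A typical refresh re-runs the original command with --update added (the URL and path are placeholders for whatever the original mirror used):
    httrack "https://example.com/" -O /var/www/mirror --update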
--robots=0
Ignores the robots.txt exclusion rules. Use with caution and respect website policies.
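For example (equivalent to the short form -s0; the URL and path are placeholders):
    httrack "https://example.com/" -O ./mirror --robots=0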
DESCRIPTION
HTTrack is a free and open-source command-line utility, available for Linux and other platforms, that allows users to download a World Wide Web site from the Internet to a local directory.
It recursively builds all directories, getting HTML, images, and other files from the server to your computer. HTTrack then arranges the original site's relative link structure, so you can simply open a page of the mirrored website in your browser and browse the site from link to link, as if you were viewing it online.
This tool is exceptionally useful for creating offline archives of websites, developing and testing web content locally without an internet connection, or analyzing a site's structure and content. It supports resuming interrupted downloads, updating existing mirrored sites, and applying various filters to control what content is downloaded.
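A minimal mirroring session might look like the following sketch, where the URL and destination directory are placeholders:
    httrack "https://www.example.com/" -O "/home/user/websites/example" -v
This downloads the site into the given directory while logging progress to the screen.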
CAVEATS
When using HTTrack, it's crucial to be mindful of a few points:
Legal & Ethical Considerations: Always respect the website's robots.txt file and its terms of service. Mirroring a site without permission or excessively burdening a server can be considered unethical or even illegal.
Dynamic Content: HTTrack primarily downloads static content (HTML, CSS, images, JS files). Websites heavily reliant on client-side JavaScript, AJAX, or server-side databases for dynamic content might not function perfectly when mirrored offline.
Resource Consumption: Mirroring large websites can consume significant disk space and bandwidth on both your machine and the target server. Be prepared for potentially long download times and large storage requirements.
COMMON USE CASES
- Website Archiving: Create a persistent offline copy of websites for historical preservation or personal reference.
- Offline Browsing: Access web content without an internet connection, ideal for travel or areas with poor connectivity.
- Local Development & Testing: Download a site to analyze its structure, extract assets, or perform local testing without hitting the live server.
- Data Extraction: Useful for extracting specific data or content types from a website for analysis or redistribution (within legal limits).
BEST PRACTICES
- Start with a small recursion depth (e.g., -r1 or -r2) to understand how the site is mirrored before attempting a full download.
- Always check the website's robots.txt file (e.g., http://example.com/robots.txt) to understand what parts of the site are disallowed for automated crawlers.
- Consider using filters (+<pattern>, -<pattern>) to include or exclude specific file types or URLs, reducing download size and time.
- Use the --update option to refresh an existing mirror rather than redownloading the entire site; a combined example follows this list.
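These practices can be combined in a single invocation. The following sketch (URL, output path, and filter patterns are placeholders) keeps the recursion shallow, uses a + filter to allow the docs area and a - filter to skip ZIP archives, and can later be refreshed by re-running it with --update:
    httrack "https://example.com/docs/" -O ./docs-mirror -r2 "+*.example.com/docs/*" "-*.zip" -v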
HISTORY
HTTrack was originally developed by Xavier Roche and first released in 1998. It quickly gained popularity as a robust and reliable tool for website mirroring. Over the years, it has been continually updated and improved, maintaining its status as one of the leading open-source solutions for offline browsing and web archiving. Its development has focused on robustness, flexibility, and adherence to web standards.