LinuxCommandLibrary

wget

Download files from the web

TLDR

Download the contents of a URL to a file (named "foo" in this case)

$ wget [https://example.com/foo]

Download the contents of a URL to a file (named "bar" in this case)
$ wget [[-O|--output-document]] [bar] [https://example.com/foo]

Download a single web page and all its resources (scripts, stylesheets, images, etc.) with 3-second intervals between requests
$ wget [[-pkw|--page-requisites --convert-links --wait]] 3 [https://example.com/some_page.html]

Download all listed files within a directory and its sub-directories (does not download embedded page elements)
$ wget [[-mnp|--mirror --no-parent]] [https://example.com/some_path/]

Limit the download speed and the number of connection retries
$ wget --limit-rate [300k] [[-t|--tries]] [100] [https://example.com/some_path/]

Download a file from an HTTP server using Basic Auth (also works for FTP)
$ wget --user [username] --password [password] [https://example.com]

Continue an incomplete download
$ wget [[-c|--continue]] [https://example.com]

Download all URLs stored in a text file to a specific directory
$ wget [[-P|--directory-prefix]] [path/to/directory] [[-i|--input-file]] [path/to/URLs.txt]

SYNOPSIS

wget [OPTION]... [URL]...

PARAMETERS

-b, --background
    Go to background immediately after startup. If no log file is specified via -o, output is redirected to 'wget-log'.
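    For example (the URL is a placeholder), to start a large download in the background:
    $ wget -b [https://example.com/large-file.iso]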

-c, --continue
    Resume getting a partially-downloaded file. This is useful when a download is interrupted.

-nc, --no-clobber
    Do not overwrite existing files. If a file already exists, wget skips the download entirely; without this option, repeated downloads save new copies with '.N' appended to the name.
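    For example (placeholder URL), re-running the same download with -nc leaves an existing local copy untouched:
    $ wget -nc [https://example.com/archive.zip]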

-O, --output-document=FILE
    Write documents to FILE. For multiple URLs, files are concatenated into FILE.
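    A common idiom (shown with a placeholder URL) is passing '-' as FILE to write to standard output, usually combined with -q for use in pipelines:
    $ wget -qO- [https://example.com/data.json]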

-P, --directory-prefix=PREFIX
    Set directory PREFIX as the prefix for all files and directories.

-r, --recursive
    Turn on recursive retrieving. This option allows wget to download an entire website.

-l, --level=LEVEL
    Specify recursion maximum depth LEVEL. Use with -r.

-np, --no-parent
    Do not ascend to the parent directory when retrieving recursively. Useful to stay within a specific directory structure.
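    For example, a sketch (placeholder URL) limiting a recursive download to two levels without ascending past the starting directory:
    $ wget -r -l 2 -np [https://example.com/docs/]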

-A, --accept=LIST
    Comma-separated list of file suffixes or patterns to accept for download.

-R, --reject=LIST
    Comma-separated list of file suffixes or patterns to reject for download.
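    For example, sketches (placeholder URL and patterns) that recursively fetch only PDF files, or skip zip archives:
    $ wget -r -np -A '*.pdf' [https://example.com/papers/]
    $ wget -r -np -R '*.zip' [https://example.com/files/]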

-U, --user-agent=AGENT-STRING
    Identify as AGENT-STRING to the HTTP server. Useful for bypassing user-agent restrictions.
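    For example (the agent string and URL are illustrative), identifying as a common browser:
    $ wget -U 'Mozilla/5.0 (X11; Linux x86_64)' [https://example.com/page.html]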

-q, --quiet
    Turn off wget's output entirely, including error messages. Use -nv (--no-verbose) instead if errors and basic information should still be printed.

-v, --verbose
    Turn on verbose output, with all available data. This is the default behavior.

-i, --input-file=FILE
    Read URLs from FILE. One URL per line.

-k, --convert-links
    Convert links in documents to make them suitable for local viewing. Useful for mirroring websites.

--limit-rate=AMOUNT
    Limit the download speed to AMOUNT bytes per second. AMOUNT may use a 'k' (kilobytes) or 'm' (megabytes) suffix, e.g. 300k. Useful for managing network bandwidth.

-t, --tries=NUMBER
    Set number of retries to NUMBER. 0 means infinite retries.

-T, --timeout=SECONDS
    Set the network timeout (DNS lookup, connect, and read timeouts) to SECONDS. Useful for slow or unreliable connections.
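    For example (placeholder URL and illustrative values), combining a 30-second timeout with 5 retries for a flaky connection:
    $ wget -T 30 -t 5 [https://example.com/big-file.tar.gz]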

-N, --timestamping
    Turn on timestamping. Files are downloaded only if they are newer than the local copy, based on timestamps.
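    For example (placeholder URL), fetching a file only if the remote copy is newer than the local one:
    $ wget -N [https://example.com/data/latest.csv]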

DESCRIPTION

wget is a free utility for non-interactive download of files from the web. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. It is designed for robustness over slow or unstable connections: if a download fails partway through, wget keeps retrying until the whole file has been retrieved, and with --continue it can resume a partially downloaded file rather than starting over. This makes it well suited to fetching large files, mirroring websites, and automating downloads within scripts. Because it is non-interactive, it can keep working in the background while the user is logged out, which makes it a natural fit for server-side operations and cron jobs. wget can recursively download entire websites, convert links for offline viewing, and handle several authentication schemes, making it versatile for a wide range of data-retrieval tasks.
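
As a sketch of the mirroring workflow described above (the URL is a placeholder), the following produces an offline-browsable copy of a site section:

$ wget --mirror --convert-links --page-requisites --no-parent [https://example.com/docs/]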

CAVEATS

While wget is robust, certain considerations apply:

  • Security Risks: Care must be taken when downloading files from unknown sources, as they may contain malicious content. Recursive downloads from untrusted sites can also expose your system to risks.
  • Rate Limiting: Aggressive or unthrottled recursive downloads can trigger server-side rate limits or even IP bans, especially on public websites (see the throttled sketch after this list).
  • Website Structure Changes: When mirroring websites, dynamic content or changes in website structure can break local copies or lead to incomplete mirrors.
  • JavaScript/Dynamic Content: wget does not interpret JavaScript or execute client-side scripts, so it cannot fully download sites that heavily rely on dynamic content loaded via JavaScript.
  • Authentication: While it supports basic HTTP and FTP authentication, more complex authentication methods (e.g., OAuth, multi-factor authentication) are not directly supported.
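
A throttled recursive sketch (placeholder URL; the 2-second wait and 200k rate cap are illustrative values) that reduces the risk of tripping server-side rate limits:

$ wget --wait 2 --random-wait --limit-rate 200k -r -np [https://example.com/some_path/]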

COMMON USE CASES

wget is frequently used for:

  • Automated Downloads: Fetching software updates, datasets, or daily reports via cron jobs (see the crontab sketch after this list).
  • Website Mirroring: Creating local copies of websites for offline browsing or backup purposes.
  • File Recovery: Resuming large file downloads after network interruptions.
  • Scripting: Integrating into shell scripts for various data fetching tasks.
  • Benchmarking: Simple network performance testing by downloading a file and measuring time.
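
As a sketch of the cron-driven use case above (paths and URL are placeholders), a crontab entry that quietly fetches a daily report at 03:00 into a target directory:

0 3 * * * wget -q -P [/data/reports] [https://example.com/daily-report.csv]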

HISTORY

wget, originally named 'Geturl', was first released in 1996 by Hrvoje Niksic. It was designed to be a command-line tool that could download files reliably and non-interactively, particularly useful for slow or unstable internet connections common in the mid-90s. Its ability to resume interrupted downloads and perform recursive website mirroring quickly made it a staple in the Linux and Unix communities. It's written in C and is licensed under the GNU General Public License, contributing to its widespread adoption and inclusion in virtually all Linux distributions.

SEE ALSO

curl(1), rsync(1), scp(1), aria2c(1)
