
linkchecker

Check website links for validity

TLDR

Find broken links on a website
$ linkchecker [https://example.com/]

Also check URLs that point to external domains
$ linkchecker --check-extern [https://example.com/]

Ignore URLs that match a specific regex
$ linkchecker --ignore-url [regex] [https://example.com/]

Output results to a CSV file
$ linkchecker --file-output [csv]/[path/to/file] [https://example.com/]

SYNOPSIS

linkchecker [OPTIONS] FILE-OR-URL...
linkchecker -f <FILE> [OPTIONS] FILE-OR-URL...

PARAMETERS

-r <N>, --recursion-level <N>
    Recursively check URLs found in the specified documents or sites, up to a maximum recursion depth of N (a negative value means unlimited depth).

-F <TYPE>[/<ENCODING>][/<FILENAME>], --file-output <TYPE>[/<ENCODING>][/<FILENAME>]
    Write the results to a file in the given format, in addition to the normal output.

-f <FILE>, --config <FILE>
    Load configuration from a specified file instead of the default.

-o <TYPE>[/<ENCODING>], --output <TYPE>[/<ENCODING>]
    Specify the output format (e.g., text, html, csv, xml, sitemap) and an optional output encoding.

--check-extern
    Also check links that point to external websites.

--user-agent <AGENT>
    Set a custom User-Agent string for HTTP requests.

--timeout <SECONDS>
    Set the network timeout for connections in seconds.

--ignore-url <REGEX>
    Ignore URLs that match the given regular expression when checking.

--no-follow-url <REGEX>
    Check, but do not recurse into, URLs that match the given regular expression.

-v, --verbose
    Increase the verbosity of the output, showing more details about the checking process.

-q, --quiet
    Decrease the verbosity of the output, showing only essential information.
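
The options above can be combined in a single run; for example (the User-Agent string and regex are illustrative, and option spellings can vary slightly between versions):

$ linkchecker -r 2 -o csv --user-agent "MyCrawler/1.0" --ignore-url "\.pdf$" https://example.com/

This limits recursion to two levels, prints the report as CSV, identifies the crawler with a custom User-Agent, and skips links to PDF files.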

DESCRIPTION

linkchecker is a command-line tool for validating links within web documents or entire websites. It crawls a given URL or local file and reports link problems such as broken links (404 Not Found), server errors (5xx), malformed URLs, and unresponsive servers.

It checks links across multiple protocols, including HTTP, HTTPS, FTP, mailto, and local file paths. linkchecker can recurse through entire domains, respect robots.txt directives, follow redirects, and optionally check external links. It supports custom user agents, timeouts, and ignoring or excluding specific URLs with regular expressions, and results can be exported as plain text, HTML, CSV, XML, or a sitemap, making it useful for webmasters, developers, and QA engineers who need to verify the integrity and usability of web content.
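
Local documents can be checked the same way as remote sites; for instance (the path is illustrative):

$ linkchecker path/to/index.html
$ linkchecker --check-extern path/to/index.html

The second form also verifies any external links referenced by the local document.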

CAVEATS

linkchecker is highly dependent on network conditions and the responsiveness of target servers, which can lead to slow execution times for large-scale checks or rate-limiting issues.

It primarily checks for server-side link validity and may not fully evaluate links generated or modified dynamically by client-side JavaScript in modern single-page applications. Complex authentication or highly dynamic content might require custom configurations or may not be fully supported. Users should be mindful of the load placed on target servers during extensive recursive checks.
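
To keep the load on a target server modest, a check can be throttled by limiting recursion depth and parallel connections, for example (values are illustrative, assuming the -t/--threads option of recent versions):

$ linkchecker -r 2 -t 2 --timeout 60 https://example.com/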

CONFIGURATION FILES

For complex projects, linkchecker supports detailed configuration via a linkcheckerrc file (by default looked up in the user's configuration directory, e.g. ~/.linkchecker/ or $XDG_CONFIG_HOME/linkchecker/, or passed explicitly with -f/--config). These files allow persistent settings for ignored URLs, timeouts, authentication credentials, and other parameters, streamlining repeated checks and integration into CI/CD pipelines.
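
A minimal sketch of such a file, assuming commonly used linkcheckerrc sections and options (exact names may differ between versions; consult the linkcheckerrc documentation for the authoritative list):

[checking]
threads=4
timeout=30

[filtering]
checkextern=1
ignore=
  \.pdf$
  ^mailto:

[output]
verbose=1

With a file like this in place, repeated runs only need the target URL, or -f/--config to point at a project-specific file.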

EXIT CODES

linkchecker provides meaningful exit codes that are crucial for scripting and automation. An exit code of 0 indicates that no broken links were found, 1 signifies that invalid links (or, depending on configuration, warnings) were detected, and 2 indicates a program error, allowing automated systems to react accordingly and trigger alerts or further actions.
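
In a shell script or CI job, this makes it easy to fail the build or emit a notification when problems are found, for example:

$ linkchecker -q https://example.com/ || echo "link check failed with exit code $?"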

HISTORY

linkchecker was originally written by Bastian Kleineidam and first released publicly around the year 2000. Written in Python, it has evolved into a community-maintained open-source project hosted on GitHub. Its development has focused on robust link validation for web documents and websites, making it a staple for web integrity checks for over two decades.

SEE ALSO

wget(1), curl(1), lynx(1), httrack(1)
