katana
Fast, Go-based web crawler for security reconnaissance and information gathering
TLDR
Crawl a list of URLs
Crawl a URL in headless mode using Chromium
Pass requests through a proxy (http/socks5) and use custom headers from a file
Specify the crawling strategy, depth of subdirectories to crawl, and rate limiting (requests per second)
Find subdomains using subfinder, crawl each for a maximum number of seconds, and write results to an output file
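The commands below illustrate the use cases above, in the same order (a sketch only: the -l and -strategy spellings, the headers.txt file, and the subfinder flags are assumptions based on katana's and subfinder's own help output and may differ between versions):
katana -l urls.txt
katana -u https://example.com -headless
katana -u https://example.com -proxy http://127.0.0.1:8080 -H headers.txt
katana -u https://example.com -strategy breadth-first -depth 3 -rate-limit 10
subfinder -d example.com -silent | katana -crawl-duration 60s -o results.txt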
SYNOPSIS
katana [OPTIONS] -u <TARGET_URL>
katana [OPTIONS] -l <URL_LIST_FILE>
PARAMETERS
-u <target_url>
Specifies the target URL to start crawling from.
-l <list_file>
Provides a file containing a list of URLs to crawl.
-rl <requests_per_second>
Short form of -rate-limit; limits the maximum number of requests sent per second.
-o <output_file>
Writes the crawling results to the specified output file.
-json
Outputs the results in JSON format.
-depth <level>
Sets the maximum crawling depth (default: 3).
-headless
Enables headless browser support for JavaScript-rendered content. Requires Chrome/Chromium installation.
-js-crawl
Parses JavaScript files to discover additional endpoints to crawl.
-c <concurrency>
Sets the maximum number of concurrent requests.
-timeout <seconds>
Sets the maximum timeout for HTTP requests in seconds.
-exclude-cdn
Excludes URLs belonging to known CDN providers.
-passive
Enables passive source discovery (e.g., CommonCrawl, the Wayback Machine).
-proxy <proxy_url>
Uses an HTTP/SOCKS proxy for all requests (e.g., http://127.0.0.1:8080).
-H <header>
Adds a custom HTTP header to requests (e.g., 'User-Agent: Katana Crawler'); the value can also be a file containing headers in header:value format.
-v
Enables verbose output, showing more details during crawling.
-version
Displays the current version of Katana.
-config <file>
Specifies a custom configuration file for Katana.
-field-config <file>
Specifies a custom field configuration file for data extraction.
-crawl-duration <duration>
Sets the maximum duration for crawling (e.g., '5m' for 5 minutes).
-rate-limit <requests_per_second>
Limits the maximum number of requests sent per second; use -rate-limit-minute for a per-minute limit.
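As a rough illustration, several of the options above can be combined in one run (values are placeholders, and newer katana releases may spell the JSON flag -jsonl):
katana -u https://example.com -depth 3 -c 10 -rate-limit 50 -timeout 10 -H 'User-Agent: Katana Crawler' -json -o results.json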
DESCRIPTION
Katana is a powerful, Go-based web crawling and spidering framework designed for speed and extensibility. Developed by ProjectDiscovery, it is widely used in reconnaissance, bug bounty hunting, and penetration testing workflows. Katana excels at efficiently discovering URLs, forms, and other interactive elements within a target scope. It supports various input methods, including single URLs or lists of URLs, and can output results in formats such as plain text and JSON. A key feature is its ability to perform JavaScript-based crawling using a headless browser, allowing it to navigate modern web applications that rely heavily on client-side rendering. This makes it highly effective at uncovering hidden endpoints and functionality that traditional static crawlers might miss. Its modular design allows it to be integrated seamlessly into larger automation pipelines.
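For example, JSON output can be post-processed with standard tooling in an automation pipeline; this sketch assumes jq is installed and that each record exposes the crawled URL under .request.endpoint, which may vary between katana versions:
katana -u https://example.com -headless -json | jq -r '.request.endpoint' | sort -u > endpoints.txt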
CAVEATS
Katana's headless browser functionality requires additional system resources (memory and CPU), especially for complex web applications or high concurrency. Users must ensure they have explicit permission before crawling any target website to avoid legal issues or service disruptions. Excessive crawling can also lead to IP blocking or server overload. Proper scope definition is crucial to prevent unintended crawling of out-of-scope assets.
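A conservative invocation that bounds request rate, concurrency, depth, and runtime, using only the flags documented above (values are placeholders to be tuned to the rules of the engagement):
katana -u https://example.com -depth 2 -c 5 -rate-limit 5 -crawl-duration 10m -o scoped-crawl.txt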
INTEGRATION WITH PROJECTDISCOVERY TOOLS
Katana is often used in conjunction with other ProjectDiscovery tools for comprehensive reconnaissance and vulnerability analysis. For instance, URLs discovered by Katana can be piped directly to httpx for active probing and status-code checks, or to nuclei for automated vulnerability scanning. This creates a powerful, automated workflow for security assessments.
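For instance (a sketch; the httpx and nuclei flags shown come from those tools' own documentation and may differ by version):
katana -u https://example.com | httpx -silent -status-code -title
katana -u https://example.com | nuclei -silent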
PASSIVE SOURCE DISCOVERY
Beyond active crawling, Katana can also perform passive source discovery using the -passive flag. This allows it to gather URLs from publicly available datasets such as CommonCrawl and the Wayback Machine, significantly expanding coverage without actively hitting the target website.
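For example, to collect endpoints for a target purely from passive sources and write them to a file (a minimal sketch using only the flags documented above):
katana -u example.com -passive -o passive-endpoints.txt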
HISTORY
Katana was developed by ProjectDiscovery, a team known for creating open-source tools for bug bounty hunters and penetration testers. It emerged as a modern alternative to older web crawlers, focusing on speed, efficiency, and the ability to handle contemporary web technologies like JavaScript-heavy applications. Its development leveraged Go's concurrency model for high performance, and it quickly gained popularity within the cybersecurity community due to its integration capabilities with other ProjectDiscovery tools like Nuclei and Httpx.