cewl

Spider websites to generate wordlists

TLDR

Create a wordlist file from the given URL, following links up to 2 levels deep

$ cewl [[-d|--depth]] 2 [[-w|--write]] [path/to/wordlist.txt] [url]

Output a wordlist from the given URL that also accepts words containing numbers, keeping only words of at least 5 characters

$ cewl --with-numbers [[-m|--min_word_length]] 5 [url]

Output a wordlist from the given URL in debug mode including email addresses

$ cewl --debug [[-e|--email]] [url]

Output a wordlist from the given URL using HTTP Basic or Digest authentication

$ cewl --auth_type [basic|digest] --auth_user [username] --auth_pass [password] [url]

Output a wordlist from the given URL through a proxy

$ cewl --proxy_host [host] --proxy_port [port] [url]

OPTIONS

-d, --depth <num>
    Specifies the depth to crawl beyond the initial URL. Default is 2.

-m, --min_word_length <num>
    Sets the minimum length a word must have to be included in the list. Default is 3.

-o, --offsite
    Allows cewl to follow offsite links during crawling. By default, it stays on the initial domain.

-w, --write <file>
    Writes the generated word list to the specified file instead of standard output.

-u, --ua <agent>
    Sets the User-Agent string to be used for HTTP requests, useful for bypassing some WAFs or for stealth.

-H, --header <name:value>
    Adds an HTTP header to each request, given in name:value format (e.g., an Authorization header carrying Basic credentials or a Bearer token). Can be passed multiple times.

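--data <string>
    Sends POST data with requests, useful for crawling sites that require form submissions. This option is not listed in the --help output of CeWL 5.x, so confirm that your installed version accepts it.

--no-elements
    Excludes words found within HTML element tags (e.g., <div>, <span>). This option is likewise absent from the CeWL 5.x --help output; check your installed version before relying on it.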

--lowercase
    Converts all extracted words to lowercase.

-c, --count
    Displays the count of each unique word found, useful for statistical analysis.

--exclude <file>
    Provides a file containing a list of paths that the spider should not follow.

-a, --meta
    Extracts words from document metadata (e.g., PDF, DOC files) found on the site.

-e, --email
    Extracts and lists email addresses found on the target website.

--proxy_host <host>, --proxy_port <port>
    Routes HTTP requests through the specified proxy server. The port defaults to 8080.

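--ssl-no-verify
    Disables SSL certificate verification, useful for sites with self-signed certificates or for local proxying. This option is not listed in the --help output of CeWL 5.x, so confirm that your installed version accepts it.

--timeout <num>
    Sets the connection timeout in seconds. This option is likewise absent from the CeWL 5.x --help output; check your installed version before relying on it.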
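
The per-word counts produced by --count can be used to trim a wordlist to the terms a site uses most often. The following is a minimal sketch, assuming each output line has the form "word, 12"; that separator is an assumption about the output format, and the function name filter_by_count and the sample data are hypothetical. Adjust the field separator if your version prints something different.

```shell
#!/usr/bin/env bash
# Keep only words that occur at least MIN times in "word, count" lines on stdin.
# Assumption: the count follows the word, separated by a comma and a space.
filter_by_count() {
    local min="$1"
    awk -F', ' -v min="$min" 'NF == 2 && $2 + 0 >= min { print $1 }'
}

# Hypothetical sample standing in for the output of a real crawl with --count.
printf '%s\n' 'security, 14' 'contact, 3' 'wordlist, 7' | filter_by_count 5
```

In practice the function would read from a pipe or a saved file, for example the file written with --write during a crawl that also used --count.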

DESCRIPTION

cewl (Custom Word List generator) is a Ruby command-line tool that builds custom wordlists by spidering a target URL. It extracts unique words from the HTML content of the given site and the pages it links to, down to a configurable depth. The resulting wordlists are commonly used in penetration testing, particularly for dictionary attacks against password-protected systems, on the assumption that users choose passwords related to their organization or its website content. It can also collect email addresses and metadata from documents found on the site, which gives a tester additional leads.
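
To make the idea of word extraction concrete, the sketch below approximates it with standard text tools on a single in-memory page. It is an illustration only and not cewl's actual implementation: it does no crawling, handles only ASCII letters, and the function name extract_words and the sample HTML are hypothetical.

```shell
#!/usr/bin/env bash
# Rough approximation of wordlist extraction from one HTML document on stdin:
# strip tags, split on non-letters, drop short words, de-duplicate.
extract_words() {
    local min_len="${1:-3}"                       # minimum word length, default 3
    sed -e 's/<[^>]*>/ /g' |                      # replace HTML tags with spaces
        tr -cs '[:alpha:]' '\n' |                 # put each run of letters on its own line
        awk -v n="$min_len" 'length($0) >= n' |   # keep words of at least n characters
        LC_ALL=C sort -u                          # unique words in byte order
}

# Hypothetical page content standing in for a fetched document.
html='<html><body><h1>Acme Widgets</h1><p>Contact Acme sales at the office.</p></body></html>'
printf '%s\n' "$html" | extract_words 5
```

The minimum length argument plays the same role as --min_word_length; everything else about the real tool, such as following links to the configured depth, is left out of this sketch.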

CAVEATS

Using cewl to gain unauthorized access to systems or data is illegal and unethical. Use it only against systems for which you have explicit permission, or for learning in environments you control. Be mindful of the load you place on target servers: extensive crawling can be mistaken for a denial-of-service attack.

ETHICAL CONSIDERATIONS

Always obtain explicit written permission before using cewl against any third-party website or system. Unauthorized use can lead to legal consequences. It is a powerful reconnaissance tool that can be misused for malicious purposes.

TYPICAL WORKFLOW

A common workflow starts with cewl generating a custom wordlist from a target's public-facing website. The wordlist is then fed into offline password crackers such as John the Ripper or Hashcat to test captured password hashes, or into online login testers such as Hydra to try credentials against services like SSH, FTP, or web logins. The idea is that employees might use easily guessable passwords derived from company-specific terms.
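
The workflow can be sketched as a small cleanup step between the crawl and the cracking tool. This is a minimal sketch under stated assumptions: the function name prepare_wordlist, the file names raw.txt, wordlist.txt, and hashes.txt, and the variable target_url are hypothetical placeholders, and the commands that need cewl or John the Ripper are left as comments so the block runs without either tool installed.

```shell
#!/usr/bin/env bash
# Normalize a raw wordlist on stdin: lowercase, drop blank lines, de-duplicate.
prepare_wordlist() {
    tr '[:upper:]' '[:lower:]' | awk 'NF' | LC_ALL=C sort -u
}

# Example with inline data instead of a live crawl.
printf '%s\n' 'Acme' 'acme' '' 'Widgets' | prepare_wordlist

# With the tools installed and written authorization in place, the full chain
# might look like this (hypothetical file names and target):
#   cewl -d 2 -m 5 -w raw.txt "$target_url"
#   prepare_wordlist < raw.txt > wordlist.txt
#   john --wordlist=wordlist.txt hashes.txt
```

Lowercasing here is a design choice for the sketch; if the cracking tool applies its own case rules, the same effect can come from cewl's --lowercase option or from the cracker's rule set instead.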

HISTORY

cewl was originally developed by Robin Wood (digininja) as a Ruby script. It gained popularity within the penetration testing community for its simple yet effective approach to generating context-specific wordlists. It's often included in penetration testing distributions like Kali Linux, highlighting its utility in the security domain. Its development has focused on enhancing crawling capabilities, parsing options, and network handling.