chardet

Detect character encoding of a file

PARAMETERS

--help
Display help message and exit

--version
Print program version and exit

DESCRIPTION

chardetect is a command-line utility from the Python chardet library, used to identify the character encoding of files or standard input. It employs statistical analysis of byte distributions to guess encodings like UTF-8, ISO-8859-1, or GB2312, outputting the detected encoding and a confidence score (0.0 to 1.0).

It is well suited to files whose encoding is unknown, such as scraped web content, emails, or legacy data. Run it on one or more files: chardetect file1.txt file2.txt. Each file is reported on its own line in the form filename: encoding with confidence 0.99. Without arguments it reads standard input, which is useful in pipes such as cat file | chardetect.
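Because each result is a single line in the format described above, it can be parsed in a script. The following is a minimal sketch, not part of chardet itself: the names Detection and parse_line are hypothetical, and the "no result" form handled for undetectable input is an assumption about the tool's output that the text above does not state.

```python
import re
from typing import NamedTuple, Optional


class Detection(NamedTuple):
    """One parsed line of chardetect output (hypothetical helper type)."""
    filename: str
    encoding: Optional[str]
    confidence: Optional[float]


# Greedy ".*" keeps any ": " inside the file name attached to the name.
_RESULT = re.compile(
    r"^(?P<name>.*): (?P<enc>\S+) with confidence (?P<conf>[0-9.]+)$"
)
# Assumed form printed when no encoding could be guessed.
_NO_RESULT = re.compile(r"^(?P<name>.*): no result$")


def parse_line(line: str) -> Detection:
    """Split 'name: encoding with confidence X' into its three parts."""
    line = line.rstrip("\n")
    match = _RESULT.match(line)
    if match:
        return Detection(match["name"], match["enc"], float(match["conf"]))
    match = _NO_RESULT.match(line)
    if match:
        return Detection(match["name"], None, None)
    raise ValueError(f"unrecognized chardetect output: {line!r}")
```

A script could feed the lines captured from chardetect through parse_line and then filter on the confidence field.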

The tool handles multilingual text, but detection is heuristic, so results are not guaranteed. The confidence score helps assess reliability: a low score suggests the input is ambiguous. chardetect is commonly used in data pipelines, ETL processes, and scripts that must handle text in varying encodings.
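One way to act on the confidence score is to trust the guess only above a threshold and otherwise fall back to a default. The sketch below assumes a result dictionary with "encoding" and "confidence" keys, the shape returned by the library's chardet.detect function. The function name decode_with_guess, the 0.8 threshold, and the UTF-8 fallback are illustrative choices, not recommendations from chardet.

```python
from typing import Any, Mapping


def decode_with_guess(
    data: bytes,
    guess: Mapping[str, Any],
    min_confidence: float = 0.8,  # illustrative threshold, tune per dataset
    fallback: str = "utf-8",      # illustrative default encoding
) -> str:
    """Decode data with the guessed encoding if the guess is confident enough."""
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    if isinstance(encoding, str) and float(confidence) >= min_confidence:
        try:
            return data.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes invalid for it: use the fallback.
            pass
    # Low confidence or failed decode: never raise, mark bad bytes instead.
    return data.decode(fallback, errors="replace")
```

With the chardet package installed, a call might look like decode_with_guess(raw, chardet.detect(raw)). Replacing undecodable bytes keeps a pipeline running, at the cost of silently losing information, so logging low-confidence inputs is worth considering.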

CAVEATS

Heuristic detection may err on short texts, mixed encodings, or binary data, so check the confidence score before trusting a result. The tool is not suitable for non-text files.

EXAMPLE USAGE

chardetect document.txt
document.txt: utf-8 with confidence 0.99

curl -s http://example.com | chardetect
<stdin>: iso-8859-1 with confidence 0.73

INSTALLATION

Debian/Ubuntu: apt install python3-chardet
Fedora: dnf install python3-chardet
Provides /usr/bin/chardetect.

HISTORY

chardet was developed by Mark Pilgrim in 2002 as part of the Python FeedParser project and became a standalone library around 2010. The chardetect CLI tool is bundled with it. The library is now at version 5.x, is Python 3 compatible, and is actively maintained on GitHub.