chardet
Detect character encoding of a file
SYNOPSIS
chardetect [--help] [--version] [file ...]
PARAMETERS
--help
Display help message and exit
--version
Print program version and exit
DESCRIPTION
chardetect is a command-line utility from the Python chardet library, used to identify the character encoding of files or standard input. It employs statistical analysis of byte distributions to guess encodings like UTF-8, ISO-8859-1, or GB2312, outputting the detected encoding and a confidence score (0.0 to 1.0).
It is suited to files with unknown encodings, such as scraped web content, emails, or legacy data. Pass one or more files: chardetect file1.txt file2.txt. Each input produces one line of the form filename: encoding with confidence 0.99. Without file arguments it reads standard input, which is useful in pipes such as cat file | chardetect.
The tool handles multilingual text well, but detection is heuristic, so results are not guaranteed. The confidence score helps assess reliability: a low score suggests an ambiguous result. It is commonly used in data pipelines, ETL processes, and scripts that must handle text in varying encodings.
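When chardetect is called from a script, its output usually has to be turned back into structured values. The following is a minimal sketch that parses one line in the format described above; the names Detection and parse_chardetect_line are hypothetical helpers, not part of chardet, and the regular expression assumes the encoding name contains no spaces.

```python
import re
from typing import NamedTuple, Optional


class Detection(NamedTuple):
    """One parsed chardetect result (hypothetical helper type)."""
    name: str
    encoding: str
    confidence: float


# Greedy ".*" lets the name itself contain ": "; the last occurrence
# that still fits the rest of the pattern is used as the separator.
_LINE = re.compile(
    r"^(?P<name>.*): (?P<encoding>\S+) with confidence "
    r"(?P<confidence>\d+(?:\.\d+)?)$"
)


def parse_chardetect_line(line: str) -> Optional[Detection]:
    """Parse 'name: encoding with confidence N'; return None if no match."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    return Detection(
        match["name"],
        match["encoding"],
        float(match["confidence"]),
    )


if __name__ == "__main__":
    print(parse_chardetect_line("document.txt: utf-8 with confidence 0.99"))
```

Lines that do not follow this format, for example when no encoding could be determined, yield None and can be handled separately by the caller.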
CAVEATS
Heuristic detection can be wrong on short texts, mixed encodings, or binary data; check the confidence score before relying on a result. Not suitable for non-text files.
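One way to act on the confidence score is to trust the detected encoding only above a threshold and otherwise fall back to a lossy default. The sketch below illustrates that idea; the function name decode_with_guess, the 0.8 threshold, and the UTF-8 fallback are assumptions chosen for illustration, not values recommended by chardet.

```python
from typing import Optional


def decode_with_guess(
    data: bytes,
    encoding: Optional[str],
    confidence: float,
    min_confidence: float = 0.8,  # assumed threshold, tune per data set
    fallback: str = "utf-8",      # assumed default encoding
) -> str:
    """Decode data with the detected encoding if the guess looks reliable.

    encoding and confidence stand in for whatever the detector reported.
    """
    if encoding and confidence >= min_confidence:
        try:
            return data.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes invalid in that encoding:
            # fall through to the fallback below.
            pass
    # Low confidence or failed decode: never raise, replace bad bytes.
    return data.decode(fallback, errors="replace")


if __name__ == "__main__":
    raw = "café".encode("latin-1")
    print(decode_with_guess(raw, "iso-8859-1", 0.9))
    print(decode_with_guess(raw, "iso-8859-1", 0.3))
```

With a low confidence the second call ignores the guess and decodes as UTF-8, so the invalid byte becomes the Unicode replacement character instead of raising an error.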
EXAMPLE USAGE
chardetect document.txt
document.txt: utf-8 with confidence 0.99
curl -s http://example.com | chardetect
<stdin>: iso-8859-1 with confidence 0.73
INSTALLATION
Debian/Ubuntu: apt install python3-chardet
Fedora: dnf install python3-chardet
Provides /usr/bin/chardetect.
HISTORY
chardet was developed by Mark Pilgrim in 2002 as part of the Python feedparser project and became a standalone library around 2010. The chardetect CLI tool is bundled with the library, which is now at version 5.x, supports Python 3, and is maintained on GitHub.


