enca
Detect file character encoding
TLDR
Detect file(s) encoding according to the system's locale:
  enca path/to/file
Detect file(s) encoding, specifying a language in the POSIX/C locale format (e.g. zh_CN, en_US):
  enca -L language path/to/file
Convert file(s) to a specific encoding:
  enca -L language -x to_encoding path/to/file
Create a copy of an existing file using a different encoding:
  enca -L language -x to_encoding < path/to/original_file > path/to/new_copy
SYNOPSIS
enca [options] [files...]
enca -L language [-x encoding] [files...]
PARAMETERS
-h, --help
Display help and exit
-V, --version
Output version information and exit
-L lang, --language=lang
Assume the text is in language lang (e.g. "czech", "pl"); strongly recommended for accurate detection
-l type, --list=type
List known objects of the given type (e.g. languages, charsets, converters, surfaces)
-s scheme, --scheme=scheme
Select guessing scheme (e.g., "status", "slim")
-S, --list-schemes
List available guessing schemes
-g, --guess
Guess if all files share encoding (experimental)
--no-guessing
Disable automatic guessing
-i, --ignore-binary
Ignore binary-looking files
-u, --unicode
Assume Unicode input
-X prog, --ignore-prog=prog
Use program prog to decide ignorable files
-x enc, --convert-to=enc
Convert files to encoding enc (e.g., "UTF-8")
-v, --verbose
Increase verbosity
-C dir, --cpath=dir
Set path to charset conversion tables
--ignore-garbage
Ignore invalid sequences during conversion
--colour, --color
Use colors in output
--dump-model
Dump language model for debugging
DESCRIPTION
Enca ("Extremely Naive Charset Analyser") is a command-line utility for detecting the character encoding of natural-language text. It uses letter-frequency statistics, language models, and heuristics to identify encodings such as ISO 8859 variants, Windows codepages, KOI8, and several multibyte encodings, for its set of supported languages (mostly Central and East European ones). Beyond detection, enca can convert files to a specified encoding, which makes it useful for processing legacy data, multilingual scripts, or mixed-encoding directories.
Typical usage specifies the language with -L, since the encoding guess relies on language-specific character statistics. Enca processes files sequentially, printing the detected encoding (e.g. "ISO-8859-2") or, with -x, converting the named files in place; standard input is converted to standard output. Binary-looking files are typically ignored, and batch operation over many files is supported. A companion tool, enconv, behaves like enca with -x preset to the encoding of the current locale.
Enca is most accurate on longer texts and may falter on very short or ambiguous input. It is lightweight and portable, and it integrates well into pipelines with tools such as iconv. Developed for Unix-like systems, it is particularly useful in data migration, web-scraping cleanup, or email-archive handling, where encoding mismatches produce mojibake.
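One way to combine the two tools is to let enca name the charset and let iconv do the conversion. The sketch below assumes enca's -i/--iconv-name option, which prints the detected charset in a form iconv accepts; treat the exact output format as an assumption to verify against your enca version.

```shell
# Minimal sketch: detect with enca, convert with iconv.
# Assumes `enca -i` prints an iconv-compatible charset name.
detect_and_convert() {
  lang=$1; f=$2
  src=$(enca -L "$lang" -i "$f")   # e.g. "ISO-8859-2"
  iconv -f "$src" -t UTF-8 "$f"
}
# usage: detect_and_convert polish file.txt > file.utf8.txt
```

Unlike enca -x, this leaves the original file untouched and writes the converted text to standard output.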
CAVEATS
Accuracy drops for very short texts or files mixing several encodings. Specify the language with -L whenever possible. Detection only works for the supported languages. Conversion is lossy when the target encoding cannot represent every character of the source.
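The data-loss caveat can be demonstrated with iconv alone: a character missing from the target charset either aborts the conversion or, with iconv -c, is silently dropped.

```shell
# The euro sign (U+20AC) does not exist in ISO-8859-2:
printf '\342\202\254' > euro.txt               # "EUR sign" encoded as UTF-8
iconv -f UTF-8 -t ISO-8859-2 euro.txt 2>/dev/null \
  || echo "not representable"                  # plain iconv fails
iconv -c -f UTF-8 -t ISO-8859-2 euro.txt       # -c drops the character silently
rm -f euro.txt
```

The same trade-off applies to any converter enca drives: verify round-trips before converting in place.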
EXAMPLES
enca -L none file.txt
Detect encoding without assuming any particular language.
enca -L polish -x UTF-8 *.txt
Convert all Polish text files to UTF-8.
enconv -L czech file.txt
Convert a Czech file to the encoding of the current locale.
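For batch migration it can help to skip files that are already in the target encoding. The guard below is a sketch, assuming enca's default human-readable output contains the string "UTF-8" when a file is already UTF-8; check the exact wording on your version.

```shell
# Sketch: convert files to UTF-8 unless enca already reports them as UTF-8.
to_utf8() {
  lang=$1; shift
  for f in "$@"; do
    case $(enca -L "$lang" "$f") in
      *UTF-8*) ;;                          # already UTF-8: leave untouched
      *) enca -L "$lang" -x UTF-8 "$f" ;;  # convert in place
    esac
  done
}
# usage: to_utf8 polish *.txt
```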
ENCONV
Companion converter: enconv [options] [files...] behaves like enca with -x preset to the encoding of the current locale.
HISTORY
Developed by David Nečas (Yeti). Maintained sporadically; the latest stable release is 1.19. Implemented in C for speed and portability across Unix-like systems.