enca

Detect file character encoding

TLDR

Detect the encoding of file(s) according to the system's locale

$ enca [path/to/file1 path/to/file2 ...]

Detect the encoding of file(s), specifying a language in the POSIX/C locale format (e.g. zh_CN, en_US)
$ enca [[-L|--language]] [language] [path/to/file1 path/to/file2 ...]

Convert file(s) to a specific encoding
$ enca [[-L|--language]] [language] [[-x|--convert-to]] [to_encoding] [path/to/file1 path/to/file2 ...]

Create a copy of an existing file using a different encoding
$ enca < [original_file] [[-L|--language]] [language] [[-x|--convert-to]] [to_encoding] > [new_file]

SYNOPSIS

enca [options] [files...]
enca -L language [-x encoding] [files...]

PARAMETERS

-h, --help
    Display help and exit

-V, --version
    Output version information and exit

-L lang, --language=lang
    Assume text is in language lang (e.g., "czech", "pl"); required for best accuracy

-l type, --list=type
    List built-in data of the given type (e.g., "languages", "charsets", "converters")

-s scheme, --scheme=scheme
    Select guessing scheme (e.g., "status", "slim")

-S, --list-schemes
    List available guessing schemes

-g, --guess
    Guess if all files share encoding (experimental)

--no-guessing
    Disable automatic guessing

-i, --iconv-name
    Print the detected charset name in a form understood by iconv

-u, --unicode
    Assume Unicode input

-X prog, --ignore-prog=prog
    Use program prog to decide ignorable files

-x enc, --convert-to=enc
    Convert files to encoding enc (e.g., "UTF-8")

-v, --verbose
    Increase verbosity

-C dir, --cpath=dir
    Set path to charset conversion tables

--ignore-garbage
    Ignore invalid sequences during conversion

--colour, --color
    Use colors in output

--dump-model
    Dump language model for debugging

DESCRIPTION

Enca ("Extremely Naive Charset Analyser") is a command-line utility for detecting the character encoding of natural-language text. It employs statistical methods, language models, and heuristics to identify encodings such as the ISO 8859 variants, Windows codepages, KOI8, and UTF-8, primarily for Central and Eastern European languages. Beyond detection, enca can convert files to a specified encoding, making it useful for processing legacy data, multilingual scripts, or mixed-encoding directories.
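The statistical idea can be illustrated with a toy detector. This is a minimal sketch, not enca's actual model: it tries each candidate encoding, rejects any that fail a strict decode, and scores the survivors by mojibake signatures such as stray control characters and mid-word lowercase-to-uppercase flips.

```python
import unicodedata

def badness(text):
    """Score mojibake signatures: control characters and mid-word
    lowercase-to-uppercase flips, both rare in clean natural text."""
    score = 0
    prev = ""
    for ch in text:
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            score += 2
        if prev.islower() and ch.isupper():
            score += 1
        prev = ch
    return score

def guess_encoding(data, candidates):
    """Return the candidate encoding whose strict decode looks least garbled."""
    best, best_score = None, None
    for enc in candidates:
        try:
            text = data.decode(enc)  # strict: invalid sequences disqualify
        except UnicodeDecodeError:
            continue
        score = badness(text)
        if best_score is None or score < best_score:
            best, best_score = enc, score
    return best

sample = "žluťoučký kůň".encode("utf-8")
print(guess_encoding(sample, ["utf-8", "iso-8859-2"]))  # utf-8 wins
```

Real enca goes much further, using language-specific character frequencies, which is why the -L option matters so much for accuracy.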

For accurate detection, specify the language with -L, since encoding guesses rely on language-specific models. Files are processed sequentially: enca prints the detected encoding (e.g., "ISO-8859-2") or, with -x, converts them non-interactively. Binary-looking files are typically skipped, and batch operation over many files is supported. A companion program, enconv, behaves like enca but converts files to the current locale's encoding by default.
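A batch run matching the SYNOPSIS forms can be scripted. The sketch below only assembles the argument list shown above and invokes enca when it is actually installed; the file names are hypothetical.

```python
import shutil
import subprocess

def enca_cmd(files, language, to_encoding=None):
    """Build an enca argument list: -L selects the language model and
    -x (if given) requests conversion, mirroring the SYNOPSIS above."""
    cmd = ["enca", "-L", language]
    if to_encoding:
        cmd += ["-x", to_encoding]
    return cmd + list(files)

# Hypothetical file names; run only when enca is on PATH.
if shutil.which("enca"):
    subprocess.run(enca_cmd(["a.txt", "b.txt"], "polish", "UTF-8"), check=False)
```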

Enca excels in accuracy for longer texts but may falter on very short or ambiguous content. It's lightweight, portable, and integrates well into pipelines with tools like iconv. Developed for Unix-like systems, it's particularly useful in data migration, web scraping cleanup, or email archive handling where encoding mismatches cause mojibake.
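At its core, the conversion step that enca or iconv performs is a decode followed by a re-encode; a minimal Python illustration:

```python
def transcode(data, from_enc, to_enc="utf-8"):
    """Decode bytes from one charset and re-encode into another --
    the essence of what `enca -x` or `iconv -f ... -t ...` does."""
    return data.decode(from_enc).encode(to_enc)

latin2 = "kůň".encode("iso-8859-2")
utf8 = transcode(latin2, "iso-8859-2")
assert utf8.decode("utf-8") == "kůň"
```

The hard part, which enca automates, is knowing `from_enc` in the first place.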

CAVEATS

Accuracy drops for short texts (<100 chars) or mixed encodings; always specify language with -L; no support for right-to-left scripts; conversion may lose data if irreversible.
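The data-loss caveat is easy to demonstrate: converting to an encoding that cannot represent the text either fails outright or, with replacement, discards information irreversibly.

```python
text = "žluťoučký kůň"

# A strict conversion to ASCII fails, because the accented
# letters have no representation there.
try:
    text.encode("ascii")
    lossless = True
except UnicodeEncodeError:
    lossless = False

# With replacement the conversion "succeeds", but the original
# text can no longer be recovered from the result.
lossy = text.encode("ascii", errors="replace").decode("ascii")
print(lossless, lossy)
```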

EXAMPLES

enca -L none file.txt
Detect encoding without language guess.

enca -L polish -x UTF-8 *.txt
Convert all Polish text files to UTF-8.

enconv -L czech file.txt
Convert a file to the locale's default encoding, assuming Czech text.

ENCONV

Companion program for conversion: enconv [options] [files...] behaves like enca, but converts files to the locale's preferred encoding by default, as if -x had been given with the locale's charset.

HISTORY

Developed by David Nečas (Yeti) and maintained sporadically; the latest stable release is 1.19. Written in C for speed and portability across Unix-like systems.

SEE ALSO

enconv(1), file(1), iconv(1), recode(1), chardetect(1)
