enca
Detect and convert the character encoding of text files
TLDR
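Each task below is followed by an example invocation; file paths, language, and to_encoding are placeholders.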
Detect file(s) encoding according to the system's locale
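    enca path/to/file1 path/to/file2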
Detect file(s) encoding specifying a language in the POSIX/C locale format (e.g. zh_CN, en_US)
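    enca -L language path/to/file1 path/to/file2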
Convert file(s) to a specific encoding
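    enca -L language -x to_encoding path/to/file1 path/to/file2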
Create a copy of an existing file using a different encoding
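    enca -L language -x to_encoding < path/to/original_file > path/to/new_copy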
SYNOPSIS
enca [OPTIONS] [FILE...]
enca -L language [OPTIONS] [FILE...]
enca -x encoding [OPTIONS] [FILE...]
PARAMETERS
-L language
Specify the language of the input file, which lets enca apply language-specific statistics for more accurate detection. Examples include cs (Czech), sk (Slovak), ru (Russian), pl (Polish), and uk (Ukrainian).
-x encoding
Convert to the specified encoding, e.g. utf-8, iso-8859-2, or windows-1250. When file names are given on the command line, they are converted in place (the originals are overwritten); when reading from standard input, the converted text is written to standard output.
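For example (paths are placeholders), the first command converts a file in place, while the second leaves the original untouched and writes a converted copy:
    enca -L pl -x utf-8 path/to/file
    enca -L pl -x utf-8 < path/to/original_file > path/to/new_file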
-c
Auto-convert: convert files to the encoding preferred for the current locale, or for the language given with -L. Invoking enca under the name enconv implies this option.
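For instance (the path is a placeholder), this converts a Czech text file to the charset preferred for Czech:
    enca -c -L cs path/to/file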
-d
Display detailed information about the detection process, including statistical data.
-m
Print the preferred MIME name of the detected charset instead of enca's canonical name.
-t
Convert to Unicode (UTF-8 by default). This is a shortcut for -x utf-8.
-g
Guess encoding only. Output only the guessed encoding name, without any other information or conversion.
--help
Display a help message with available options and exit.
--version
Display version information about enca and exit.
DESCRIPTION
enca is a command-line tool for guessing and converting the character encodings of text files. It determines the encoding of a file through statistical analysis and heuristics, examining byte patterns, character frequencies, and common language constructs. Once an encoding is detected, enca can report it or convert the file to a different specified encoding (e.g., UTF-8). It is particularly strong at detecting Central and East European and Cyrillic encodings, and also handles common ones such as UTF-8, UTF-16, ISO-8859 variants, and Windows code pages. This makes it useful for managing text files from varied sources or legacy systems, especially when dealing with multilingual content.
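As an illustrative workflow (the glob and language code are placeholders, assuming Polish text), one might first inspect the detected encodings and only convert once the results look plausible:
    enca -L pl *.txt
    enca -L pl -x utf-8 *.txt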
CAVEATS
Due to its heuristic nature, enca's detection is not always accurate, especially for short files or files with ambiguous character sets. It can misidentify encodings; naming the language with the -L option can significantly improve accuracy. Because -x converts named files in place, it is highly recommended to back up important files beforehand, as incorrect detection or conversion can lead to data corruption.
LANGUAGE-SPECIFIC DETECTION
enca performs significantly better when it knows the language of the text. The -L language option (e.g., -L cs for Czech) lets enca apply language-specific patterns and statistics, which is particularly valuable for East European languages and languages written in Cyrillic, where several incompatible character sets are in common use.
IN-PLACE CONVERSION CAUTION
While converting files in place with -x is convenient, users should exercise extreme caution: incorrect detection followed by in-place conversion can permanently corrupt the original file. It is best practice to test conversions on copies of files, or to ensure you have a backup, before converting in place.
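A minimal defensive pattern, using placeholder paths: copy the file first, then convert in place:
    cp path/to/file path/to/file.bak
    enca -L cs -x utf-8 path/to/file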
HISTORY
enca was developed by David Nečas (Yeti) to address the need for robust character-set detection and conversion, particularly for encodings beyond the common ones. Its development focused on a heuristic approach that can infer an encoding even when explicit byte order marks (BOMs) or other metadata are missing. It fills a niche for users dealing with text files from diverse systems, especially those using non-Western European or legacy character sets, by providing a practical tool for charset interoperability.