LinuxCommandLibrary

isutf8

Check if file is valid UTF-8

TLDR

Check whether the specified files contain valid UTF-8

$ isutf8 [path/to/file1 path/to/file2 ...]

Print errors using multiple lines
$ isutf8 [[-v|--verbose]] [path/to/file1 path/to/file2 ...]

Do not print anything to stdout; report the result only through the exit code
$ isutf8 [[-q|--quiet]] [path/to/file1 path/to/file2 ...]

Only print the names of the files containing invalid UTF-8
$ isutf8 [[-l|--list]] [path/to/file1 path/to/file2 ...]

Same as --list but inverted, i.e., only print the names of the files containing valid UTF-8
$ isutf8 [[-i|--invert]] [path/to/file1 path/to/file2 ...]

SYNOPSIS

isutf8 [-q|--quiet] [-l|--list] [-i|--invert] [-v|--verbose] [-h|--help] [--version] [file...]

PARAMETERS

-q, --quiet
    Suppresses all output to standard output. The command's success or failure is indicated solely by its exit status (0 for UTF-8, 1 for non-UTF-8).

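-l, --list
    Prints only the names of files that are not valid UTF-8 to standard output. If standard input is checked and is not UTF-8, it prints "(standard input)".

-i, --invert
    Inverts --list: prints only the names of files that are valid UTF-8.

-v, --verbose
    Prints errors using multiple lines, giving more detail about each invalid sequence.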

-h, --help
    Displays a brief help message describing the command's usage and options, then exits.

--version
    Prints the version information of the isutf8 utility, then exits.

file...
    One or more paths to files to be checked for UTF-8 validity. If no files are specified, isutf8 reads from standard input.

DESCRIPTION

isutf8 is a command-line utility that checks whether an input stream or one or more files contain valid UTF-8 encoded text. It reads the input byte by byte and checks each sequence against the UTF-8 encoding rules. The command is particularly useful in scripts and automation where the handling of a file depends on its character encoding.

If the input is valid UTF-8, isutf8 exits with a status of 0; otherwise, it exits with 1. For scripting purposes, this exit status is its primary output. isutf8 is distributed as part of the moreutils package, a collection of Unix tools that are not part of the standard set but are useful for everyday tasks. Input can be piped into isutf8 or given as filename arguments.
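
Because isutf8 reads standard input when no file is given, data in a pipeline can be validated before it is used. The following is a minimal sketch of that idea; `stdin_is_utf8` is a hypothetical helper name chosen for illustration, not something moreutils provides.

```shell
# Hypothetical helper: succeed only if the data on standard input is
# valid UTF-8. With no file argument, isutf8 checks standard input;
# --quiet suppresses output so only the exit status is used.
stdin_is_utf8() {
    isutf8 --quiet
}
```

It could be used as `some_command | stdin_is_utf8 && echo "output is UTF-8"`, where `some_command` stands for any program whose output should be checked.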

CAVEATS

isutf8 checks only that the byte sequences are valid UTF-8; it does not check whether the text is meaningful or restrict which Unicode characters appear. It does not convert between character sets; use iconv for that.

It does not distinguish completely malformed files from files encoded in another character set (e.g., Latin-1); it reports only that they are not valid UTF-8. Because it reads the entire input, checking very large files takes time proportional to their size.
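
A common pattern is to check a file first and hand it to iconv only when the check fails. The sketch below assumes that any file failing the check is encoded in ISO-8859-1 (Latin-1); that is purely an assumption for illustration, since isutf8 cannot identify the actual source encoding. `to_utf8` is a hypothetical helper name.

```shell
# Hypothetical helper: print FILE as UTF-8 on standard output.
# Assumption: input that is not valid UTF-8 is ISO-8859-1 (Latin-1).
# Adjust the -f argument if the real source encoding is different.
to_utf8() {
    if isutf8 --quiet "$1"; then
        # Already valid UTF-8: pass the bytes through unchanged.
        cat "$1"
    else
        # Not valid UTF-8: convert from the assumed legacy encoding.
        iconv -f ISO-8859-1 -t UTF-8 "$1"
    fi
}
```

It could be called as `to_utf8 notes.txt > notes.utf8.txt`, where both file names are placeholders.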

EXIT STATUS

isutf8 communicates its results primarily through its exit status, making it highly suitable for use in shell scripts.
0: All provided inputs (files or standard input) are valid UTF-8.
1: At least one input was found to be not valid UTF-8.
2: Indicates a usage error, such as invalid options or arguments.
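
A script that needs to tell these outcomes apart can branch on the exit status. The following is a minimal sketch based on the codes listed above; `describe_utf8_check` is a hypothetical helper name, and the messages are illustrative.

```shell
# Hypothetical helper: describe the result of checking one file,
# based on the exit status of isutf8.
describe_utf8_check() {
    status=0
    # Capture the exit status without aborting under "set -e".
    isutf8 --quiet "$1" || status=$?
    case $status in
        0) echo "$1: valid UTF-8" ;;
        1) echo "$1: not valid UTF-8" ;;
        *) echo "$1: check could not be completed (exit status $status)" ;;
    esac
}
```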

TYPICAL USAGE

isutf8 is commonly used in conditional statements within shell scripts to ensure that text files are processed correctly based on their encoding.
Example:
if isutf8 my_document.txt; then
    echo "my_document.txt is UTF-8."
else
    echo "my_document.txt is NOT UTF-8."
fi
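
The same test extends to a batch of files. The sketch below loops over the `.txt` files in a directory and reports each one that fails the check; `report_invalid` and the `.txt` filter are assumptions made for illustration. It relies only on the exit status, so it does not depend on the exact format of anything isutf8 prints.

```shell
# Hypothetical helper: check every .txt file in directory DIR.
# Prints one line per file that is not valid UTF-8 and returns
# non-zero if any such file was found.
report_invalid() {
    bad=0
    for f in "$1"/*.txt; do
        # Skip the unexpanded pattern when the directory has no .txt files.
        [ -e "$f" ] || continue
        if ! isutf8 --quiet "$f"; then
            echo "not UTF-8: $f"
            bad=$((bad + 1))
        fi
    done
    [ "$bad" -eq 0 ]
}
```

For a plain list of offending file names without a loop, the --list option shown in the TLDR section may be enough on its own.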

HISTORY

isutf8 is part of the moreutils package, a collection of small Unix utilities that extend the standard set of tools. moreutils was created by Joey Hess. isutf8 fills a common need: checking character-encoding validity from a shell script without a heavier tool or an external library. Being part of moreutils makes it readily available on many Linux and Unix systems.

SEE ALSO

file(1), iconv(1), locale(1), grep(1), recode(1)
