isutf8
Check if file is valid UTF-8
TLDR
Check whether the specified files contain valid UTF-8
Print errors using multiple lines
Do not print anything to stdout, indicate the result merely with the exit code
Only print the names of the files containing invalid UTF-8
Same as --list but inverted, i.e., only print the names of the files containing valid UTF-8
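The invocations described above might look like the following sketch. The sample files and the `workdir` variable are invented for illustration. `--list` is named above; `--invert` is assumed to be the long name of its inverse, as in the upstream moreutils tool, so confirm it with `isutf8 --help` on your system.

```shell
# Hypothetical sample data: one valid UTF-8 file and one Latin-1 file.
workdir=$(mktemp -d)
printf 'caf\303\251\n' > "$workdir/good.txt"   # "café" as UTF-8 (bytes C3 A9)
printf 'caf\351\n'     > "$workdir/bad.txt"    # "café" as Latin-1 (lone byte E9)

# Check whether the specified files contain valid UTF-8.
isutf8 "$workdir/good.txt" "$workdir/bad.txt" || echo "at least one file is invalid"

# Print errors using multiple lines.
isutf8 --verbose "$workdir/bad.txt" || true

# Print nothing; report the result through the exit code only.
if isutf8 --quiet "$workdir/good.txt"; then
    echo "good.txt is valid"
fi

# Only print the names of the files containing invalid UTF-8.
isutf8 --list "$workdir/good.txt" "$workdir/bad.txt" || true

# Inverse of --list (assumed flag name): only print the names of valid files.
isutf8 --invert "$workdir/good.txt" "$workdir/bad.txt" || true
```

The `|| true` suffixes only keep a script running under `set -e`; they are not part of the isutf8 interface.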
SYNOPSIS
isutf8 [-h|--help] [-q|--quiet] [-l|--list] [-i|--invert] [-v|--verbose] [--version] [file...]
PARAMETERS
-q, --quiet
Suppresses all output to standard output. The result is indicated solely by the exit status (0 for valid UTF-8, 1 for invalid UTF-8).
-v, --verbose
Prints each error in a detailed, multi-line form.
-l, --list
Prints only the names of files that are not valid UTF-8. When standard input is checked and is not valid UTF-8, the name printed is "(standard input)".
-i, --invert
Same as --list but inverted: prints only the names of files that are valid UTF-8.
-h, --help
Displays a brief help message describing the command's usage and options, then exits.
--version
Prints the version information of the isutf8 utility, then exits.
file...
One or more paths to files to be checked for UTF-8 validity. If no files are specified, isutf8 reads from standard input.
DESCRIPTION
isutf8 is a command-line utility that verifies whether an input stream or one or more files contain valid UTF-8 encoded text. It reads the input byte by byte and checks each sequence against the UTF-8 encoding rules. This is particularly useful in scripting and automation, where how a file is processed may depend on its character encoding.
If the input is valid UTF-8, isutf8 exits with a status of 0; otherwise, it exits with 1. This exit status is its primary output for scripting purposes. isutf8 ships as part of the moreutils package, a collection of Unix tools that are not standard but are useful for everyday tasks. Data can be piped into isutf8 on standard input, or filenames can be given as arguments.
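As an illustration of reading from standard input, the sketch below pipes two byte strings into isutf8 and reports the verdict from the exit status. The helper name `describe_stdin` is hypothetical and exists only for this example.

```shell
# Report whether the bytes arriving on standard input are valid UTF-8.
# describe_stdin is a hypothetical helper name used only in this sketch.
describe_stdin() {
    if isutf8 --quiet; then
        echo "valid"
    else
        echo "invalid"
    fi
}

# C3 A9 is the two-byte UTF-8 encoding of "é".
printf 'caf\303\251\n' | describe_stdin

# A lone E9 (Latin-1 "é") opens a multi-byte sequence that is never completed.
printf 'caf\351\n' | describe_stdin
```

Using `--quiet` keeps the error details off standard output, so only the helper's own verdict is printed.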
CAVEATS
isutf8 checks only whether the input's byte sequences are well-formed UTF-8. It does not judge whether the decoded text is meaningful, and it does not convert between character sets; use iconv for that.
It cannot distinguish a corrupted file from one written in another encoding such as Latin-1: both are reported simply as invalid UTF-8. Conversely, plain ASCII always passes, because ASCII is a subset of UTF-8. Since isutf8 must read the entire input, run time grows with file size, which can matter for very large files.
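A minimal sketch of pairing the check with iconv follows. It assumes that any file failing the check is ISO-8859-1 (Latin-1), which isutf8 itself cannot confirm; the temporary file and the `.utf8` suffix are illustrative choices.

```shell
# Hypothetical input: "café" stored as Latin-1, which is not valid UTF-8.
src=$(mktemp)
printf 'caf\351\n' > "$src"

# Assumption: files that fail the check are known to be ISO-8859-1.
# isutf8 only says the bytes are not UTF-8; it cannot name the real encoding.
if ! isutf8 --quiet "$src"; then
    iconv -f ISO-8859-1 -t UTF-8 "$src" > "$src.utf8"
fi
```

Writing to a separate output file avoids truncating the source before iconv has read it.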
EXIT STATUS
isutf8 communicates its results primarily through its exit status, making it highly suitable for use in shell scripts.
0: All provided inputs (files or standard input) are valid UTF-8.
1: At least one input is not valid UTF-8.
2: Indicates a usage error, such as invalid options or arguments.
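One way to branch on these statuses is sketched below. `check_encoding` and `sample` are hypothetical names, and the catch-all branch treats any status other than 0 or 1 as a failure rather than relying on one specific number.

```shell
# check_encoding is a hypothetical wrapper name used only in this sketch.
check_encoding() {
    status=0
    isutf8 --quiet "$1" || status=$?   # capture the status without aborting under set -e
    case "$status" in
        0) echo "$1: valid UTF-8" ;;
        1) echo "$1: invalid UTF-8" ;;
        *) echo "$1: isutf8 failed with status $status" >&2 ;;
    esac
    return "$status"
}

# Usage: plain ASCII is also valid UTF-8.
sample=$(mktemp)
printf 'plain ASCII text\n' > "$sample"
check_encoding "$sample" || echo "needs attention"
```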
TYPICAL USAGE
isutf8 is commonly used in conditional statements within shell scripts to ensure that text files are processed correctly based on their encoding.
Example:
    if isutf8 my_document.txt; then
        echo "my_document.txt is UTF-8."
    else
        echo "my_document.txt is NOT UTF-8."
    fi
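For batch checks, the same idea can be extended to a whole directory. The sketch below is one possible approach, not a prescribed workflow: the `tree` directory and its two files are invented sample data, and `--list` is used so that only file names are collected.

```shell
# Hypothetical project tree with one valid and one invalid text file.
tree=$(mktemp -d)
printf 'ok\n'      > "$tree/a.txt"
printf 'bad\377\n' > "$tree/b.txt"     # FF never occurs in valid UTF-8

# Collect the names of invalid files; --list prints one name per line.
# "|| true" absorbs the non-zero status find returns when isutf8 reports a failure.
invalid=$(find "$tree" -type f -name '*.txt' -exec isutf8 --list {} + || true)

if [ -n "$invalid" ]; then
    echo "Files that are not valid UTF-8:"
    echo "$invalid"
fi
```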
HISTORY
isutf8 is part of the moreutils package, a collection of small Unix utilities that extend the standard set of tools. moreutils was created by Joey Hess, and isutf8 fills a common need within it: checking character encoding validity from shell scripts without more complex solutions or external libraries. Its inclusion in moreutils has made it readily available to many Linux and Unix users.