isutf8
Check if file is valid UTF-8
TLDR
Check whether the specified files contain valid UTF-8
Print errors using multiple lines
Do not print anything to stdout, indicate the result merely with the exit code
Only print the names of the files containing invalid UTF-8
Same as --list but inverted, i.e., only print the names of the files containing valid UTF-8
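The cases above might look like this on the command line. This is a sketch that assumes a moreutils release recent enough to provide the long options --verbose, --quiet, --list, and --invert; the scratch directory and the file names good.txt and bad.txt are made up for illustration.

```shell
# Create two small sample files in a scratch directory (illustrative names).
workdir=$(mktemp -d)
printf 'caf\303\251\n' > "$workdir/good.txt"   # "café" encoded as UTF-8 (C3 A9)
printf 'caf\351\n'     > "$workdir/bad.txt"    # "café" encoded as Latin-1 (E9), not valid UTF-8

if command -v isutf8 >/dev/null 2>&1; then
    # Check whether the specified files contain valid UTF-8
    isutf8 "$workdir/good.txt" "$workdir/bad.txt" || true
    # Print errors using multiple lines
    isutf8 --verbose "$workdir/bad.txt" || true
    # Print nothing, indicate the result with the exit code only
    isutf8 --quiet "$workdir/bad.txt" || echo "exit status: $?"
    # Only print the names of the files containing invalid UTF-8
    isutf8 --list "$workdir"/*.txt || true
    # Only print the names of the files containing valid UTF-8
    isutf8 --invert "$workdir"/*.txt || true
else
    echo "isutf8 not found; install the moreutils package" >&2
fi
# Remove the scratch directory with: rm -r "$workdir"
```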
SYNOPSIS
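isutf8 [-hqliv] [FILE ...]
PARAMETERS
-h, --help
Print a usage summary and exit.
-q, --quiet
Quiet mode: suppress output and report the result through the exit status only.
-l, --list
Print only the names of the files containing invalid UTF-8.
-i, --invert
Invert --list: print only the names of the files containing valid UTF-8.
-v, --verbose
Print each error in a more detailed, multi-line form.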
DESCRIPTION
isutf8 is a lightweight command-line tool from the moreutils package that checks whether input data is well-formed UTF-8. It reads the named files, or standard input (stdin) when no file is given, and scans the byte stream for invalid UTF-8 sequences.
UTF-8 is the dominant character encoding for Unicode on Unix-like systems, but files can contain invalid sequences due to corruption, mixed encodings, or legacy data. isutf8 detects problems such as overlong encodings, surrogate halves, and bytes that can never occur in UTF-8, which makes it useful in data validation pipelines, in script checks, and as a guard in front of tools that expect UTF-8.
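To make these categories concrete, the sketch below feeds a few hand-built byte sequences to isutf8 on standard input. The helper name probe is hypothetical, the byte values are standard textbook examples of each category, and the block assumes isutf8 is on the PATH and accepts --quiet.

```shell
# probe LABEL FORMAT: pipe the bytes produced by printf FORMAT into isutf8
# and report whether they were accepted. "probe" is a hypothetical helper.
probe() {
    if printf "$2" | isutf8 --quiet; then
        echo "$1: valid"
    else
        echo "$1: invalid"
    fi
}

probe 'overlong slash'  '\300\257'       # C0 AF: two-byte encoding of U+002F
probe 'surrogate half'  '\355\240\200'   # ED A0 80: encodes U+D800
probe 'impossible byte' '\377'           # FF never appears in UTF-8
probe 'euro sign'       '\342\202\254'   # E2 82 AC: U+20AC, well-formed
```

The octal escapes keep the script itself pure ASCII, so the test bytes are not altered by an editor or terminal along the way.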
For each input (stdin or file) that contains invalid UTF-8, it prints a diagnostic naming the source and the position of the first offending sequence (line, character, and byte offset); valid input produces no output. The exit code summarizes results: 0 for all valid, 1 for any invalid content, 2 for errors like unreadable files.
The -q (--quiet) option enables quiet mode for scripting, producing no output and relying solely on the exit status. This suits automation such as validating log files, CSV imports, or web content before parsing.
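One way to use quiet mode as a gate in a script is sketched below. The function name process_if_utf8 and the file name access.log are hypothetical, and the sketch assumes isutf8 is installed and accepts --quiet.

```shell
# process_if_utf8 FILE COMMAND...: run COMMAND on FILE only when isutf8
# accepts the file. Hypothetical helper, not part of moreutils.
process_if_utf8() {
    file=$1
    shift
    if isutf8 --quiet "$file"; then
        "$@" "$file"
    else
        echo "skipping $file: not valid UTF-8" >&2
        return 1
    fi
}
```

A call such as `process_if_utf8 access.log wc -l` would then count lines only if the log passed validation, and return a non-zero status otherwise.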
CAVEATS
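Validity is judged purely at the byte level: isutf8 reports whether the input is well-formed UTF-8, not whether the text was originally written in UTF-8 or is meaningful (pure ASCII input, for example, always passes). Not part of coreutils; requires the moreutils package. Standard input is read when no FILE is given.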
EXIT STATUS
0: all input valid.
1: at least one input contains invalid UTF-8.
2: I/O or other errors.
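A script can branch on these values with a case statement. The sketch below uses a hypothetical helper named utf8_status and treats any status other than 0 or 1 as an error, which is an assumption rather than documented behavior.

```shell
# utf8_status FILE...: translate the exit status of isutf8 into a word.
# Hypothetical helper; statuses other than 0 and 1 are lumped together as "error".
utf8_status() {
    status=0
    isutf8 --quiet "$@" || status=$?
    case $status in
        0) echo "valid" ;;
        1) echo "invalid" ;;
        *) echo "error" ;;
    esac
}
```

Capturing the status with `|| status=$?` keeps the function usable in scripts that run under `set -e`.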
OUTPUT FORMAT
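One diagnostic per invalid input source (stdin or FILE), prefixed with the source name and giving the position of the first invalid sequence; nothing is printed for valid input. -q suppresses all output, -l prints only the names of invalid files, -i prints only the names of valid files, and -v expands each diagnostic across multiple lines.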
HISTORY
Distributed with moreutils, the collection of small Unix utilities maintained by Joey Hess (~2007), where it fills the gap left by the lack of a standard tool for UTF-8 validation amid rising Unicode adoption.


