LinuxCommandLibrary

isutf8

Check if file is valid UTF-8

TLDR

Check whether the specified files contain valid UTF-8

$ isutf8 [path/to/file1 path/to/file2 ...]

Print errors using multiple lines
$ isutf8 [[-v|--verbose]] [path/to/file1 path/to/file2 ...]

Do not print anything to stdout, indicate the result merely with the exit code
$ isutf8 [[-q|--quiet]] [path/to/file1 path/to/file2 ...]

Only print the names of the files containing invalid UTF-8
$ isutf8 [[-l|--list]] [path/to/file1 path/to/file2 ...]

Same as --list but inverted, i.e., only print the names of the files containing valid UTF-8
$ isutf8 [[-i|--invert]] [path/to/file1 path/to/file2 ...]

SYNOPSIS

isutf8 [-q] [-v] [-l] [-i] [-h] [-V] [FILE ...]

PARAMETERS

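-q, --quiet
    Quiet mode: print nothing to stdout and report the result through the exit status only.

-v, --verbose
    Print errors using multiple lines.

-l, --list
    Print only the names of the files containing invalid UTF-8.

-i, --invert
    Same as --list but inverted: print only the names of the files containing valid UTF-8.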

-h, --help
    Print usage help to stderr and exit.

-V, --version
    Display version info and copyright, then exit.

DESCRIPTION

isutf8 is a lightweight command-line tool from the moreutils package that checks whether input data is valid UTF-8. It reads the files named on the command line, or standard input (stdin) when none are given, and scans the content line by line for invalid byte sequences.

UTF-8 is the dominant character encoding for Unicode on Unix-like systems, but files can contain invalid sequences due to corruption, mixed encodings, or legacy data. isutf8 detects problems such as overlong encodings, surrogate halves, and bytes that can never appear in UTF-8. This makes it useful in data validation pipelines, in script checks, and as a guard before running tools that expect UTF-8 input.
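
To try these categories out, a small script can write sample files whose bytes fall into each one. This is a minimal sketch: make_utf8_samples is a hypothetical helper name, and the byte sequences are common textbook examples of malformed UTF-8, not anything taken from isutf8 itself.

```shell
#!/bin/sh
# make_utf8_samples DIR
# Hypothetical helper: writes one valid file and three malformed ones into DIR.
# printf octal escapes (\ooo) emit the raw bytes noted in the comments.
make_utf8_samples() {
    dir=$1
    printf 'caf\303\251\n'  > "$dir/valid.txt"      # C3 A9: valid 2-byte sequence for U+00E9
    printf '\300\257\n'     > "$dir/overlong.txt"   # C0 AF: overlong encoding of "/"
    printf '\355\240\200\n' > "$dir/surrogate.txt"  # ED A0 80: surrogate half U+D800
    printf 'abc\377\n'      > "$dir/stray.txt"      # FF: byte that never occurs in UTF-8
}

# Usage (requires isutf8 from moreutils to be installed):
#   dir=$(mktemp -d) && make_utf8_samples "$dir"
#   isutf8 "$dir"/*.txt
```

Only valid.txt should pass; the other three each contain one of the error types named above.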

For each input (stdin or file), it outputs a simple status: "stdin:valid", "stdin:invalid", "file:valid", or "file:invalid". The exit code summarizes results: 0 for all valid, 1 for any invalid content, 2 for errors like unreadable files.

The -q (--quiet) option suits scripting: it produces no output, so callers rely solely on the exit status. Typical uses include validating log files, CSV imports, or web content before parsing.
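
A script might gate a processing step on that exit status. The sketch below uses the quiet flag as listed in the TLDR section (-q); import_if_utf8 and the "importing" message are hypothetical placeholders for whatever step actually consumes the file.

```shell
#!/bin/sh
# import_if_utf8 FILE
# Hypothetical wrapper: run a processing step only when FILE is valid UTF-8.
import_if_utf8() {
    file=$1
    if isutf8 -q "$file"; then
        echo "importing $file"    # stand-in for the real processing step
    else
        echo "skipping $file: not valid UTF-8" >&2
        return 1
    fi
}

# Usage:
#   import_if_utf8 data.csv || exit 1
```

Because the wrapper returns non-zero on failure, it composes with && and || in the same way as isutf8 itself.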

CAVEATS

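Input is processed line by line. Because a newline byte (0x0A) is never part of a multi-byte UTF-8 sequence, a sequence cut off by a line break is already invalid on the line where it starts, so line-wise scanning should not hide errors. isutf8 is not part of coreutils; it requires the moreutils package. Standard input is named 'stdin:' in the output.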

EXIT STATUS

0: all input valid.
1: invalid UTF-8 in any line.
2: I/O or other errors.
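
The three-way distinction above can be handled with a case statement. This is a sketch that assumes the exit codes are exactly as listed here; report_utf8_status is a hypothetical function name, and any status other than 0 or 1 is treated as an error.

```shell
#!/bin/sh
# report_utf8_status FILE
# Hypothetical helper: translate the exit status of isutf8 into a word.
report_utf8_status() {
    status=0
    isutf8 -q "$1" || status=$?   # capture the status without aborting under set -e
    case $status in
        0) echo "valid" ;;
        1) echo "invalid" ;;
        *) echo "error ($status)" ;;
    esac
}

# Usage:
#   report_utf8_status access.log
```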

OUTPUT FORMAT

source:valid or source:invalid per input source (stdin or FILE), unless -q is used.

HISTORY

isutf8 was written by Julien Palard and is distributed in moreutils, the collection of small Unix tools created by Joey Hess (~2007). It fills the gap left by the lack of a standard tool for UTF-8 validation as Unicode adoption grew.

SEE ALSO

file(1), iconv(1), recode(1)
