LinuxCommandLibrary

csvclean

Clean and normalize CSV files

TLDR

Clean a CSV file

$ csvclean [bad.csv]
copy

List locations of syntax errors in a CSV file
$ csvclean -n [bad.csv]
copy

SYNOPSIS

csvclean [options] [FILE]

PARAMETERS

-h, --help
    Show help message and exit.

-d DELIM, --delimiter DELIM
    Delimiter to use (default: comma).

-t, --tabs
    Use tabs as delimiter.

-q QUOTECHAR, --quotechar QUOTECHAR
    Quote character (default: ").

-u ESCAPECHAR, --escapechar ESCAPECHAR
    Escape character (default: none).

-z FIELD_SIZE, --maxfieldsize FIELD_SIZE
    Max field size in bytes (default: 131072).

-e ENCODING, --encoding ENCODING
    Input encoding (default: utf-8).

-H, --no-header-row
    Treat first row as data, not header.

-v, --verbose
    Print warnings about cleaned rows.

-l, --list-delimiters
    List known delimiters.

-V, --version
    Show csvkit version.

DESCRIPTION

csvclean is a command-line tool from the csvkit suite that fixes common formatting issues in CSV files, particularly rows with inconsistent numbers of fields. It reads the input CSV (from a file or stdin), uses the first row by default as the header to determine the expected column count, then processes each subsequent row: padding short rows with empty fields and truncating long rows to match. This ensures uniform structure suitable for further processing.

Key features include customizable delimiters, quote characters, encoding support, and verbose output for diagnostics. It handles quoting and escaping per RFC 4180 standards where possible, but warns on potential issues. Output is always to stdout, enabling easy piping to other tools like csvcut or csvlook.

Common use cases: cleaning exports from spreadsheets, web scrapers, or databases where data entry errors cause field mismatches. While effective for quick fixes, users should verify output as truncation can discard data. It's Python-based, efficient for moderate-sized files, and integrates seamlessly into shell scripts for ETL pipelines.

Not ideal for CSVs with multiline fields unless properly quoted, as it processes line-by-line.

CAVEATS

Truncates extra fields in long rows, causing data loss.
Assumes first row defines column count accurately.
Line-based; mishandles unquoted multiline fields.
Requires csvkit installed (not standard in Linux distros).

INSTALLATION

pip install csvkit or apt install csvkit (Ubuntu/Debian).

EXAMPLE

csvclean messy.csv > clean.csv

cat data.csv | csvclean -d ';' -v > fixed.csv

HISTORY

Part of csvkit, created in 2010 by Christopher Groskopf for journalistic data work. csvclean introduced early for preprocessing messy CSVs. Maintained on GitHub; v1.0+ stabilized in 2016 with Python 3 support.

SEE ALSO

csvcut(1), csvlook(1), csvstat(1), csvjoin(1), in2csv(1), xsv(1)

Copied to clipboard