csvclean
Clean and normalize CSV files
TLDR
Clean a CSV file
List locations of syntax errors in a CSV file
SYNOPSIS
csvclean [options] [FILE]
PARAMETERS
-h, --help
Show help message and exit.
-d DELIM, --delimiter DELIM
Delimiter to use (default: comma).
-t, --tabs
Use tabs as delimiter.
-q QUOTECHAR, --quotechar QUOTECHAR
Quote character (default: ").
-u ESCAPECHAR, --escapechar ESCAPECHAR
Escape character (default: none).
-z FIELD_SIZE, --maxfieldsize FIELD_SIZE
Max field size in bytes (default: 131072).
-e ENCODING, --encoding ENCODING
Input encoding (default: utf-8).
-H, --no-header-row
Treat first row as data, not header.
-v, --verbose
Print warnings about cleaned rows.
-l, --list-delimiters
List known delimiters.
-V, --version
Show csvkit version.
DESCRIPTION
csvclean is a command-line tool from the csvkit suite that fixes common formatting issues in CSV files, particularly rows with inconsistent numbers of fields. It reads the input CSV (from a file or stdin), uses the first row by default as the header to determine the expected column count, then processes each subsequent row: padding short rows with empty fields and truncating long rows to match. This ensures uniform structure suitable for further processing.
Key features include customizable delimiters, quote characters, encoding support, and verbose output for diagnostics. It handles quoting and escaping per RFC 4180 standards where possible, but warns on potential issues. Output is always to stdout, enabling easy piping to other tools like csvcut or csvlook.
Common use cases: cleaning exports from spreadsheets, web scrapers, or databases where data entry errors cause field mismatches. While effective for quick fixes, users should verify output as truncation can discard data. It's Python-based, efficient for moderate-sized files, and integrates seamlessly into shell scripts for ETL pipelines.
Not ideal for CSVs with multiline fields unless properly quoted, as it processes line-by-line.
CAVEATS
Truncates extra fields in long rows, causing data loss.
Assumes first row defines column count accurately.
Line-based; mishandles unquoted multiline fields.
Requires csvkit installed (not standard in Linux distros).
INSTALLATION
pip install csvkit or apt install csvkit (Ubuntu/Debian).
EXAMPLE
csvclean messy.csv > clean.csv
cat data.csv | csvclean -d ';' -v > fixed.csv
HISTORY
Part of csvkit, created in 2010 by Christopher Groskopf for journalistic data work. csvclean introduced early for preprocessing messy CSVs. Maintained on GitHub; v1.0+ stabilized in 2016 with Python 3 support.


