csvclean

Clean and normalize CSV files

TLDR

Clean a CSV file

$ csvclean [bad.csv]

List locations of syntax errors in a CSV file

$ csvclean -n [bad.csv]

SYNOPSIS

csvclean [OPTIONS] [FILE]
or
csvclean [OPTIONS] -

PARAMETERS

Note: available options vary between csvkit versions; run csvclean --help for the authoritative list for your installation. The options below are csvclean's own flag plus the input options common to all csvkit tools.

-n, --dry-run
    Do not create output files; print errors instead (csvkit 1.x). In csvkit 2.x, error rows are written to standard error by default.

-d DELIMITER, --delimiter DELIMITER
    Delimiting character of the input CSV file.

-t, --tabs
    Specify that the input CSV file is delimited with tabs. Overrides -d.

-q QUOTECHAR, --quotechar QUOTECHAR
    Character used to quote strings in the input CSV file.

-e ENCODING, --encoding ENCODING
    Specify the encoding of the input CSV file.

-S, --skipinitialspace
    Ignore whitespace immediately following the delimiter.

-H, --no-header-row
    Specify that the input CSV file has no header row; generated column names (a, b, c, ...) are used instead.

-K SKIP_LINES, --skip-lines SKIP_LINES
    Specify the number of initial lines to skip before the header row (e.g., comments or copyright notices).

-l, --linenumbers
    Insert a column of line numbers at the front of the output.

--zero
    When interpreting or displaying column numbers, use zero-based numbering instead of the default one-based numbering.

-v, --verbose
    Print detailed tracebacks when errors occur.

-h, --help
    Show the help message and exit.

-V, --version
    Display version information and exit.

Newer releases (csvkit 2.x) also add repair options such as --join-short-rows, --fill-short-rows, --fillvalue, and --omit-error-rows for normalizing rows with too few or too many columns; confirm their availability with csvclean --help.
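For example, to report errors in a headerless, Latin-1 encoded file (file name hypothetical; assumes csvkit 1.x, where -n/--dry-run prints an error report):

$ csvclean -n -e latin1 -H survey_latin1.csv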

DESCRIPTION

csvclean is a utility from the csvkit suite that identifies and fixes common structural problems in CSV (Comma-Separated Values) files. It is typically used as a preprocessing step, ensuring that CSV data has a consistent shape before further analysis or database ingestion.

The command parses the input and flags problematic rows: rows with too many or too few columns relative to the header, completely blank rows, and rows that fail to parse at all. How results are delivered depends on the csvkit version. In csvkit 1.x, running csvclean on input.csv writes the valid rows to input_out.csv and the bad rows to input_err.csv; in csvkit 2.x, valid rows go to standard output and error rows to standard error. In both cases, the separation of valid and invalid data lets users inspect and address errors independently, which makes csvclean useful both in automated data pipelines and for manual data quality control. Like the rest of csvkit, it handles a range of CSV dialects and character encodings.
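A minimal illustration of both behaviors, using a hypothetical bad.csv:

csvkit 1.x (writes valid rows to bad_out.csv and, if any rows fail, bad_err.csv):

$ csvclean bad.csv

csvkit 2.x (valid rows on stdout, error rows on stderr):

$ csvclean bad.csv > bad_clean.csv 2> bad_errors.csv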

CAVEATS

csvclean checks structural integrity (for example, consistent column counts) rather than validating data content or semantic correctness. Its behavior changed significantly in csvkit 2.x: older releases write results to <basename>_out.csv and <basename>_err.csv files, while newer releases write valid rows to stdout and error rows to stderr, so scripts may need redirection to capture errors. It relies on csvkit's Python-based parsing engine, which, while robust, can be slow on very large files and may struggle with extremely malformed or non-standard CSV dialects.

STANDARD INPUT/OUTPUT

In csvkit 2.x, csvclean reads from standard input when FILE is omitted or given as '-', and writes cleaned data to standard output, so it can be integrated into shell pipelines with other Unix commands for chained processing.
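A pipeline sketch, assuming csvkit 2.x streaming behavior and hypothetical file and column names:

$ cat messy.csv | csvclean 2> errors.csv | csvcut -c name,email > clean.csv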

ERROR REPORTING

For rows it deems problematic, csvclean reports the offending line together with a descriptive message such as 'Expected 3 columns, found 4 columns'. In csvkit 1.x this report is printed by -n/--dry-run; in csvkit 2.x the error rows themselves are emitted on standard error. This detailed reporting is useful for debugging and understanding data quality issues.
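An illustrative dry run (csvkit 1.x; the exact message wording varies by version):

$ csvclean -n bad.csv
Line 3: Expected 3 columns, found 4 columns
Line 7: Expected 3 columns, found 2 columns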

COLUMN NORMALIZATION

csvkit 2.x adds repair options that can enforce a consistent column count across all rows, such as --fill-short-rows with --fillvalue to pad short rows, and --join-short-rows to merge rows split by stray line breaks. A consistent column count is often a prerequisite for loading data into databases, structured data frames, or machine learning pipelines. An example follows; confirm the exact options with csvclean --help.
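A sketch of padding ragged rows, assuming csvkit 2.x and the repair flags named above (file name hypothetical):

$ csvclean --fill-short-rows --fillvalue "NA" ragged.csv > rectangular.csv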

HISTORY

csvclean is part of the csvkit suite, an open-source collection of command-line tools for converting to and working with CSV. csvkit was created by Christopher Groskopf and first released in 2011, gaining popularity for its powerful yet simple approach to CSV manipulation using standard Unix pipes. csvclean was added to address the common problem of inconsistent or malformed CSV data, enabling diverse datasets to be integrated into analytical workflows. Its development has focused on robustness, making it a reliable data quality tool in data science and engineering tasks.

SEE ALSO

csvcut(1), csvgrep(1), csvsort(1), csvstat(1), csvstack(1), awk(1), sed(1)
