LinuxCommandLibrary

csv-diff

Compare two CSV files for differences

TLDR

Display a human-readable summary of differences between files using a specific column as a unique identifier

$ csv-diff [path/to/file1.csv] [path/to/file2.csv] --key [column_name]

Display a human-readable summary of differences between files that includes unchanged values in rows with at least one change
$ csv-diff [path/to/file1.csv] [path/to/file2.csv] --key [column_name] --show-unchanged

Display a summary of differences between files in JSON format using a specific column as a unique identifier
$ csv-diff [path/to/file1.csv] [path/to/file2.csv] --key [column_name] --json

SYNOPSIS

csv-diff [OPTIONS] FILE1 FILE2

PARAMETERS

--key COLUMN
    Column to use as the unique identifier for each row. Rows that share a key value in both files are compared field by field.

--format FORMAT
    Explicitly specify the input format (for example csv or tsv) instead of relying on auto-detection.

--json
    Output the changes as JSON instead of a human-readable summary.

--singular TEXT
    Singular noun to use in the summary, e.g. 'tree' for "1 tree".

--plural TEXT
    Plural noun to use in the summary, e.g. 'trees' for "2 trees".

--show-unchanged
    Also show the unchanged fields of rows that have at least one change.

--version
    Show the program's version number and exit.

--help
    Show a help message and exit.

DESCRIPTION

csv-diff is a command-line utility that compares two CSV (Comma Separated Values) files and reports the differences between them. Rows are matched on a key column given with --key, and each row is then classified as added, removed, or changed; for changed rows the tool reports which fields differ. Output is a human-readable summary by default, or JSON when requested. Typical uses include data validation, auditing changes between two snapshots of a dataset, and automating CSV comparisons in data pipelines.
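
The key-based matching described above can be tried on a small pair of files. The following is a minimal sketch, not an official example: the file names, the temporary directory, and the sample rows are hypothetical, and it assumes csv-diff is installed and on the PATH (the script only prints a notice otherwise).

```shell
#!/bin/sh
# Hypothetical sample data: two snapshots of the same table, keyed by "id".
workdir=$(mktemp -d)

cat > "$workdir/old.csv" <<'EOF'
id,name,age
1,Cleo,4
2,Pancakes,2
EOF

cat > "$workdir/new.csv" <<'EOF'
id,name,age
1,Cleo,5
3,Bailey,1
EOF

# Rows are matched on "id": row 1 changed (age 4 -> 5),
# row 2 exists only in old.csv, and row 3 exists only in new.csv.
if command -v csv-diff >/dev/null 2>&1; then
    csv-diff "$workdir/old.csv" "$workdir/new.csv" --key id
else
    echo "csv-diff is not installed" >&2
fi
```

With this data the report should describe one changed row, one added row, and one removed row. Adding --show-unchanged would also list the name field of row 1, which did not change.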

CAVEATS

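csv-diff relies on consistent CSV formatting in both input files. Differences in formatting between the two files (e.g., different delimiters or mismatched header names) may lead to inaccurate results. The key column should identify each row uniquely, otherwise rows cannot be matched reliably. Comparing very large CSV files may be slow.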

OUTPUT FORMATS

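The output format determines how the differences are presented.
Default: a human-readable summary that counts the changed, added, and removed rows and then lists the details of each.
--show-unchanged: the same summary, with the unchanged fields of each changed row included for context.
--json: the differences as a JSON document, suitable for further processing by other programs.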
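
As a rough illustration of machine-readable output, the sketch below saves the result of --json to a file and pretty-prints it. The sample files and the diff.json name are hypothetical, and the script assumes both csv-diff and python3 are available on the PATH; it does nothing beyond creating the sample files if csv-diff is missing.

```shell
#!/bin/sh
# Hypothetical sample data, keyed by "id".
workdir=$(mktemp -d)
printf 'id,name,age\n1,Cleo,4\n2,Pancakes,2\n' > "$workdir/old.csv"
printf 'id,name,age\n1,Cleo,5\n3,Bailey,1\n' > "$workdir/new.csv"

if command -v csv-diff >/dev/null 2>&1; then
    # Save the machine-readable diff so other tools can consume it.
    csv-diff "$workdir/old.csv" "$workdir/new.csv" --key id --json > "$workdir/diff.json"
    # Pretty-print it with the json.tool module from Python's standard library.
    python3 -m json.tool "$workdir/diff.json"
fi
```

The JSON document is expected to group rows under keys such as added, removed, and changed; check the output of the installed version before relying on a particular structure in scripts.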

HISTORY

csv-diff is an open-source Python tool written by Simon Willison. It automates the comparison of CSV files, a common task in data management and software development wherever CSV is used as a data exchange format.

SEE ALSO

diff(1), cmp(1)
