LinuxCommandLibrary

csv-diff

Compare two CSV files for differences

TLDR

Display a human-readable summary of differences between files using a specific column as a unique identifier

$ csv-diff [path/to/file1.csv] [path/to/file2.csv] --key [column_name]

Display a human-readable summary of differences between files that includes unchanged values in rows with at least one change
$ csv-diff [path/to/file1.csv] [path/to/file2.csv] --key [column_name] --show-unchanged

Display a summary of differences between files in JSON format using a specific column as a unique identifier
$ csv-diff [path/to/file1.csv] [path/to/file2.csv] --key [column_name] --json

SYNOPSIS

csv-diff [options] file1.csv file2.csv

PARAMETERS

--key COLUMN
    Uses COLUMN as the unique identifier for matching rows between the two files. Without a key, rows are matched on their full content, so a modified row is reported as one removed row plus one added row.

--format [csv|tsv|json]
    Explicitly sets the input format instead of auto-detecting it.

--json
    Outputs the differences as JSON instead of the human-readable summary.

--show-unchanged
    Also shows the unchanged values of rows that have at least one change.

--singular TEXT
    Singular noun to use in the summary, e.g. 'tree' for "1 tree".

--plural TEXT
    Plural noun to use in the summary, e.g. 'trees' for "2 trees".

DESCRIPTION

The csv-diff command is a specialized utility designed to compare two Comma Separated Value (CSV) files and identify their differences.
It is particularly useful for tracking changes in datasets, verifying data migrations, or comparing database exports.
Unlike generic text diff tools, csv-diff understands the tabular structure of CSV files: it matches rows by a specified key column and then reports discrepancies in the remaining columns.
It distinguishes added, removed, and modified rows, and its human-readable output shows which cells changed.
Because rows are matched by key rather than by position, a change in row order alone is not reported as a difference, whereas a line-by-line diff would flag every moved line.
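
The key-based matching described above can be illustrated with a minimal Python sketch. This is not csv-diff's actual implementation; the helper names `load_rows` and `diff_rows` and the sample data are hypothetical, and only the standard library is used.

```python
import csv
import io


def load_rows(text, key):
    """Parse CSV text into a dict mapping each row's key value to the row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key]: row for row in reader}


def diff_rows(previous, current):
    """Classify rows as added, removed, or changed by comparing keyed dicts."""
    added = [current[k] for k in current if k not in previous]
    removed = [previous[k] for k in previous if k not in current]
    changed = {}
    for k in previous.keys() & current.keys():
        # Record (old, new) for every cell whose value differs.
        cells = {
            col: (previous[k][col], current[k].get(col))
            for col in previous[k]
            if previous[k][col] != current[k].get(col)
        }
        if cells:
            changed[k] = cells
    return added, removed, changed


# Hypothetical sample data: row order differs, one cell changed, one row added.
OLD = "id,name,age\n1,Cleo,4\n2,Pancakes,2\n"
NEW = "id,name,age\n2,Pancakes,3\n1,Cleo,4\n3,Bailey,1\n"

added, removed, changed = diff_rows(load_rows(OLD, "id"), load_rows(NEW, "id"))
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

Rows 1 and 2 swap positions between the two samples, yet only the changed `age` cell and the new row are reported, because rows are looked up by `id` rather than by line number.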

CAVEATS

  • Key Column Importance: The accuracy of csv-diff depends on choosing a key column whose values are unique. Without a key, rows can only be matched on their full content, so any edited row shows up as a removal plus an addition rather than as a change.
  • Performance with Large Files: Comparing very large CSV files can be slow and memory-intensive, because both files are loaded into memory for comparison.
  • Duplicate Keys: If the same key value appears more than once within a file, rows cannot be matched unambiguously and the reported differences may be misleading.
  • Values Compared as Text: CSV cells are text, so values that are numerically equal but formatted differently (e.g., "1" vs "1.0") are likely to be reported as changes.
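
Two of these caveats can be checked before running a diff. The sketch below is a hypothetical pre-flight helper, not part of csv-diff: `duplicate_keys` lists key values that occur more than once, and `same_value` shows how a numeric comparison differs from a plain text comparison.

```python
import csv
import io
from collections import Counter


def duplicate_keys(text, key):
    """Return the key values that appear more than once in the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    counts = Counter(row[key] for row in reader)
    return sorted(k for k, n in counts.items() if n > 1)


def same_value(a, b):
    """Compare two cells numerically when both parse as numbers, else as text."""
    try:
        return float(a) == float(b)
    except ValueError:
        return a == b


# Hypothetical sample data with a repeated id.
SAMPLE = "id,name\n1,Cleo\n2,Pancakes\n1,Bailey\n"

print(duplicate_keys(SAMPLE, "id"))
print("1" == "1.0", same_value("1", "1.0"))
```

If `duplicate_keys` returns a non-empty list, the chosen column is not a safe key for that file.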

USE CASES FOR DIFFERENT OUTPUT FORMATS

The default human-readable output suits quick visual inspection.
The --json output is intended for programmatic consumption, allowing other scripts or applications to process the detected differences.
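
As a sketch of programmatic consumption, the Python snippet below summarizes a JSON diff. The sample document is hypothetical and written in the shape the csv-diff README shows for its JSON output (`added`, `removed`, `changed`, `columns_added`, `columns_removed`); treat the exact keys as an assumption and check them against real output. The `summarize` helper is illustrative only.

```python
import json

# Hypothetical sample; the key names are an assumption based on the csv-diff README.
SAMPLE_JSON = """
{
  "added": [{"id": "3", "name": "Bailey", "age": "1"}],
  "removed": [],
  "changed": [{"key": "2", "changes": {"age": ["2", "3"]}}],
  "columns_added": [],
  "columns_removed": []
}
"""


def summarize(diff):
    """Return per-category row counts and a flat list of changed cells."""
    counts = {name: len(diff.get(name, [])) for name in ("added", "removed", "changed")}
    cells = [
        f"{entry['key']}.{column}: {old} -> {new}"
        for entry in diff.get("changed", [])
        for column, (old, new) in entry["changes"].items()
    ]
    return counts, cells


counts, cells = summarize(json.loads(SAMPLE_JSON))
print(counts)
for line in cells:
    print(line)
```

In practice the JSON would come from the command's standard output, for example by redirecting it to a file and loading that file with `json.load`.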

HANDLING MISSING COLUMNS

If a column exists in one file but not the other, csv-diff reports it as an added or removed column rather than as a change to every row. In JSON output, such columns are listed separately from row-level changes.
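
To see which columns differ between two files before diffing them, the header rows can be compared directly. The following is a minimal sketch using only the Python standard library; the `header` and `column_changes` helpers and the sample data are hypothetical.

```python
import csv
import io


def header(text):
    """Return the first row of the CSV text as a list of column names."""
    return next(csv.reader(io.StringIO(text)))


def column_changes(previous_text, current_text):
    """Return (added, removed) column names between two CSV texts."""
    prev, curr = header(previous_text), header(current_text)
    added = [c for c in curr if c not in prev]
    removed = [c for c in prev if c not in curr]
    return added, removed


# Hypothetical sample data: "age" is dropped and "weight" is introduced.
OLD = "id,name,age\n1,Cleo,4\n"
NEW = "id,name,weight\n1,Cleo,5\n"

cols_added, cols_removed = column_changes(OLD, NEW)
print("columns added:", cols_added)
print("columns removed:", cols_removed)
```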

HISTORY

csv-diff is an open-source tool written in Python by Simon Willison and distributed on PyPI. It is a standalone project and is not part of csvkit, although the two are often used together when working with CSV data on the command line. Its development reflects the growing need for specialized tools that handle tabular formats like CSV more intelligently than traditional text utilities.

SEE ALSO

diff(1), comm(1), csvcut(1), csvsort(1), csvstack(1)
