csv-diff
Compare two CSV files for differences
TLDR
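Display a human-readable summary of differences between files using a specific column as a unique identifier:
csv-diff path/to/file1.csv path/to/file2.csv --key column_name
Display a human-readable summary of differences between files that includes unchanged values in rows with at least one change:
csv-diff path/to/file1.csv path/to/file2.csv --key column_name --show-unchanged
Display a summary of differences between files in JSON format using a specific column as a unique identifier:
csv-diff path/to/file1.csv path/to/file2.csv --key column_name --json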
SYNOPSIS
csv-diff [options] file1.csv file2.csv
PARAMETERS
--key COLUMN
Specifies the column to use as the unique identifier for matching rows between the two files. Without a key, each row is identified by its full contents, so an edited row is reported as one removed row and one added row rather than as a change.
--format FORMAT
Sets the input format explicitly (for example csv or tsv) instead of relying on auto-detection.
--json
Outputs the differences as JSON instead of the human-readable summary.
--show-unchanged
Includes the unchanged values of rows that have at least one change.
--singular TEXT
Sets the singular noun used in the summary, e.g. "tree" instead of "row".
--plural TEXT
Sets the plural noun used in the summary, e.g. "trees" instead of "rows".
DESCRIPTION
The csv-diff command is a utility that compares two comma-separated values (CSV) files and reports their differences.
It is particularly useful for tracking changes in datasets, verifying data migrations, or comparing database exports.
Unlike generic text diff tools, csv-diff understands the tabular structure of CSV files, allowing it to intelligently match rows based on specified key columns and then report discrepancies across other columns.
It distinguishes between added rows, removed rows, and modified rows, and its human-readable output names the exact cells that have changed.
Because rows are aligned by key rather than by position, the comparison stays correct when row order changes, which a line-by-line diff would report as a large number of spurious edits.
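The key-based matching described above can be illustrated with a minimal sketch using only the Python standard library. This is not csv-diff's actual implementation; the helper names `load_rows` and `diff_rows` and the sample data are hypothetical, chosen only to show how indexing rows by a key separates added, removed, and changed rows regardless of row order.

```python
import csv
import io


def load_rows(text, key):
    """Index the rows of a CSV document by the value of its key column."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key]: row for row in reader}


def diff_rows(previous, current):
    """Classify rows as added, removed, or changed by comparing keys first."""
    added = [current[k] for k in current if k not in previous]
    removed = [previous[k] for k in previous if k not in current]
    changed = {}
    for k in previous.keys() & current.keys():
        # Record (old, new) pairs only for the cells whose values differ.
        cells = {
            column: (previous[k][column], current[k].get(column))
            for column in previous[k]
            if previous[k][column] != current[k].get(column)
        }
        if cells:
            changed[k] = cells
    return added, removed, changed


# Hypothetical sample data: row 1 is edited and moved, row 2 is removed,
# row 3 is new.
old_csv = "id,name,age\n1,Cleo,4\n2,Pancakes,2\n"
new_csv = "id,name,age\n3,Bailey,1\n1,Cleo,5\n"

added, removed, changed = diff_rows(
    load_rows(old_csv, "id"), load_rows(new_csv, "id")
)
print(added)
print(removed)
print(changed)
```

Row 1 changes position between the two samples, yet it is reported only as a changed row because the match is made on `id`, not on line number.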
CAVEATS
- Key Column Importance: The accuracy of csv-diff heavily relies on correctly identifying unique key columns. If no unique key exists or is specified, results may be unreliable or misleading.
- Performance with Large Files: Comparing extremely large CSV files can be memory-intensive and slow, as the command may need to load entire datasets into memory for efficient comparison.
- Order and Duplicates: With a key, row order does not affect the result, but duplicate key values within a file make row matching ambiguous. Without a key, the comparison cannot pair an edited row with its earlier version.
- Data Type Coercion: CSV cells are plain text, so equivalent values that are formatted differently (e.g., 1 vs 1.0) can be reported as changes. Some implementations might instead perform implicit type coercion during comparison, which can hide or introduce differences unexpectedly.
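Since the caveats above hinge on the key column being unique, it can help to check for duplicates before diffing. The following is a small sketch using the Python standard library; `duplicate_keys` is a hypothetical helper, not part of csv-diff.

```python
import csv
import io
from collections import Counter


def duplicate_keys(text, key):
    """Return the key values that appear in more than one row."""
    reader = csv.DictReader(io.StringIO(text))
    counts = Counter(row[key] for row in reader)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample: id 1 appears twice, so it is not a usable unique key.
sample = "id,name\n1,Cleo\n2,Pancakes\n1,Cleo II\n"
dupes = duplicate_keys(sample, "id")
print(dupes)
```

An empty result suggests the column is safe to pass as the key; any value listed would need to be resolved first.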
USE CASES FOR DIFFERENT OUTPUT FORMATS
The default human-readable output is well suited to quick visual inspection.
The --json output is intended for programmatic consumption, allowing other scripts or applications to process the detected differences.
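As a sketch of programmatic consumption, the snippet below parses a JSON diff and turns it into short summary lines. The JSON shape (`added`, `removed`, `changed`, `columns_added`, `columns_removed`) is an assumption about the output format, and the sample document is illustrative rather than captured from a real run; check the actual output of your installed version before relying on it.

```python
import json

# Illustrative sample only; the structure is an assumed shape for the
# JSON diff, not output captured from a real run.
sample_output = """
{
  "added": [{"id": "3", "name": "Bailey", "age": "1"}],
  "removed": [{"id": "2", "name": "Pancakes", "age": "2"}],
  "changed": [{"key": "1", "changes": {"age": ["4", "5"]}}],
  "columns_added": [],
  "columns_removed": []
}
"""


def summarize(diff):
    """Build one line of row counts plus one line per changed cell."""
    lines = [
        f"{len(diff.get('added', []))} added, "
        f"{len(diff.get('removed', []))} removed"
    ]
    for entry in diff.get("changed", []):
        for column, (old, new) in entry["changes"].items():
            lines.append(f"row {entry['key']}: {column} {old!r} -> {new!r}")
    return lines


summary = summarize(json.loads(sample_output))
for line in summary:
    print(line)
```

In practice the JSON text would come from the command's standard output rather than from an inline string.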
HANDLING MISSING COLUMNS
If a column exists in one file but not the other, csv-diff reports it as an added or removed column alongside the row-level differences, rather than failing or silently ignoring it.
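Column-level differences can be found by comparing the header rows before any row matching takes place. The sketch below shows one way to do this with the Python standard library; `header` and `compare_columns` are hypothetical helpers and do not reflect how csv-diff itself is implemented.

```python
import csv
import io


def header(text):
    """Return the column names from the first row of a CSV document."""
    return next(csv.reader(io.StringIO(text)))


def compare_columns(previous_text, current_text):
    """Return (columns_added, columns_removed), preserving header order."""
    previous, current = header(previous_text), header(current_text)
    columns_added = [c for c in current if c not in previous]
    columns_removed = [c for c in previous if c not in current]
    return columns_added, columns_removed


# Hypothetical sample: "age" is dropped and "weight" is introduced.
old_csv = "id,name,age\n1,Cleo,4\n"
new_csv = "id,name,weight\n1,Cleo,12\n"
columns_added, columns_removed = compare_columns(old_csv, new_csv)
print(columns_added, columns_removed)
```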
HISTORY
The csv-diff command is a standalone Python tool written by Simon Willison and distributed on PyPI. It is not part of csvkit, which is a separate suite of command-line tools for converting to and working with CSV and does not include a diff command. Besides the command-line interface, csv-diff can be used as a Python library. Its development reflects the growing need for specialized tools that handle semi-structured formats like CSV more intelligently than traditional text utilities.