dvc-diff
Show changes between DVC tracked data versions
TLDR
Compare DVC-tracked files in a Git commit, tag, or branch against the current workspace
$ dvc diff <commit_hash|tag|branch>
Compare the changes in DVC-tracked files from one Git commit to another
$ dvc diff <revision1> <revision2>
Compare DVC-tracked files and show the hash of each changed item
$ dvc diff --show-hash <revision>
Compare DVC-tracked files, displaying the output as JSON
$ dvc diff --json <revision>
Compare DVC-tracked files, displaying the output as a Markdown table
$ dvc diff --md <revision>
SYNOPSIS
dvc diff [-h] [-q | -v] [--targets [<paths> ...]] [--json] [--show-hash] [--md] [--hide-missing] [old_revision] [new_revision]
PARAMETERS
old_revision
The old revision (Git commit, branch, or tag) to compare from. Defaults to the HEAD commit.
new_revision
The new revision (Git commit, branch, or tag) to compare to. Defaults to the current workspace.
-h, --help
Show the help message and exit.
-q, --quiet
Do not print anything to standard output. Useful for scripting.
-v, --verbose
Print detailed information for debugging.
--show-hash
Display the hash value of each changed item alongside its path.
--hide-missing
Do not list data that is missing from both the workspace and the cache (the "not in cache" group).
--json
Output the differences in JSON format. Useful for scripting.
--md
Output the differences as a Markdown table.
--targets <paths>...
Limit the comparison to the given DVC-tracked files or directories. If not provided, all tracked data is compared.
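A call to dvc diff can also be assembled programmatically. The following is a minimal sketch, not part of DVC: the helper names build_dvc_diff_cmd and run_dvc_diff and their parameters are hypothetical, and the sketch assumes the --json, --md, --show-hash, and --targets flags of recent DVC releases. Revisions are placed before --targets because that flag accepts several paths and would otherwise absorb them.

```python
import subprocess
from typing import List, Optional, Sequence


def build_dvc_diff_cmd(
    old_rev: Optional[str] = None,
    new_rev: Optional[str] = None,
    targets: Sequence[str] = (),
    output: Optional[str] = None,  # "json", "md", or None for plain text
    show_hash: bool = False,
) -> List[str]:
    """Assemble an argv list for `dvc diff` (hypothetical helper)."""
    if new_rev is not None and old_rev is None:
        # Revisions are positional, so the second cannot be given alone.
        raise ValueError("new_rev requires old_rev")
    if output not in (None, "json", "md"):
        raise ValueError(f"unsupported output format: {output!r}")

    cmd = ["dvc", "diff"]
    if output is not None:
        cmd.append(f"--{output}")
    if show_hash:
        cmd.append("--show-hash")
    # Positional revisions go before --targets (assumed to take 1+ paths).
    cmd += [rev for rev in (old_rev, new_rev) if rev is not None]
    if targets:
        cmd += ["--targets", *targets]
    return cmd


def run_dvc_diff(**kwargs) -> str:
    """Run the assembled command in the current directory; return stdout."""
    result = subprocess.run(
        build_dvc_diff_cmd(**kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

For instance, run_dvc_diff(old_rev="HEAD~1", output="json") would run the equivalent of dvc diff --json HEAD~1 inside a DVC project.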
DESCRIPTION
dvc diff shows how DVC-tracked files and directories differ between states of a repository. Where git diff compares source code, dvc diff compares the data artifacts (datasets, models, and other large files) that DVC tracks alongside the code.
It can compare two arbitrary revisions (commits, branches, or tags) or compare a revision with the current workspace. Changes are grouped by status (added, deleted, modified, or renamed), as determined by comparing DVC's content hashes. Data that is missing from both the workspace and the cache is reported as not in cache.
dvc diff covers data only. Differences in metrics (e.g., accuracy, loss values), parameters, and plots are reported by the dedicated commands dvc metrics diff, dvc params diff, and dvc plots diff.
The command leverages Git for revision management, allowing users to easily compare different versions of their data science projects.
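The status groups can be read back from the JSON output in a script. The sketch below assumes that dvc diff --json prints one object whose keys are the group names ("added", "deleted", "modified", "renamed", "not in cache"), each holding a list of entries with a "path" field, and that renamed entries carry an object with "old" and "new" paths. That layout is an assumption based on recent DVC releases and should be checked against your installed version. The function name paths_by_status and the sample data are hypothetical.

```python
import json
from typing import Dict, List

# Group names assumed to appear in `dvc diff --json` output.
GROUPS = ("added", "deleted", "modified", "renamed", "not in cache")


def paths_by_status(diff_json: str) -> Dict[str, List[str]]:
    """Map each status group to a list of human-readable paths."""
    data = json.loads(diff_json) if diff_json.strip() else {}
    result: Dict[str, List[str]] = {}
    for group in GROUPS:
        paths = []
        for entry in data.get(group, []):
            path = entry["path"]
            # Renamed entries are assumed to look like {"old": ..., "new": ...}.
            if isinstance(path, dict):
                path = f'{path["old"]} -> {path["new"]}'
            paths.append(path)
        result[group] = paths
    return result


# Hand-written sample in the assumed layout, not real command output.
SAMPLE = json.dumps({
    "added": [{"path": "data/new.csv"}],
    "deleted": [],
    "modified": [{"path": "model.pkl"}],
    "renamed": [{"path": {"old": "a.txt", "new": "b.txt"}}],
    "not in cache": [],
})

if __name__ == "__main__":
    for status, paths in paths_by_status(SAMPLE).items():
        for path in paths:
            print(f"{status}: {path}")
```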
CAVEATS
dvc diff relies on DVC's metadata (.dvc files and dvc.lock) and on Git's revision history. It has no useful effect in a project that is not tracked by Git (for example, one initialized with dvc init --no-scm), and a revision can only be compared if its DVC metadata files were committed.
For very large datasets, calculating and displaying differences can be time-consuming, as DVC may need to compare hashes or even access remote storage.
The command reports which tracked paths changed, based on their content hashes. It does not show line-by-line differences inside the files themselves. For detailed content diffs of text files, git diff or specialized tools are needed.
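When a line-level view of a changed text file is needed, one option is to retrieve both versions (for example with dvc get and its --rev option, or by checking out each revision) and diff them with a general-purpose tool. The following sketch uses Python's standard difflib module. The function name text_diff and the file name are hypothetical, and the two versions are assumed to be already loaded as strings.

```python
import difflib
from typing import List


def text_diff(old_text: str, new_text: str, name: str = "data.csv") -> List[str]:
    """Return unified-diff lines between two versions of a text file."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
        )
    )


if __name__ == "__main__":
    # Toy contents standing in for two revisions of the same tracked file.
    old = "id,label\n1,cat\n2,dog\n"
    new = "id,label\n1,cat\n2,bird\n"
    print("".join(text_diff(old, new)), end="")
```

This approach suits small text files only. For large or binary data the hash-level report from dvc diff remains the practical summary.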
INTEGRATION WITH GIT
dvc diff depends on Git for revision handling. Run without revisions, it compares the current workspace with HEAD, the last commit. When revisions are provided, it resolves them through Git, reads the DVC metadata recorded in each revision, and compares the content hashes of the tracked files and directories.
USE CASES
Typical use cases for dvc diff include checking which datasets or model files changed between two branches, reviewing data changes in the workspace before committing them, and feeding the list of changed paths to scripts or CI jobs through the JSON output. To compare the performance of two training runs, use dvc metrics diff and dvc plots diff instead.
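As an illustration of a scripted use, a CI job might decide whether to retrain a model depending on whether anything under a given directory changed between two branches. The sketch below is hypothetical: the function name changed_under is invented for this example, and it assumes the diff has already been parsed from dvc diff --json output into a dictionary with "added", "deleted", "modified", and "renamed" lists of entries that each have a "path" field.

```python
from typing import Any, Dict, Iterable, List


def changed_under(diff: Dict[str, Any], prefixes: Iterable[str]) -> List[str]:
    """Return sorted changed paths that start with any of the given prefixes."""
    prefix_tuple = tuple(prefixes)
    hits = set()
    for group in ("added", "deleted", "modified", "renamed"):
        for entry in diff.get(group, []):
            path = entry["path"]
            # Renamed entries are assumed to hold both the old and new path.
            candidates = [path["old"], path["new"]] if isinstance(path, dict) else [path]
            hits.update(p for p in candidates if p.startswith(prefix_tuple))
    return sorted(hits)


if __name__ == "__main__":
    # Hand-written sample in the assumed layout, not real command output.
    sample = {
        "added": [{"path": "data/raw/2024.csv"}],
        "modified": [{"path": "models/model.pkl"}],
        "renamed": [{"path": {"old": "data/raw/a.csv", "new": "archive/a.csv"}}],
    }
    if changed_under(sample, ["data/raw/"]):
        print("raw data changed: retraining needed")
```

Paths are compared as plain strings, so prefixes should use the same separator that dvc diff prints on your platform.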
HISTORY
DVC was first released in 2017 with the aim of providing Git-like version control for machine learning projects, and the ability to compare versions of data and models is central to that aim. dvc diff serves this purpose for data. As DVC later added tracking for metrics, parameters, and plots, comparison of those was given to dedicated commands (dvc metrics diff, dvc params diff, dvc plots diff) rather than folded into dvc diff.
SEE ALSO
git diff(1), dvc status(1), dvc metrics diff(1), dvc params diff(1), dvc plots diff(1), dvc metrics show(1), dvc plots show(1)