LinuxCommandLibrary

dvc-diff

Show changes between DVC tracked data versions

TLDR

Compare DVC-tracked files in a Git commit, tag, or branch against the current workspace

$ dvc diff [commit_hash/tag/branch]

Compare DVC-tracked files between two Git revisions
$ dvc diff [revision1] [revision2]

Compare DVC-tracked files and show the latest hash of each
$ dvc diff --show-hash [commit]

Compare DVC-tracked files and print the result as JSON
$ dvc diff --json --show-hash [commit]

Compare DVC-tracked files and print the result as a Markdown table
$ dvc diff --md --show-hash [commit]

SYNOPSIS

dvc diff [options] [--targets paths...] [old_revision] [new_revision]

PARAMETERS

old_revision
    The old revision (Git commit, branch, or tag) to compare from. Defaults to the HEAD commit.

new_revision
    The new revision (Git commit, branch, or tag) to compare to. Defaults to the current workspace.

-h, --help
    Show the help message and exit.

-q, --quiet
    Do not print anything to standard output. Useful for scripting.

-v, --verbose
    Print detailed information for debugging.

--show-hash
    Show the hash of each listed item.

--json
    Print the differences as JSON, for use in scripts.

--md
    Print the differences as a Markdown table.

--hide-missing
    Hide entries whose data is missing from the cache (the "not in cache" status).

--targets paths...
    Limit the comparison to one or more specific DVC-tracked paths. If omitted, all tracked items are compared.

DESCRIPTION

dvc-diff shows the differences in DVC-tracked files and directories between two states of a repository. Where git diff compares source code, dvc diff compares data artifacts such as datasets and models.

It can compare two arbitrary revisions (commits, branches, or tags) or compare a revision with the current workspace. Each changed file or directory is reported as added, deleted, modified, or renamed, as determined by DVC's content hashes.

dvc-diff does not compare metric values or plot data. Those comparisons are handled by the separate dvc metrics diff and dvc plots diff commands.

Revisions are resolved through Git, so any commit, branch, or tag in the repository can be used.

CAVEATS

dvc-diff relies on DVC's metadata (.dvc and dvc.lock files) and on Git's revision history. Comparisons between revisions reflect only metadata that has been committed. In a repository that is not tracked by Git (for example, one initialized with dvc init --no-scm), the command has no effect.

For very large datasets, calculating and displaying differences can be time-consuming, as DVC may need to compare hashes or even access remote storage.

The command reports which tracked files changed, based on their content hashes, rather than line-by-line differences inside the files. For line-level diffs of text files, use git diff (for files tracked by Git) or a dedicated diff tool.
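
When a line-level view of a DVC-tracked text file is needed, one possible workaround is to fetch both versions and hand them to the ordinary diff utility. The sketch below assumes the file's data is available in the cache or a configured remote, and that dvc get with --rev and -o behaves as in recent DVC releases; the function name dvc_text_diff is hypothetical.

```shell
# Hypothetical helper: line-level diff of one DVC-tracked text file
# between two Git revisions. Usage: dvc_text_diff <path> <old_rev> <new_rev>
dvc_text_diff() {
    path=$1
    old_rev=$2
    new_rev=$3

    # Scratch directory holding the two downloaded versions.
    tmp_dir=$(mktemp -d) || return 2

    status=0
    {
        # "." means: read from the current repository.
        dvc get . "$path" --rev "$old_rev" -o "$tmp_dir/old" &&
        dvc get . "$path" --rev "$new_rev" -o "$tmp_dir/new" &&
        diff -u "$tmp_dir/old" "$tmp_dir/new"
    } || status=$?

    rm -rf "$tmp_dir"
    # 0: identical, 1: files differ, other: a step failed.
    return "$status"
}
```

For example, dvc_text_diff data/labels.csv v1.0 HEAD would print a unified diff of that (illustrative) file between the tag v1.0 and the current commit.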

INTEGRATION WITH GIT

dvc-diff depends on Git to resolve revisions. When you run dvc diff without specifying revisions, it compares the HEAD commit with the current workspace. When revisions are provided, it reads the DVC metadata stored in each of those Git revisions and compares the content hashes recorded there.
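
Because revisions are plain Git references, a script can check them with Git before handing them to dvc diff. The following is a minimal sketch, not part of DVC itself; the wrapper name dvc_diff_checked is hypothetical, and it mirrors the defaults described above (HEAD for the old revision, the workspace when no new revision is given).

```shell
# Hypothetical wrapper: verify that each revision exists in Git,
# then run dvc diff with the same arguments.
dvc_diff_checked() {
    old_rev=${1:-HEAD}   # default old revision, as dvc diff itself uses
    new_rev=${2:-}       # empty: compare against the current workspace

    for rev in "$old_rev" ${new_rev:+"$new_rev"}; do
        # ^{commit} peels tags and branches down to a commit object.
        if ! git rev-parse --verify --quiet "${rev}^{commit}" >/dev/null; then
            echo "unknown Git revision: $rev" >&2
            return 1
        fi
    done

    if [ -n "$new_rev" ]; then
        dvc diff "$old_rev" "$new_rev"
    else
        dvc diff "$old_rev"
    fi
}
```

Calling dvc_diff_checked with no arguments is then equivalent to dvc diff HEAD, while a mistyped branch or tag name fails early with a short message.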

USE CASES

Typical use cases for dvc-diff include checking which datasets or model files changed between two branches, reviewing data changes before committing, and tracing how a change in data preprocessing altered downstream artifacts. To compare the performance of two training runs, pair it with dvc metrics diff.
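
As an illustration of the branch-comparison use case, the sketch below writes the Markdown output of dvc diff to a file that could be attached to a pull request. The function name write_diff_report and the default file name are hypothetical; only the --md and --show-hash options come from dvc diff.

```shell
# Hypothetical helper: save a Markdown summary of data changes
# between two revisions.
# Usage: write_diff_report <old_rev> <new_rev> [report_file]
write_diff_report() {
    old_rev=$1
    new_rev=$2
    report=${3:-dvc-diff-report.md}

    {
        # Heading naming the two revisions being compared.
        printf '## Data changes: %s to %s\n\n' "$old_rev" "$new_rev"
        # Markdown table of changed paths with their hashes.
        dvc diff --md --show-hash "$old_rev" "$new_rev"
    } > "$report"
}
```

For example, write_diff_report main feature-branch would create dvc-diff-report.md in the current directory.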

HISTORY

DVC was first released in 2017, and comparing versions of data and models has been central to its goal of providing Git-like version control for machine learning projects. dvc-diff covers data files and directories. As DVC gained metrics and plot tracking, comparisons for those were provided through the separate dvc metrics diff and dvc plots diff commands.

SEE ALSO

git diff(1), dvc status(1), dvc metrics show(1), dvc plots show(1)
