dvc-checkout
Restore data tracked by DVC
TLDR
Check out the latest version of all target files and directories
Check out the latest version of a specified target
Check out a specific version of a target from a different Git commit/tag/branch
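The third case above can be sketched as a small shell helper. This is a minimal sketch, not an official recipe: the function name checkout_at_rev is hypothetical, and it assumes the target is tracked by its own .dvc file (named <target>.dvc) rather than by a pipeline's dvc.lock.

```shell
# Hypothetical helper: restore one DVC-tracked target as recorded at a Git revision.
# Assumes the target has a metafile named "<target>.dvc" next to it.
# Usage: checkout_at_rev <git-rev> <target>
checkout_at_rev() {
    rev=$1
    target=$2
    # Restore the small .dvc metafile from the chosen commit, tag, or branch ...
    git checkout "$rev" -- "$target.dvc" &&
    # ... then materialize the matching data from the DVC cache.
    dvc checkout "$target.dvc"
}
```

For example, checkout_at_rev v1.0 data.csv would bring data.csv back to the version recorded under the v1.0 tag, provided that version is in the local cache.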
SYNOPSIS
dvc checkout [
Commonly used forms:
dvc checkout
dvc checkout my_data.dvc
dvc checkout data/features/
dvc checkout data.csv model.pkl
PARAMETERS
[
Optional: One or more paths to DVC-tracked files, directories, or .dvc files. If omitted, all DVC-tracked data in the project is checked out.
--relink
Forces recreation of links or copies even if the target is already present and correct. Useful for fixing broken links or changing link types.
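Changing the link type is a two-step operation: set the cache.type configuration option, then relink the workspace. The helper below is a minimal sketch under that assumption; the function name relink_as is hypothetical, and the link types your filesystem supports (for example symlink, hardlink, reflink, copy) should be checked before use.

```shell
# Hypothetical helper: switch the link type DVC uses and re-create workspace links.
# Usage: relink_as <link-type>   (e.g. relink_as symlink)
relink_as() {
    link_type=$1
    # Record the preferred link type in the DVC configuration ...
    dvc config cache.type "$link_type" &&
    # ... then rebuild every link or copy in the workspace using that type.
    dvc checkout --relink
}
```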
-q, --quiet
Suppresses all output messages, showing only errors.
-v, --verbose
Enables verbose output, showing more detailed information about the operation.
--jobs
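Specifies the number of parallel jobs to run when materializing data. This option is primarily associated with data-transfer commands such as dvc pull and dvc fetch; it may not be accepted by dvc checkout in your DVC version, so confirm availability and the default value with dvc checkout --help.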
--no-run-cache
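Relates to the DVC run cache, which dvc repro uses to skip stages whose outputs are already cached. dvc checkout itself only restores data and never runs stages, so this flag may not be accepted by your DVC version; confirm with dvc checkout --help.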
--with-deps
When used with
DESCRIPTION
dvc checkout is a fundamental command in Data Version Control (DVC) used to restore data files and directories tracked by DVC to a specific state or version. It materializes the data from the DVC cache into your workspace.
Similar to how Git restores source code files, dvc checkout ensures data reproducibility by linking or copying the correct data versions identified by corresponding .dvc files. When executed without arguments, it checks out all DVC-tracked files and directories in the current repository. If paths are specified, it operates only on those targets.
This command is used to switch between data versions, revert unwanted changes, or set up a clean workspace that matches a specific Git commit containing DVC-tracked data. It reads from the DVC cache, which stores immutable versions of your data. If the required data is not in the local cache, use dvc pull to retrieve it from remote storage.
CAVEATS
Data Availability: dvc checkout retrieves data from the local DVC cache. If the required data is not in the cache, you will need to run dvc pull first to download it from a configured remote storage.
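The fallback described above can be sketched as follows. This is an illustrative helper, not part of DVC: the name restore_or_pull is hypothetical, and it assumes that dvc checkout exits with a non-zero status when data is missing from the cache and that a default remote is configured.

```shell
# Hypothetical helper: try the local cache first, fall back to remote storage.
# Usage: restore_or_pull [targets...]
restore_or_pull() {
    if ! dvc checkout "$@"; then
        # Assumption: a non-zero exit here means the cache lacks the data.
        echo "dvc checkout failed; trying dvc pull" >&2
        # dvc pull downloads from the remote and then checks the data out.
        dvc pull "$@"
    fi
}
```

Trying the cache first avoids network traffic when the data is already available locally.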
Overwriting Changes: dvc checkout can overwrite files in your workspace that are tracked by DVC. Save any modifications you want to keep (for example with dvc add or dvc commit) before running it.
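One way to guard against accidental overwrites is to check the workspace before restoring. The sketch below is a hypothetical wrapper (safe_checkout is not a DVC command) that relies on dvc status -q exiting with 0 only when data and pipelines are up to date. It is meant to be run before switching Git revisions, since after a git checkout the workspace will normally differ from the metafiles.

```shell
# Hypothetical wrapper: refuse to check out while the workspace has unsaved changes.
# Usage: safe_checkout [targets...]
safe_checkout() {
    # dvc status -q prints nothing and signals changes through its exit status.
    if dvc status -q; then
        dvc checkout "$@"
    else
        echo "Workspace differs from what DVC has recorded; save it with dvc add or dvc commit first." >&2
        return 1
    fi
}
```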
Git Integration: dvc checkout works best when your DVC repository is integrated with a Git repository. It often follows Git checkouts to align data versions with code versions.
HISTORY
DVC (Data Version Control) was first open-sourced by Iterative.ai in 2017. The dvc checkout command has been a core component of DVC since its early versions, designed to mirror the familiarity of Git's checkout mechanism but applied specifically to machine learning data and models. Its purpose is to facilitate the reproducibility and versioning of large datasets and model files, which Git is not designed to handle efficiently. The command has since gained options such as --relink for re-creating links and --with-deps for managing dependencies.
SEE ALSO
dvc add(1) - Adds files and directories to DVC for version control.
dvc pull(1) - Downloads data from DVC remote storage to the local cache and checks it out.
dvc push(1) - Uploads data from the local DVC cache to remote storage.
dvc status(1) - Shows the status of DVC-tracked files and stages.
dvc repro(1) - Reproduces a DVC pipeline by running stages.
git checkout(1) - The analogous Git command for source code, often used in conjunction with dvc checkout.