LinuxCommandLibrary

dvc-init

Initialize DVC repository in a project

TLDR

Initialize a new local repository

$ dvc init

Initialize DVC without Git
$ dvc init --no-scm

Initialize DVC in a subdirectory
$ cd [path/to/subdir] && dvc init --subdir

SYNOPSIS

dvc init [--no-scm] [--subdir] [--force] [--quiet] [--help]

PARAMETERS

--force
    Remove an existing .dvc/ directory before initializing. This also deletes any local cache stored inside it.

--no-scm
    Initialize DVC without Git. DVC does not require or look for a Git repository, and it does not create .gitignore entries or stage its files in Git.

--subdir
    Initialize DVC in a subdirectory that is already part of a parent Git repository. This allows creating a DVC repo nested within a larger Git project without creating a new Git repo in the subdirectory.

--quiet
    Suppress all output from the command.

--help
    Show the help message and exit.

DESCRIPTION

The dvc init command is the first step to start versioning data, models, and machine learning pipelines with DVC (Data Version Control). It sets up a new DVC repository in the current working directory; the command takes no path argument, so change into the target directory first.

When executed, dvc init performs several key actions:
1. Creates the .dvc/ directory: This hidden directory stores DVC's internal files, including configuration (.dvc/config) and, by default, the local cache (.dvc/cache).
2. Initializes configuration: It sets up the basic DVC configuration, which can later be extended to include remote storage settings.
3. Integrates with Git (by default): dvc init expects to run inside an existing Git repository. It creates .dvc/.gitignore so that DVC's cache directory (usually .dvc/cache) and other local-only files stay out of Git, and it stages the new DVC files so they can be committed. Git hooks (e.g., pre-commit) are not installed by dvc init; they are set up separately with dvc install. This integration allows DVC to leverage Git for versioning metadata while handling large data files separately.

It's typically run at the root of a Git repository to ensure seamless version control of both code and data.
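
A typical first-time setup can be sketched as a small shell function. The function name init_dvc_project and the commit message are illustrative assumptions, not part of DVC, and the sketch assumes that git and dvc are both available on PATH.

```shell
# Hypothetical helper: create a project directory, then initialize Git and DVC in it.
init_dvc_project() {
    project_dir="$1"
    [ -n "$project_dir" ] || { echo "usage: init_dvc_project <dir>" >&2; return 2; }

    mkdir -p "$project_dir" || return 1
    (
        cd "$project_dir" || exit 1
        # dvc init (without --no-scm) expects a Git repository,
        # so create one first if it is missing.
        [ -d .git ] || git init --quiet || exit 1
        dvc init --quiet || exit 1
        # Commit the files created by dvc init (assumes DVC has staged them).
        git commit --quiet -m "Initialize DVC" || exit 1
    )
}
```

Calling init_dvc_project [path/to/project] runs the three steps in order and stops at the first one that fails.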

CAVEATS

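1. Existing Repository: If a .dvc directory already exists in the current directory, dvc init exits with an error, preventing accidental re-initialization. Passing --force removes the existing .dvc directory (including any local cache in it) and initializes again.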
2. Git Integration: While DVC can work independently, its full power comes from integration with Git. Using --no-scm skips the automatic .gitignore handling and staging of DVC files, requiring manual setup if Git is later introduced.
3. Repository Root: It's recommended to run dvc init at the root of your project's Git repository. Running it in a subdirectory of a Git repository without --subdir (or --no-scm) fails with an error, because the directory is not itself the root of a Git repository.
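
The first and third caveats can be checked before running the command. The following is a minimal sketch; find_dvc_root and guarded_dvc_init are hypothetical helper names, not DVC commands, and the only DVC call used is dvc init itself.

```shell
# Hypothetical helper: print the nearest directory, starting at the given one
# and walking up, that contains a .dvc/ directory; return 1 if none is found.
find_dvc_root() {
    dir=$(cd "${1:-.}" 2>/dev/null && pwd) || return 1
    while :; do
        if [ -d "$dir/.dvc" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
        [ "$dir" = "/" ] && return 1
        dir=$(dirname "$dir")
    done
}

# Hypothetical wrapper: run dvc init only when the current directory has no
# .dvc/ yet; any extra arguments (e.g. --subdir) are passed through unchanged.
guarded_dvc_init() {
    if [ -d .dvc ]; then
        echo "refusing to re-initialize: .dvc/ already exists in $(pwd)" >&2
        return 1
    fi
    dvc init "$@"
}
```

Running find_dvc_root from inside a project shows which directory DVC would treat as the repository root, which helps when deciding whether --subdir is needed.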

DEFAULT GIT INTEGRATION

By default, dvc init integrates with Git and expects to be run inside an existing Git repository. It creates .dvc/.gitignore so that DVC's cache directory (usually .dvc/cache) and other local-only files are not committed to Git, and it stages the newly created DVC files so they can be committed alongside your code. Git hooks are not installed by dvc init itself; running dvc install adds them (for example pre-commit, pre-push, and post-checkout). This integration lets users apply familiar Git workflows to both code and data.

CORE FILES CREATED

Upon successful initialization, dvc init creates the following key components:
1. .dvc/ directory: The central hub for DVC's operations, containing configuration, the default local cache, and other internal files.
2. .dvc/.gitignore file: Created inside the .dvc/ directory (when Git integration is enabled) so that local-only files such as the cache and temporary files are not tracked by Git.
3. .dvc/config file: The primary configuration file for the DVC repository, where you can define remote storage, cache settings, and other project-specific options.
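
A quick way to confirm that the components listed above exist after initialization is sketched below. The helper name check_dvc_layout is hypothetical, and the list of paths is taken only from this section; a real DVC version may create additional files.

```shell
# Hypothetical helper: report which of the expected DVC files are missing
# under the given project directory (defaults to the current directory).
# Returns 0 when everything is present, 1 otherwise.
check_dvc_layout() {
    project_dir="${1:-.}"
    missing=0
    for path in .dvc .dvc/config .dvc/.gitignore; do
        if [ ! -e "$project_dir/$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    return "$missing"
}
```

In a project initialized with --no-scm, a report that .dvc/.gitignore is missing is expected rather than a sign of a failed initialization.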

HISTORY

DVC (Data Version Control) was created by Iterative.ai, with its initial release in 2017. It was designed to address the challenges of versioning large datasets and machine learning models, which traditional source control systems like Git struggle with due to their size. dvc init has been a foundational command since DVC's inception, serving as the gateway to establishing a DVC-managed project. Its development has mirrored DVC's evolution, becoming more robust in its Git integration and handling of diverse project structures.

SEE ALSO

dvc add(1), dvc run(1), dvc pull(1), dvc push(1), git-init(1), git(1)
