LinuxCommandLibrary

kaggle-datasets

Manage Kaggle datasets

TLDR

List all datasets owned by a user or organization
$ kaggle datasets list --user [username]

Search for datasets by name
$ kaggle datasets list --search "[dataset_name]"

Download a dataset
$ kaggle datasets download "[owner/dataset_name]"

Create a public dataset
$ kaggle datasets create --path [path/to/dataset] --public

Download a dataset's metadata
$ kaggle datasets metadata [owner/dataset_name]

Initialize a metadata file for a dataset
$ kaggle datasets init --path [path/to/dataset]

Delete a dataset
$ kaggle datasets delete [owner/dataset_name]

SYNOPSIS

kaggle datasets <list | files | download | create | init | version | status | delete | update | metadata | private | public | roles | add-tag | remove-tag> [options]

PARAMETERS

list
    Lists available datasets on Kaggle. Results can be filtered by search term, user, size, file type, license, and tags, and sorted with --sort-by.

files <dataset-slug>
    Displays a list of all files contained within a specified dataset.

download <dataset-slug>
    Downloads files from a specified dataset to the local filesystem. Supports downloading specific files or all dataset content, and can skip existing files.

create -p <path>
    Creates a new dataset on Kaggle from files located in a local directory. This operation requires a correctly formatted dataset-metadata.json file within the directory for initial setup.

init -p <path>
    Initializes a local directory for a new dataset by generating a template dataset-metadata.json file, which the user then edits to set the dataset's title, slug, and license.

version -p <path> -m <notes>
    Creates a new version of an existing dataset from the files in a local directory. The directory must contain the dataset's dataset-metadata.json file, and -m supplies the required version notes.

status <dataset-slug>
    Checks and reports the current upload or processing status of a dataset, useful after creating or updating a dataset.

delete <dataset-slug>
    Permanently deletes a specified dataset from Kaggle. This action is irreversible.

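update -p <path>
    Updates an existing dataset with new files or modified metadata from a local directory. Not every client version provides this subcommand; where it is missing, version -p <path> -m <notes> publishes a new dataset version.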

metadata <dataset-slug>
    Downloads a dataset's metadata, such as its title, description, and tags, to a local file. With --update, uploads an edited metadata file back to Kaggle.

private <dataset-slug>
    Changes the visibility of the specified dataset to private, restricting access to only its collaborators.

public <dataset-slug>
    Changes the visibility of the specified dataset to public, making it accessible to the entire Kaggle community.

roles <dataset-slug>
    Manages user roles and permissions for a dataset, allowing sharing and collaboration control.

add-tag <dataset-slug> <tag>
    Adds a specified tag to a dataset, improving its discoverability and categorization.

remove-tag <dataset-slug> <tag>
    Removes a specified tag from a dataset.
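
As a rough illustration of how the list, files, and download subcommands fit together, the following sketch prints (or, with DRY_RUN=0, runs) a find-inspect-download sequence. The run wrapper and fetch_dataset helper are hypothetical names introduced here, and the slug, search term, and target directory are placeholders. It assumes the installed client accepts -s, -p, and --unzip as shown; kaggle datasets download --help confirms this for a given version.

```shell
#!/bin/sh
# Dry-run sketch: by default each command is only printed.
# Set DRY_RUN=0 to execute (requires the kaggle client and valid credentials).
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical wrapper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# fetch_dataset is a hypothetical helper; the slug and destination are caller-supplied.
fetch_dataset() {
    slug="$1"
    dest="$2"
    run kaggle datasets files "$slug"                        # inspect the file list first
    run kaggle datasets download "$slug" -p "$dest" --unzip  # then download and extract
}

run kaggle datasets list -s "search term"
fetch_dataset "owner/dataset-name" "./data"
```

Keeping the dry-run default means the script can be reviewed safely before any download starts.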

DESCRIPTION

kaggle datasets is the dataset-management subcommand of the Kaggle API client, a command-line interface to datasets hosted on Kaggle.com. It lets users list, search, download, create, update, and delete datasets, and manage their metadata and permissions, from the terminal or from scripts. This makes it possible to automate data acquisition, publication, and versioning as part of a data pipeline. Users can also inspect a dataset's files, monitor upload status, and control dataset visibility (public or private) without using the Kaggle website.

CAVEATS

kaggle datasets requires Kaggle API credentials to be configured first, typically through a kaggle.json file in the ~/.kaggle/ directory. Datasets must be referenced by slug in the form ownerUsername/dataset-name. Downloads and uploads of large datasets can be slow and bandwidth-intensive. Dataset creation and updates fail unless the dataset-metadata.json file is present and correctly formatted. The set of available subcommands varies between client versions; run kaggle datasets --help to see what the installed version supports.
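
To make the metadata requirement concrete, here is a minimal sketch that writes a dataset-metadata.json into a scratch directory. The three fields are modelled on the template that kaggle datasets init is understood to generate. The title, id, and license values are placeholders, not values taken from this page, and should be replaced before publishing.

```shell
#!/bin/sh
# Sketch: write a minimal dataset-metadata.json into a scratch directory.
# The title, id, and license below are placeholder values.
dataset_dir="$(mktemp -d)"

cat > "$dataset_dir/dataset-metadata.json" <<'EOF'
{
  "title": "Example Dataset Title",
  "id": "owner/example-dataset",
  "licenses": [{"name": "CC0-1.0"}]
}
EOF

# Optional sanity check: confirm the file parses as JSON when python3 is available.
if command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$dataset_dir/dataset-metadata.json" >/dev/null \
        && echo "metadata file is valid JSON"
fi

echo "wrote $dataset_dir/dataset-metadata.json"
```

With data files placed alongside this file, the directory could then be passed to kaggle datasets create -p.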

AUTHENTICATION

The kaggle datasets subcommands, like the rest of the Kaggle API client, require authentication. This is typically done by placing a kaggle.json file, which contains your Kaggle username and API key, in the ~/.kaggle/ directory. The file can be generated and downloaded from your Kaggle account settings page.
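
The credentials file can also be put in place by hand. The sketch below writes a kaggle.json with owner-only permissions into a scratch directory so that it is safe to run as-is. The target_dir variable is a hypothetical name, and the username and key are placeholders. For real use, target_dir would be "$HOME/.kaggle" and the values would come from your Kaggle account.

```shell
#!/bin/sh
# Sketch: create a kaggle.json that only its owner can read.
# target_dir is a scratch directory here; use "$HOME/.kaggle" for real credentials.
target_dir="${target_dir:-$(mktemp -d)}"
mkdir -p "$target_dir"

# Write the file under a restrictive umask so it is never world-readable.
(
    umask 077
    printf '{"username":"%s","key":"%s"}\n' "your-username" "your-api-key" \
        > "$target_dir/kaggle.json"
)

# Set the mode explicitly in case the file already existed with looser permissions.
chmod 600 "$target_dir/kaggle.json"

ls -l "$target_dir/kaggle.json"
```

As an alternative to the file, the client also reads the KAGGLE_USERNAME and KAGGLE_KEY environment variables, which can be more convenient in CI jobs.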

DATASET SLUG FORMAT

Many kaggle datasets subcommands require a dataset slug to identify the target dataset. The slug has the form ownerUsername/dataset-name. For instance, kaggle/meta-kaggle refers to the Meta Kaggle dataset published by Kaggle itself.
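
A script can check the shape of a slug before passing it to the client. The helper below is a hypothetical sketch, not part of the kaggle CLI. It checks only that there is exactly one slash with a non-empty part on each side and no spaces; it does not check Kaggle's own naming rules or whether the dataset exists.

```shell
#!/bin/sh
# is_dataset_slug is a hypothetical helper, not part of the kaggle CLI.
# It succeeds only for strings shaped like owner/dataset-name.
is_dataset_slug() {
    case "$1" in
        */*/*) return 1 ;;   # more than one slash
        *" "*) return 1 ;;   # spaces are not allowed
        ?*/?*) return 0 ;;   # non-empty owner and non-empty dataset name
        *)     return 1 ;;   # no slash, or an empty side
    esac
}

slug="owner/dataset-name"    # placeholder slug
if is_dataset_slug "$slug"; then
    echo "slug looks well-formed: $slug"
else
    echo "not a dataset slug: $slug" >&2
fi
```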

HISTORY

The Kaggle API and its command-line client were developed to provide programmatic access to Kaggle's platform beyond the web interface. They allow datasets, competition submissions, and kernels to be managed from scripts or CI/CD pipelines. The datasets subcommand covers the dataset lifecycle: publishing new datasets, maintaining versions, and sharing them with the community.

SEE ALSO

kaggle competitions(1), kaggle kernels(1), kaggle models(1), curl(1)
