duckdb
Run SQL queries on data
TLDR
Start an interactive shell with a transient in-memory database
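`duckdb`
(In the commands below, file paths, table names, and queries are illustrative placeholders.)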
Start an interactive shell on a database file. If the file does not exist, a new database is created
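`duckdb path/to/database.duckdb`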
Query a CSV, JSON, or Parquet file using SQL
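`duckdb -c "SELECT * FROM 'input.csv'"`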
Directly query a CSV, JSON, or Parquet file using the file view
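`duckdb input.csv -c "SELECT * FROM file"`
(The automatic `file` view over a data file passed on the command line is available in recent DuckDB releases.)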
Run an SQL script
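`duckdb -f path/to/script.sql`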
Run a query on a database file and keep the interactive shell open
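`duckdb path/to/database.duckdb -cmd "SELECT * FROM tbl;"`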
Read CSV from stdin and write CSV to stdout
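`cat input.csv | duckdb -c "COPY (FROM read_csv('/dev/stdin')) TO '/dev/stdout' (FORMAT csv, HEADER)"`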
Start the DuckDB UI, a web interface with notebooks
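`duckdb -ui`
(The `-ui` flag and the ui extension it loads require a recent DuckDB release.)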
SYNOPSIS
duckdb [OPTIONS] [DATABASE_FILE]
PARAMETERS
DATABASE_FILE
(Optional) The path to the database file. If omitted, DuckDB operates as an in-memory database, and all data is lost upon exit.
-readonly | --readonly
Opens the specified database file in read-only mode, preventing any modifications.
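For example: `duckdb -readonly path/to/database.duckdb`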
-s "SQL_QUERY" | -c "SQL_QUERY"
Executes a single SQL query or command provided as a string and then exits the DuckDB shell. Useful for scripting.
--version
Displays the DuckDB version information and exits.
--help
Displays the command-line help message and exits.
-init FILE
Executes SQL commands from the specified FILE upon starting the DuckDB shell, before any interactive input or queries.
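For example, assuming prelude.sql is a file of SQL statements or dot commands you have prepared: `duckdb -init prelude.sql`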
DESCRIPTION
DuckDB is an open-source, in-process analytical database management system (DBMS) designed for high-performance OLAP (Online Analytical Processing) workloads.
Unlike traditional client-server databases, DuckDB runs directly within the application process, which makes it lightweight and easy to embed and eliminates network overhead. It supports standard SQL, making it accessible to anyone familiar with relational databases.
Key features include column-oriented storage, vectorized query execution, and parallel processing, which contribute to its exceptional speed for analytical queries involving aggregations, joins, and complex filtering on large datasets. It integrates seamlessly with data analysis tools and programming languages like Python and R. DuckDB can operate entirely in memory or persist data to a single, compact file, offering flexibility for both temporary and persistent data storage.
CAVEATS
In-Memory Default: If no DATABASE_FILE is specified, DuckDB runs entirely in memory, and all data is lost when the command process terminates. For persistent storage, always specify a database file name.
Memory Consumption: While efficient, DuckDB can consume significant RAM on large analytical queries. Many operators can spill intermediate results to disk, but workloads whose working set far exceeds available memory may still slow down or fail.
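The memory budget can be capped from SQL; for example, to limit DuckDB to roughly 2 GB (the value is illustrative):
`SET memory_limit = '2GB';`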
Concurrency: DuckDB is designed for single-process analytical workloads and is not a replacement for multi-user, client-server OLTP (Online Transaction Processing) databases. It lacks client-server features such as a network protocol, fine-grained user management, and high-availability replication.
PERSISTENT VS. IN-MEMORY OPERATION
When invoked without a DATABASE_FILE (e.g., `duckdb`), it creates a temporary in-memory database that is discarded when the process exits. To save your data permanently, provide a file path (e.g., `duckdb my_data.duckdb`). All data, tables, and views will be stored within this single file.
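A minimal sketch of persistence across invocations (the file and table names are placeholders):
`duckdb my_data.duckdb -c "CREATE TABLE t AS SELECT 42 AS answer"`
`duckdb my_data.duckdb -c "SELECT * FROM t"`
The second invocation still sees table t, because it lives in my_data.duckdb rather than in memory.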
SQL DIALECT AND EXTENSIONS
DuckDB supports a large subset of standard SQL and includes many analytical extensions, window functions, and advanced data types (e.g., STRUCT, MAP, ARRAY) optimized for data analysis. It also offers powerful SQL functions for file system interaction and external data loading.
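For instance, nested types and window functions combine in ordinary queries (an illustrative sketch):
`SELECT {'name': 'duck', 'tags': ['fast', 'small']} AS info;`
`SELECT x, sum(x) OVER (ORDER BY x) AS running_total FROM range(5) t(x);`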
DIRECT FILE QUERYING
A unique strength of DuckDB is its ability to directly query various data formats (such as Parquet, CSV, JSON, and Apache Arrow) without an explicit loading step. You can run SQL queries directly on files on your local filesystem or in remote storage (HTTP/S3), e.g., `SELECT * FROM 'data.parquet';`.
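For example (the file names and URL are placeholders; querying over HTTP/S3 requires the httpfs extension):
`duckdb -c "SELECT count(*) FROM 'data.parquet'"`
`duckdb -c "SELECT * FROM read_csv('data.csv') LIMIT 5"`
`duckdb -c "INSTALL httpfs; LOAD httpfs; SELECT * FROM 'https://example.com/data.parquet' LIMIT 5"`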
HISTORY
DuckDB originated from research at CWI (Centrum Wiskunde & Informatica), the Dutch national research institute for mathematics and computer science. Its first public releases appeared around 2019-2020. It was developed to provide an embedded, high-performance analytical database for the growing ecosystem of data science and analytics tools. With its emphasis on columnar storage, vectorized execution, and single-file deployment, it quickly gained popularity as a powerful option for local data analysis workflows.