duckdb
Run SQL queries on data
TLDR
Start an interactive shell with a transient in-memory database
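`duckdb`
(In the commands below, file paths, table names, and queries are illustrative placeholders.)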
Start an interactive shell on a database file. If the file does not exist, a new database is created
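`duckdb path/to/database.duckdb`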
Query a CSV, JSON, or Parquet file using SQL
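`duckdb -c "SELECT * FROM 'input.csv'"`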
Directly query a CSV, JSON, or Parquet file using the file view
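`duckdb input.csv -c "SELECT * FROM file"`
(The automatic `file` view over a data file passed on the command line is available in recent DuckDB releases.)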
Run an SQL script
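`duckdb -f path/to/script.sql`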
Run a query on a database file and keep the interactive shell open
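`duckdb path/to/database.duckdb -cmd "SELECT * FROM tbl;"`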
Read CSV from stdin and write CSV to stdout
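`cat input.csv | duckdb -c "COPY (FROM read_csv('/dev/stdin')) TO '/dev/stdout' (FORMAT csv, HEADER)"`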
Start the DuckDB UI, a web interface with notebooks
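`duckdb -ui`
(The `-ui` flag and the ui extension it loads require a recent DuckDB release.)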
SYNOPSIS
duckdb [OPTIONS] [DATABASE_FILE]
PARAMETERS
DATABASE_FILE
(Optional) The path to the database file. If omitted, DuckDB operates as an in-memory database, and all data is lost upon exit.
-readonly | --readonly
Opens the specified database file in read-only mode, preventing any modifications.
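For example: `duckdb -readonly path/to/database.duckdb`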
-s "SQL_QUERY" | -c "SQL_QUERY"
Executes a single SQL query or command provided as a string and then exits the DuckDB shell. Useful for scripting.
--version
Displays the DuckDB version information and exits.
--help
Displays the command-line help message and exits.
-init FILE
Executes SQL commands from the specified FILE upon starting the DuckDB shell, before any interactive input or queries.
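For example, assuming prelude.sql is a file of SQL statements or dot commands you have prepared: `duckdb -init prelude.sql`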
DESCRIPTION
DuckDB is an open-source, in-process analytical database management system (DBMS) designed for high-performance OLAP (Online Analytical Processing) workloads.
Unlike traditional client-server databases, DuckDB runs directly within the application process, which makes it lightweight and easy to embed and eliminates network overhead. It supports standard SQL, making it accessible to anyone familiar with relational databases.
Key features include column-oriented storage, vectorized query execution, and parallel processing, which contribute to its exceptional speed for analytical queries involving aggregations, joins, and complex filtering on large datasets. It integrates seamlessly with data analysis tools and programming languages like Python and R. DuckDB can operate entirely in memory or persist data to a single, compact file, offering flexibility for both temporary and persistent data storage.
CAVEATS
In-Memory Default: If no DATABASE_FILE is specified, DuckDB runs entirely in memory, and all data is lost when the command process terminates. For persistent storage, always specify a database file name.
Memory Consumption: While efficient, DuckDB can consume significant RAM on large analytical queries. Many operators can spill intermediate results to disk, but workloads whose working set far exceeds available memory may still slow down or fail.
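The memory budget can be capped from SQL; for example, to limit DuckDB to roughly 2 GB (the value is illustrative):
`SET memory_limit = '2GB';`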
Concurrency: DuckDB is designed for single-process analytical workloads and is not a replacement for multi-user, client-server OLTP (Online Transaction Processing) databases. It lacks client-server features such as a network protocol, fine-grained user management, and high-availability replication.
PERSISTENT VS. IN-MEMORY OPERATION
When invoked without a DATABASE_FILE (e.g., `duckdb`), it creates a temporary in-memory database that is discarded when the process exits. To save your data permanently, provide a file path (e.g., `duckdb my_data.duckdb`). All data, tables, and views will be stored within this single file.
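A minimal sketch of persistence across invocations (the file and table names are placeholders):
`duckdb my_data.duckdb -c "CREATE TABLE t AS SELECT 42 AS answer"`
`duckdb my_data.duckdb -c "SELECT * FROM t"`
The second invocation still sees table t, because it lives in my_data.duckdb rather than in memory.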
SQL DIALECT AND EXTENSIONS
DuckDB supports a large subset of standard SQL and includes many analytical extensions, window functions, and advanced data types (e.g., STRUCT, MAP, ARRAY) optimized for data analysis. It also offers powerful SQL functions for file system interaction and external data loading.
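For instance, nested types and window functions combine in ordinary queries (an illustrative sketch):
`SELECT {'name': 'duck', 'tags': ['fast', 'small']} AS info;`
`SELECT x, sum(x) OVER (ORDER BY x) AS running_total FROM range(5) t(x);`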
DIRECT FILE QUERYING
A unique strength of DuckDB is its ability to directly query various data formats (such as Parquet, CSV, JSON, and Apache Arrow) without an explicit loading step. You can run SQL queries directly on files on your local filesystem or in remote storage (HTTP/S3), e.g., `SELECT * FROM 'data.parquet';`.
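For example (the file names and URL are placeholders; querying over HTTP/S3 requires the httpfs extension):
`duckdb -c "SELECT count(*) FROM 'data.parquet'"`
`duckdb -c "SELECT * FROM read_csv('data.csv') LIMIT 5"`
`duckdb -c "INSTALL httpfs; LOAD httpfs; SELECT * FROM 'https://example.com/data.parquet' LIMIT 5"`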
HISTORY
DuckDB originated from research at CWI (Centrum Wiskunde & Informatica), the Dutch national research institute for mathematics and computer science. Its first public releases appeared around 2019-2020. It was developed to provide an embedded, high-performance analytical database for the growing ecosystem of data science and analytics tools. With its emphasis on columnar storage, vectorized execution, and single-file deployment, it quickly gained popularity as a powerful option for local data analysis workflows.