LinuxCommandLibrary

csvsql

Execute SQL queries against CSV files

TLDR

Generate a CREATE TABLE SQL statement for a CSV file

$ csvsql [path/to/data.csv]
copy

Import a CSV file into an SQL database
$ csvsql --insert --db "[mysql://user:password@host/database]" [data.csv]
copy

Run an SQL query on a CSV file
$ csvsql --query "[select * from 'data']" [data.csv]
copy

SYNOPSIS

csvsql [OPTIONS] FILES...

Examples:
csvsql --query "SELECT column_a, column_b FROM data.csv WHERE column_a > 10"
csvsql --db sqlite:///my_data.db --insert file1.csv file2.csv

PARAMETERS

--query
    The SQL query to execute against the input CSV files.

--tables
    List table names derived from the input CSV filenames without executing a query.

--dialect
    Specify the SQL dialect for the query (e.g., sqlite, postgresql, mysql). Defaults to sqlite.

--db
    Provide a SQLAlchemy connection string to connect to an external database (e.g., postgresql://user:pass@host/db).

--insert
    Insert data from the CSV files into the database specified by --db. CSV filenames become table names.

--no-create
    Used with --insert, it prevents csvsql from creating tables if they already exist in the database.

--table
    Specify a single table name to use for the input CSV file, overriding the default filename-based name.

--sniff-limit
    Limit the number of bytes to sniff from the start of the file for column type inference (default: 102400).

--no-inference
    Disable column type inference; all columns will be treated as text.

--encoding
    Specify the input file encoding (default: utf-8).

--unique-constraint
    Add a unique constraint to the specified column(s) when creating a table via --insert.

--no-header-row
    Treat the first row as data rather than column headers.

--blanks
    Do not convert empty strings to NULLs when inserting data into the database.

--skip-errors
    Skip problematic rows that cause errors during data processing or insertion.

DESCRIPTION


csvsql is a powerful command-line tool from the csvkit suite that enables users to execute SQL queries directly against one or more CSV files. It treats each CSV file as a database table, intelligently inferring column data types by default. This capability allows for complex data manipulation, filtering, aggregation, and joining operations on tabular data without the overhead of loading it into a traditional relational database management system. It leverages SQLAlchemy for database abstraction, meaning it can query CSVs in-memory (using SQLite) or connect to external databases to create tables and insert data. Ideal for quick data analysis and transformations.

CAVEATS

While highly convenient, csvsql's performance for extremely large files (gigabytes) might be slower than a dedicated database import and query. Its column type inference, though intelligent, is not always flawless and may require manual override with --no-inference or type casting within SQL. Memory usage can also be a factor for very large datasets if the entire file needs to be loaded into memory for processing.

IN-MEMORY SQL VS. EXTERNAL DATABASE

By default, csvsql uses an in-memory SQLite database to process queries against CSV files. This is fast and requires no external setup. However, with the --db option, it can connect to and query any database supported by SQLAlchemy (e.g., PostgreSQL, MySQL, Oracle), allowing for data insertion or more persistent query operations.

COLUMN TYPE INFERENCE

csvsql attempts to automatically detect column data types (e.g., text, integer, float, date) by inspecting the first few rows of each CSV. This simplifies querying as you don't need to manually define schemas. For specific cases, --sniff-limit controls the amount of data sniffed, and --no-inference disables this feature entirely, treating all columns as text.

HISTORY

csvsql is an integral part of csvkit, an open-source suite of utilities created by Christopher Groskopf. Initiated around 2011, csvkit was developed to address the common need for robust, Unix-pipe-friendly tools to process CSV data, often mimicking the capabilities found in traditional databases or spreadsheet software. csvsql specifically aimed to bring the power and flexibility of SQL queries directly to flat CSV files, becoming a cornerstone for data analysts and developers working with structured text data.

SEE ALSO

csvcut(1), csvjoin(1), csvgrep(1), csvsort(1), csvstack(1), sqlite3(1)

Copied to clipboard