csvsql
Execute SQL queries against CSV files
TLDR
Generate a CREATE TABLE SQL statement for a CSV file
Import a CSV file into an SQL database
Run an SQL query on a CSV file
SYNOPSIS
csvsql [OPTIONS] FILES...
Examples:
csvsql --query "SELECT column_a, column_b FROM data.csv WHERE column_a > 10"
csvsql --db sqlite:///my_data.db --insert file1.csv file2.csv
PARAMETERS
--query
The SQL query to execute against the input CSV files.
--tables
List table names derived from the input CSV filenames without executing a query.
--dialect
Specify the SQL dialect for the query (e.g., sqlite, postgresql, mysql). Defaults to sqlite.
--db
Provide a SQLAlchemy connection string to connect to an external database (e.g., postgresql://user:pass@host/db).
--insert
Insert data from the CSV files into the database specified by --db. CSV filenames become table names.
--no-create
Used with --insert, it prevents csvsql from creating tables if they already exist in the database.
--table
Specify a single table name to use for the input CSV file, overriding the default filename-based name.
--sniff-limit
Limit the number of bytes to sniff from the start of the file for column type inference (default: 102400).
--no-inference
Disable column type inference; all columns will be treated as text.
--encoding
Specify the input file encoding (default: utf-8).
--unique-constraint
Add a unique constraint to the specified column(s) when creating a table via --insert.
--no-header-row
Treat the first row as data rather than column headers.
--blanks
Do not convert empty strings to NULLs when inserting data into the database.
--skip-errors
Skip problematic rows that cause errors during data processing or insertion.
DESCRIPTION
csvsql is a powerful command-line tool from the csvkit suite that enables users to execute SQL queries directly against one or more CSV files. It treats each CSV file as a database table, intelligently inferring column data types by default. This capability allows for complex data manipulation, filtering, aggregation, and joining operations on tabular data without the overhead of loading it into a traditional relational database management system. It leverages SQLAlchemy for database abstraction, meaning it can query CSVs in-memory (using SQLite) or connect to external databases to create tables and insert data. Ideal for quick data analysis and transformations.
CAVEATS
While highly convenient, csvsql's performance for extremely large files (gigabytes) might be slower than a dedicated database import and query. Its column type inference, though intelligent, is not always flawless and may require manual override with --no-inference or type casting within SQL. Memory usage can also be a factor for very large datasets if the entire file needs to be loaded into memory for processing.
IN-MEMORY SQL VS. EXTERNAL DATABASE
By default, csvsql uses an in-memory SQLite database to process queries against CSV files. This is fast and requires no external setup. However, with the --db option, it can connect to and query any database supported by SQLAlchemy (e.g., PostgreSQL, MySQL, Oracle), allowing for data insertion or more persistent query operations.
COLUMN TYPE INFERENCE
csvsql attempts to automatically detect column data types (e.g., text, integer, float, date) by inspecting the first few rows of each CSV. This simplifies querying as you don't need to manually define schemas. For specific cases, --sniff-limit controls the amount of data sniffed, and --no-inference disables this feature entirely, treating all columns as text.
HISTORY
csvsql is an integral part of csvkit, an open-source suite of utilities created by Christopher Groskopf. Initiated around 2011, csvkit was developed to address the common need for robust, Unix-pipe-friendly tools to process CSV data, often mimicking the capabilities found in traditional databases or spreadsheet software. csvsql specifically aimed to bring the power and flexibility of SQL queries directly to flat CSV files, becoming a cornerstone for data analysts and developers working with structured text data.