csvgrep
Search CSV files based on column content
TLDR
Find rows that have a certain string in column 1
Find rows in which columns 3 or 4 match a certain regex
Find rows in which the "name" column does NOT include the string "John Doe"
SYNOPSIS
csvgrep [CSV_OPTIONS] -c COLUMNS (-r REGEX | -f REGEX_FILE) [--not-match] [FILE]
PARAMETERS
-h, --help
Displays the help message and exits.
-d DELIMITER, --delimiter DELIMITER
Specifies the single-character field delimiter of the input CSV file. Default is comma.
-t, --tabs
Treats the input CSV file as tab-delimited. Overrides -d.
-q QUOTECHAR, --quotechar QUOTECHAR
Specifies the single-character quoting character of the input CSV file. Default is double quote.
-u {0,1,2,3}, --quoting {0,1,2,3}
Controls when quotes are added to output fields. 0 = QUOTE_MINIMAL, 1 = QUOTE_ALL, 2 = QUOTE_NONNUMERIC, 3 = QUOTE_NONE.
--skip-lines SKIP_LINES
The number of initial non-CSV lines to skip before processing the CSV data.
--blanks
Do not convert quoted empty strings to NULL values. By default, empty strings are treated as NULL.
--no-inference
Disables type inference when parsing the input. All values will be treated as strings.
--zero
Enables 0-based indexing for columns. By default, 1-based indexing is used.
-c COLUMNS, --columns COLUMNS
A comma-separated list of column names or 1-based (or 0-based with --zero) indices to operate on. This option is mandatory for filtering.
-r REGEX, --regex REGEX
A regular expression to match against the specified columns. Mutually exclusive with -f.
-f REGEX_FILE, --file REGEX_FILE
A file containing regular expressions (one per line) to match. Mutually exclusive with -r.
-i, --ignore-case
Perform case-insensitive matching for regular expressions.
-m, --match
When multiple columns or expressions are used, match if all of them are true (AND logic). By default, it's OR logic.
--not-match
Instead of matching rows, exclude them (inverse matching).
FILE
The path to the CSV file to operate on. If omitted, csvgrep reads from standard input (stdin).
DESCRIPTION
csvgrep is a powerful command-line utility from the csvkit suite, designed specifically for filtering rows in CSV files based on values within specified columns. It operates akin to the traditional grep command but is optimized for structured CSV data, understanding headers, columns, and various CSV dialects. Users can target columns by name or numerical index and apply regular expressions to match or exclude rows. With options for case-insensitive matching (-i), logical AND/OR combinations (-m), and inverse matching (--not-match), csvgrep is invaluable for data cleaning, extracting specific subsets, or preparing data for further analysis. It seamlessly integrates with other csvkit tools and standard UNIX pipes.
CAVEATS
Due to its Python-based implementation, csvgrep might exhibit slower performance compared to native C-based tools for very large datasets, especially when not streaming data efficiently. Regular expressions are processed using Python's re module, so users should be familiar with its specific syntax. For extremely large files, memory usage can be a concern if the entire file is loaded into memory, though csvkit generally aims for streaming where possible.
REGULAR EXPRESSION SYNTAX
csvgrep leverages Python's powerful re module for regular expression processing. This provides a rich and flexible pattern matching capability. Users should consult Python's official regular expression documentation for detailed syntax and features.
INPUT AND OUTPUT
By default, csvgrep reads CSV data from standard input (stdin) if no FILE is specified, and writes the filtered output to standard output (stdout). This design promotes seamless integration into UNIX pipelines, allowing it to be chained with other csvkit commands or traditional UNIX utilities like head, tail, or less.
COLUMN SELECTION FLEXIBILITY
Columns can be selected either by their header name (e.g., -c "City,State") or by their numerical index (e.g., -c 1,2). By default, column indices are 1-based, but the --zero option can be used to switch to 0-based indexing for convenience.
HISTORY
csvgrep is an integral part of csvkit, an open-source suite of command-line tools for working with CSV files. Conceived by Christopher Groskopf and initially released around 2012, csvkit was developed to bridge the gap between traditional UNIX text processing utilities and the structured nature of CSV data. It provides familiar functionality like grep, cut, and sort but with full awareness of CSV's columnar structure and various dialects, making it a robust and user-friendly solution for data professionals.