csvgrep

Search CSV files based on column content

TLDR

Find rows that have a certain string in column 1

$ csvgrep [[-c|--columns]] [1] [[-m|--match]] [string_to_match] [data.csv]

Find rows in which columns 3 or 4 match a certain regex

$ csvgrep [[-c|--columns]] [3,4] [[-r|--regex]] [regex] [data.csv]

Find rows in which the "name" column does NOT include the string "John Doe"

$ csvgrep [[-i|--invert-match]] [[-c|--columns]] [name] [[-m|--match]] "[John Doe]" [data.csv]

SYNOPSIS

csvgrep [CSV_OPTIONS] -c COLUMNS (-r REGEX | -f REGEX_FILE) [--not-match] [FILE]

-h, --help
    Displays the help message and exits.

-d DELIMITER, --delimiter DELIMITER
    Specifies the single-character field delimiter of the input CSV file. Default is comma.

-t, --tabs
    Treats the input CSV file as tab-delimited. Overrides -d.

-q QUOTECHAR, --quotechar QUOTECHAR
    Specifies the single-character quoting character of the input CSV file. Default is double quote.

-u {0,1,2,3}, --quoting {0,1,2,3}
    Controls when quotes are added to output fields. 0 = QUOTE_MINIMAL, 1 = QUOTE_ALL, 2 = QUOTE_NONNUMERIC, 3 = QUOTE_NONE.

--skip-lines SKIP_LINES
    The number of initial non-CSV lines to skip before processing the CSV data.

--blanks
    Do not convert quoted empty strings to NULL values. By default, empty strings are treated as NULL.

--no-inference
    Disables type inference when parsing the input. All values will be treated as strings.

--zero
    Enables 0-based indexing for columns. By default, 1-based indexing is used.

-c COLUMNS, --columns COLUMNS
    A comma-separated list of column names or 1-based (or 0-based with --zero) indices to operate on. This option is mandatory for filtering.

-r REGEX, --regex REGEX
    A regular expression to match against the specified columns. Mutually exclusive with -f.

-f REGEX_FILE, --file REGEX_FILE
    A file containing regular expressions (one per line) to match. Mutually exclusive with -r.

-i, --ignore-case
    Perform case-insensitive matching for regular expressions.

-m, --match
    When multiple columns or expressions are used, match if all of them are true (AND logic). By default, it's OR logic.

--not-match
    Instead of matching rows, exclude them (inverse matching).

FILE
    The path to the CSV file to operate on. If omitted, csvgrep reads from standard input (stdin).

DESCRIPTION

csvgrep is a powerful command-line utility from the csvkit suite, designed specifically for filtering rows in CSV files based on values within specified columns. It operates akin to the traditional grep command but is optimized for structured CSV data, understanding headers, columns, and various CSV dialects. Users can target columns by name or numerical index and apply regular expressions to match or exclude rows. With options for case-insensitive matching (-i), logical AND/OR combinations (-m), and inverse matching (--not-match), csvgrep is invaluable for data cleaning, extracting specific subsets, or preparing data for further analysis. It seamlessly integrates with other csvkit tools and standard UNIX pipes.

CAVEATS

Due to its Python-based implementation, csvgrep might exhibit slower performance compared to native C-based tools for very large datasets, especially when not streaming data efficiently. Regular expressions are processed using Python's re module, so users should be familiar with its specific syntax. For extremely large files, memory usage can be a concern if the entire file is loaded into memory, though csvkit generally aims for streaming where possible.

REGULAR EXPRESSION SYNTAX

csvgrep leverages Python's powerful re module for regular expression processing. This provides a rich and flexible pattern matching capability. Users should consult Python's official regular expression documentation for detailed syntax and features.

INPUT AND OUTPUT

By default, csvgrep reads CSV data from standard input (stdin) if no FILE is specified, and writes the filtered output to standard output (stdout). This design promotes seamless integration into UNIX pipelines, allowing it to be chained with other csvkit commands or traditional UNIX utilities like head, tail, or less.

COLUMN SELECTION FLEXIBILITY

Columns can be selected either by their header name (e.g., -c "City,State") or by their numerical index (e.g., -c 1,2). By default, column indices are 1-based, but the --zero option can be used to switch to 0-based indexing for convenience.

HISTORY

csvgrep is an integral part of csvkit, an open-source suite of command-line tools for working with CSV files. Conceived by Christopher Groskopf and initially released around 2012, csvkit was developed to bridge the gap between traditional UNIX text processing utilities and the structured nature of CSV data. It provides familiar functionality like grep, cut, and sort but with full awareness of CSV's columnar structure and various dialects, making it a robust and user-friendly solution for data professionals.