tsv-filter

Filter rows in tab-separated value (TSV) data

TLDR

Print the lines where a specific column is numerically equal to a given number

$ tsv-filter -H --eq [field_name]:[number] [path/to/tsv_file]

Print the lines where a specific column is [eq]ual/[n]ot [e]qual/[l]ess [t]han/[l]ess than or [e]qual/[g]reater [t]han/[g]reater than or [e]qual to a given number

$ tsv-filter --[eq|ne|lt|le|gt|ge] [column_number]:[number] [path/to/tsv_file]

Print the lines where a specific column is [eq]ual/[n]ot [e]qual/part of/not part of a given string

$ tsv-filter --str-[eq|ne|in-fld|not-in-fld] [column_number]:[string] [path/to/tsv_file]

Print the lines where a specific column is not empty

$ tsv-filter --not-empty [column_number] [path/to/tsv_file]

Print the lines where a specific column is empty

$ tsv-filter --invert --not-empty [column_number] [path/to/tsv_file]

Print the lines that satisfy two conditions

$ tsv-filter --eq [column_number1]:[number] --str-eq [column_number2]:[string] [path/to/tsv_file]

Print the lines that match at least one condition

$ tsv-filter --or --eq [column_number1]:[number] --str-eq [column_number2]:[string] [path/to/tsv_file]

Count matching lines, interpreting first line as a [H]eader

$ tsv-filter --count -H --eq [field_name]:[number] [path/to/tsv_file]
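
The commands above can be tried on a small file. The following is a minimal sketch: the file name `sample.tsv` and its contents are made up for illustration, and the `tsv-filter` calls are skipped when the tool is not installed. Referencing fields by header name assumes a tsv-utils release that supports named fields.

```shell
# Build a small sample file (hypothetical data, for illustration only).
printf 'name\tscore\tteam\n'  > sample.tsv
printf 'alice\t72\tred\n'    >> sample.tsv
printf 'bob\t45\tblue\n'     >> sample.tsv
printf 'carol\t90\tred\n'    >> sample.tsv

if command -v tsv-filter >/dev/null 2>&1; then
    # Default AND: both tests must pass (score >= 50 and team equals "red").
    tsv-filter -H --ge score:50 --str-eq team:red sample.tsv

    # --or: a line is printed when at least one test passes.
    tsv-filter -H --or --ge score:50 --str-eq team:blue sample.tsv

    # --count: print the number of matching lines instead of the lines.
    tsv-filter -H --count --lt score:50 sample.tsv
fi
```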

SYNOPSIS

tsv-filter [options] [TSV_FILE...]
tsv-filter [options] < TSV_FILE

-H, --header
    Treats the first line of each input file as a header row. The header of the first file is passed to the output without being filtered; with several input files, the headers of the later files are not repeated.

--or
    Combines the tests with OR instead of the default AND, so a line is printed when at least one test passes.

-v, --invert
    Inverts the filter, printing the lines that do not match.

-c, --count
    Prints only the number of matching lines instead of the lines themselves.

-d CHR, --delimiter CHR
    Sets the field delimiter. The default is TAB.

-V, --version
    Displays the version information of the tsv-filter command and exits.

-h, --help
    Shows a help message detailing command usage and options, then exits.

DESCRIPTION

The tsv-filter command, part of the tsv-utils suite, is a powerful and efficient tool designed for filtering rows from Tab-Separated Values (TSV) files. It allows users to selectively include or exclude rows based on conditions applied to specific columns.

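Filters are given as command-line test options of the form `--TEST FIELD:VALUE`. Tests cover numeric comparisons (`--eq`, `--ne`, `--lt`, `--le`, `--gt`, `--ge`), string comparisons (`--str-eq`, `--str-ne`, `--str-in-fld`, `--str-not-in-fld`), regular expression matching (`--regex`, `--iregex`), empty and blank checks (`--empty`, `--not-empty`, `--blank`, `--not-blank`), and numeric validity checks (`--is-numeric`, `--is-finite`, `--is-nan`, `--is-infinity`). This covers common data cleaning, subsetting, and analysis tasks.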

Its design prioritizes performance, making it well-suited for processing large datasets. It reads from the named files or from standard input and writes to standard output, so it fits into command-line pipelines; use shell redirection to save the result to a file.
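
As a sketch of pipeline use, the example below reads from standard input, saves a result with ordinary shell redirection, and chains two invocations. The file names `requests.tsv` and `errors.tsv` and the data are hypothetical, and the `tsv-filter` calls are skipped when the tool is not installed.

```shell
# Hypothetical request log, for illustration only.
printf 'host\tstatus\tms\n'  > requests.tsv
printf 'web1\t200\t12\n'    >> requests.tsv
printf 'web2\t500\t340\n'   >> requests.tsv
printf 'web3\t503\t45\n'    >> requests.tsv

if command -v tsv-filter >/dev/null 2>&1; then
    # Read from standard input and save the matching lines with redirection.
    tsv-filter -H --ne status:200 < requests.tsv > errors.tsv

    # Chain two invocations: the first passes the header through,
    # so the second can also use -H and field names.
    tsv-filter -H --ge status:500 requests.tsv | tsv-filter -H --gt ms:100
fi
```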

CAVEATS

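All tests on a command line are combined with AND by default, or with OR when `--or` is given; the two cannot be mixed in a single invocation. For more complex logic, chain several tsv-filter commands in a pipeline.
Numeric tests such as `--gt` expect the field to hold a number. A value that cannot be parsed as a number stops the program with an error; placing `--is-numeric` before the numeric test on the same field avoids this. Regular expression tests can be noticeably slower than plain string or numeric tests on very large files.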
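
The sketch below shows one way to guard a numeric test against fields that are empty or hold text. It assumes that tests are evaluated in the order given, so that `--is-numeric` rejects a line before `--gt` tries to parse the field. The file `scores.tsv` and its data are hypothetical, and the `tsv-filter` call is skipped when the tool is not installed.

```shell
# Hypothetical data with one empty and one non-numeric score.
printf 'name\tscore\n'  > scores.tsv
printf 'alice\t72\n'   >> scores.tsv
printf 'dave\t\n'      >> scores.tsv
printf 'erin\tn/a\n'   >> scores.tsv

if command -v tsv-filter >/dev/null 2>&1; then
    # Validate the field first, then compare it numerically.
    tsv-filter -H --is-numeric score --gt score:50 scores.tsv
fi
```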

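FILTER TEST OVERVIEW

tsv-filter has no expression language; each test is a command-line option of the form `--TEST FIELD:VALUE`, or `--TEST FIELD` for tests that take no value:
Numeric comparison: `--eq`, `--ne`, `--lt`, `--le`, `--gt`, `--ge`.
String comparison: `--str-eq`, `--str-ne`, `--str-lt`, `--str-le`, `--str-gt`, `--str-ge`; `--istr-eq` and `--istr-ne` ignore case.
Substring: `--str-in-fld`, `--str-not-in-fld`; `--istr-in-fld` and `--istr-not-in-fld` ignore case.
Regex matching: `--regex`, `--not-regex`; `--iregex` and `--not-iregex` ignore case.
Empty and blank checks: `--empty`, `--not-empty`, `--blank`, `--not-blank`.
Numeric validity: `--is-numeric`, `--is-finite`, `--is-nan`, `--is-infinity`.
Field-to-field comparison: `--ff-eq`, `--ff-ne`, `--ff-lt`, `--ff-le`, `--ff-gt`, `--ff-ge`, `--ff-str-eq`, `--ff-str-ne`, each taking `FIELD1:FIELD2`.
Logical combination: AND by default, `--or` for OR, `--invert` to negate the whole filter.
Fields are referenced by 1-based number (`1`, `2`, ...) or, when `-H` is used, by header name.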
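
On the command line, regular expression, substring, and field-to-field tests are written as options. The following is a minimal sketch with made-up data in a hypothetical file `quarters.tsv`; fields are referenced by number, and the `tsv-filter` calls are skipped when the tool is not installed.

```shell
# Hypothetical data, for illustration only.
printf 'name\tq1\tq2\n'    > quarters.tsv
printf 'alice\t10\t15\n'  >> quarters.tsv
printf 'bob\t20\t5\n'     >> quarters.tsv
printf 'anna\t7\t7\n'     >> quarters.tsv

if command -v tsv-filter >/dev/null 2>&1; then
    # Regular expression test: field 1 starts with "a".
    tsv-filter -H --regex '1:^a' quarters.tsv

    # Field-to-field test: field 3 (q2) is greater than field 2 (q1).
    tsv-filter -H --ff-gt 3:2 quarters.tsv

    # Case-insensitive substring test: field 1 contains "AN" in any case.
    tsv-filter -H --istr-in-fld 1:AN quarters.tsv
fi
```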

HISTORY

tsv-filter is a component of the tsv-utils project (eBay's TSV Utilities), an open-source suite of command-line tools developed at eBay. The project was initiated to provide high-performance utilities specifically designed for manipulating large Tab-Separated Values (TSV) datasets, and the tools are written in the D programming language. It is used by data engineers and analysts working with structured text data.