datamash

Perform basic numeric/string grouping calculations

TLDR

Get max, min, mean and median of a single column of numbers

$ seq 3 | datamash max 1 min 1 mean 1 median 1

Get the mean of a single column of float numbers (floats must use "," and not ".")

$ echo -e '1.0\n2.5\n3.1\n4.3\n5.6\n5.7' | tr '.' ',' | datamash mean 1

Get the mean of a single column of numbers with a given decimal precision

$ echo -e '1\n2\n3\n4\n5\n5' | datamash [[-R|--round]] [number_of_decimals_wanted] mean 1

Get the mean of a single column of numbers ignoring "Na" and "NaN" (literal) strings

$ echo -e '1\n2\nNa\n3\nNaN' | datamash --narm mean 1

datamash [OPTIONS] OPERATION [FIELD...]

Examples of OPERATION:
sum, mean, median, min, max, count, distinct, range, stddev, var, concat, unique.

Examples of FIELD:
1, 2, 3, 1-3, ALL, FIELD_NAME (with -H).

PARAMETERS

-H, --header-row
    Treat the first line of input as a header row. datamash will output the header row before the results.

-W, --whitespace
    Use whitespace (one or more spaces or tabs) as the field delimiter instead of the default single tab character.

-t CHAR, --field-delimiter=CHAR
    Use CHAR as the field delimiter for input and output. For example, -t, for CSV files.

-g N[,M...], --group=N[,M...]
    Group input lines by the values in fields N, M, etc. All operations will be performed per group. Input must be sorted by the group fields unless -s is used.

-f, --full
    Print all input lines along with the calculated results. Results are appended as new columns to the end of each input line.

-s, --sort
    Sort the input internally by the group fields before processing. Useful when input is not pre-sorted; can be memory-intensive for large files.

--no-strict-header
    Do not strictly check for header compatibility when using aggregate operations with --header-row.

--skip-header-row
    Skip the header row from input, but do not print it to output unless --full is specified.

--filler=STR
    Replace empty fields with STR. Default is empty string.

DESCRIPTION

datamash is a powerful and efficient command-line utility designed for performing basic numeric, textual, and statistical operations on tabular text data. It reads input from standard input or files, processes data by grouping or by column, and outputs the results. It's particularly useful for data aggregation, transformation, and statistical analysis on CSV-like or tab-separated data. Common use cases include calculating sums, averages, medians, minimums, maximums, and counts of fields, as well as concatenating or finding distinct values within groups. datamash simplifies tasks that would otherwise require more complex scripting with tools like awk, making it a go-to for quick data summaries and reports in shell pipelines. It supports various delimiters and can handle header rows.

CAVEATS

When using the -g (group by) option, input data must be sorted by the group fields. If the input is not sorted, datamash will produce incorrect results. You can use the -s option to internally sort the data, but for very large files, it might be more efficient to pre-sort the data using sort(1) before piping it to datamash. Numeric operations require valid numerical data; non-numeric values in numeric fields will typically be skipped or cause errors.

COMMON OPERATIONS

datamash supports a wide array of operations. Some of the most frequently used include:
sum: Calculates the sum of values in the specified fields.
mean: Computes the arithmetic mean (average) of values.
median: Finds the median value.
min: Determines the minimum value.
max: Determines the maximum value.
count: Counts the number of non-empty values.
distinct: Counts the number of unique non-empty values.
concat: Concatenates values in the specified fields with a comma (or custom) separator.
unique: Prints unique values in the specified fields, one per line.

USAGE WITH PIPELINES

datamash is typically used as part of a shell pipeline. It reads data from standard input (or files) and writes results to standard output, making it highly composable with other Unix commands like cat, grep, sort, cut, and awk. This allows for complex data processing workflows by chaining simple, focused commands.

HISTORY

datamash is a relatively modern utility within the GNU Project, specifically designed to address common data aggregation and transformation tasks that often become cumbersome with traditional Unix tools like awk or a combination of cut, sort, and uniq. Its development aimed to provide a dedicated, efficient, and user-friendly command-line tool for statistical calculations on tabular data, simplifying complex data manipulation pipelines for system administrators and data analysts.