datamash
Perform basic numeric, textual, and statistical operations on tabular text data
TLDR
Get max, min, mean and median of a single column of numbers
Get the mean of a single column of floating-point numbers (the decimal separator must match the current locale, e.g. "." in the default C locale)
Get the mean of a single column of numbers with a given decimal precision
Get the mean of a single column of numbers ignoring "Na" and "NaN" (literal) strings
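The corresponding invocations are sketched below; file.txt and the precision value are placeholders, and the --round and --narm options require a reasonably recent datamash release:
    datamash max 1 min 1 mean 1 median 1 < file.txt    # max, min, mean and median of column 1
    datamash mean 1 < file.txt                         # mean of column 1
    datamash --round 3 mean 1 < file.txt               # mean rounded to 3 decimal places
    datamash --narm mean 1 < file.txt                  # mean, skipping NA/NaN values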
SYNOPSIS
datamash [OPTION]... OPERATION [FIELD] [OPERATION FIELD ...]
Examples of OPERATION:
sum, mean, median, min, max, count, countunique, range, sstdev, svar, collapse, unique.
Examples of FIELD:
1, 2, 3, 1-3, ALL, FIELD_NAME (with -H).
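For illustration, operations and fields are given as repeatable pairs; numbers.tsv is a hypothetical tab-separated file, and referring to a field by name requires -H plus a datamash version with field-name support:
    datamash sum 1 mean 2 < numbers.tsv    # sum field 1, mean of field 2
    datamash sum 1-3 < numbers.tsv         # field range: sum fields 1 through 3
    datamash -H sum price < numbers.tsv    # field referenced by its header name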
PARAMETERS
-H, --headers
Treat the first line of input as a header row and print a matching header row before the results (equivalent to --header-in --header-out).
-W, --whitespace
Use whitespace (one or more spaces or tabs) as the field delimiter instead of the default single tab character.
-t CHAR, --field-separator=CHAR
Use CHAR as the field separator for input and output instead of the default TAB. For example, -t, for CSV files.
-g N[,M...], --group=N[,M...]
Group input lines by the values of fields N, M, etc.; all operations are performed per group. Input must already be sorted by the group fields unless -s is used (see the combined example after this list).
-f, --full
Print the entire input line before the operation results, instead of printing only the group key fields.
-s, --sort
Sort the input internally by the group fields before processing. Useful when input is not pre-sorted; can be memory-intensive for large files.
--no-strict
Allow lines with a varying number of fields; by default, datamash exits with an error when the number of fields changes between lines.
--header-in
Treat the first input line as column headers; the header is consumed and is not printed unless --header-out is also given.
--filler=STR
Fill missing values with STR (default: N/A), for example when transposing or reversing lines with differing numbers of fields.
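A combined sketch of several of the options above, assuming a hypothetical sales.csv whose header line is "region,amount":
    # group rows by field 1 (region) and sum field 2 (amount);
    # -s sorts internally, -t, sets the CSV delimiter, -H handles the header row
    datamash -s -t, -H -g 1 sum 2 < sales.csv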
DESCRIPTION
datamash is an efficient command-line utility for performing basic numeric, textual, and statistical operations on tabular text data. It reads input from standard input or from files, processes the data by column and optionally by group, and writes the results to standard output. It is particularly useful for data aggregation, transformation, and statistical analysis of CSV-like or tab-separated data: common use cases include calculating sums, means, medians, minimums, maximums, and counts of fields, as well as collapsing or counting distinct values within groups. datamash simplifies tasks that would otherwise require more elaborate scripting with tools such as awk, making it well suited to quick data summaries and reports in shell pipelines. It supports various field delimiters and can handle header rows.
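As a minimal self-contained example:
    seq 10 | datamash sum 1 mean 1
    # output (tab-separated): 55    5.5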
CAVEATS
When using the -g (group by) option, input data must be sorted by the group fields. If the input is not sorted, datamash treats each consecutive run of identical key values as a separate group, which usually produces unintended results. You can use the -s option to sort the data internally, but for very large files it might be more efficient to pre-sort the data using sort(1) before piping it to datamash. Numeric operations require valid numerical data; non-numeric values in numeric fields will typically be skipped or cause errors.
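For example, the two pipelines below are equivalent ways of grouping a hypothetical tab-separated data.tsv on field 1:
    sort -k1,1 data.tsv | datamash -g 1 sum 2    # pre-sort with sort(1)
    datamash -s -g 1 sum 2 < data.tsv            # let datamash sort internally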
COMMON OPERATIONS
datamash supports a wide array of operations. Some of the most frequently used include:
sum: Calculates the sum of values in the specified fields.
mean: Computes the arithmetic mean (average) of values.
median: Finds the median value.
min: Determines the minimum value.
max: Determines the maximum value.
count: Counts the number of values in the group or field.
countunique: Counts the number of unique values.
collapse: Produces a comma-separated list of all values in the specified field.
unique: Produces a comma-separated, sorted list of the unique values in the specified field.
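A small sketch combining several of these operations on whitespace-delimited input:
    printf 'a 1\na 2\nb 5\n' | datamash -W -g 1 sum 2 collapse 2 countunique 2
    # expected output (tab-separated):
    # a    3    1,2    2
    # b    5    5      1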
USAGE WITH PIPELINES
datamash is typically used as part of a shell pipeline. It reads data from standard input (or files) and writes results to standard output, making it highly composable with other Unix commands like cat, grep, sort, cut, and awk. This allows for complex data processing workflows by chaining simple, focused commands.
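For instance, a hypothetical scores.txt containing "name score" pairs could be summarized as:
    grep -v '^#' scores.txt | sort -k1,1 | datamash -W -g 1 count 2 mean 2 max 2
This filters out comment lines, sorts by name, and then reports the count, mean, and maximum score per name.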
HISTORY
datamash is a relatively modern utility within the GNU Project, specifically designed to address common data aggregation and transformation tasks that often become cumbersome with traditional Unix tools like awk or a combination of cut, sort, and uniq. Its development aimed to provide a dedicated, efficient, and user-friendly command-line tool for statistical calculations on tabular data, simplifying complex data manipulation pipelines for system administrators and data analysts.