datamash
Perform basic numeric/string grouping calculations
TLDR
Get max, min, mean and median of a single column of numbers
Get the mean of a single column of float numbers (the decimal separator follows the current locale, so some locales expect "," instead of ".")
Get the mean of a single column of numbers with a given decimal precision
Get the mean of a single column of numbers ignoring "Na" and "NaN" (literal) strings
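The tasks above might be run as follows. This is a minimal sketch that assumes GNU datamash is on the PATH; the sample numbers are made up and generated inline with seq and printf. The comma-decimal case is left commented out because it assumes a de_DE.UTF-8 locale is installed, which may not hold on a given system.

```shell
# Max, min, mean and median of the numbers 1..3 (one per line).
stats=$(seq 3 | datamash max 1 min 1 mean 1 median 1)
echo "$stats"

# Mean rounded to 2 decimal places with -R.
# LC_ALL=C pins "." as the decimal point for this run.
rounded=$(printf '1\n2\n3\n4\n5\n5\n' | LC_ALL=C datamash -R 2 mean 1)
echo "$rounded"

# Mean that skips NA/NaN entries with --narm.
clean=$(printf '1\n2\nNA\n3\nNaN\n' | datamash --narm mean 1)
echo "$clean"

# Comma-decimal input; assumes the de_DE.UTF-8 locale exists (an assumption).
# printf '1,0\n2,5\n' | LC_ALL=de_DE.UTF-8 datamash mean 1
```

Results are written TAB-separated on a single line when several operations are given.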
SYNOPSIS
datamash [OPTIONS] operation [field...]
Example: datamash -s -g 1 sum 2 < input.txt
PARAMETERS
-h, --help
Show help message and exit.
-V, --version
Show version information and exit.
-t, --field-separator=X
Use X instead of TAB as field delimiter.
-W, --whitespace
Use whitespace (one or more spaces/tabs) for field delimiters. Overrides -t.
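LC_NUMERIC (locale setting, not a command-line option)
The decimal separator is taken from the current locale rather than from a dedicated flag; run datamash under a comma-decimal locale to parse European notation such as "1,5".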
-s, --sort
Sort the input before grouping. Necessary when input is not pre-sorted on the group fields.
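-H, --headers
Treat the first input line as column headers and print column headers as the first output line (same as --header-in --header-out). Without this option the first line is treated as data.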
-i, --ignore-case
Ignore case when comparing strings (grouping or string operations).
-f, --full
Print the entire input line before the operation results (default: print only the grouping keys).
-g, --group=X[,Y,Z]
Group via fields X, Y, Z. Can be given multiple times.
operations (e.g., sum, min, max, mean, median, count, first, last...)
Operation to perform, followed by the field number(s) it applies to. See the manual page for the full list of operations.
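A few of these options combined on small inline samples. This is a sketch that assumes GNU datamash is installed; the data is invented for illustration, and long option names are used where available.

```shell
# -t: comma-separated input; -s sorts so equal keys are adjacent;
# -g 1 groups on field 1; "sum 2" adds up field 2 within each group.
csv=$(printf 'b,2\na,1\nb,5\na,3\n' | datamash -t, -s -g 1 sum 2)
echo "$csv"

# --whitespace: fields separated by runs of spaces and/or tabs.
ws=$(printf 'a   1\na 2\n' | datamash --whitespace -g 1 sum 2)
echo "$ws"

# --header-in: the first input line holds column names and is not summed.
hdr=$(printf 'name\tval\na\t1\na\t2\n' | datamash --header-in -g 1 sum 2)
echo "$hdr"
```

With -t the output uses the same delimiter as the input; with whitespace-delimited input the output fields are separated by TAB.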
DESCRIPTION
Datamash is a command-line tool that performs basic numeric, string, and statistical computations. It is designed to be a *'data cruncher'*: it reads tabular input (typically delimited text files, TAB-separated by default) and prints summary statistics. Unlike general-purpose tools such as `awk` or `R`, Datamash focuses on grouped aggregation, running each calculation per group, where groups are defined by one or more fields in the input. Supported operations include sum, min, max, mean, median, count, first, last, and more. Because it reads delimited text from standard input, it fits well into shell pipelines for analysis, reporting, and data cleaning.
CAVEATS
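Datamash assumes well-formed, consistent input: a non-numeric value in a field given to a numeric operation, or a line with a different number of fields, causes an error. Grouping combines only adjacent lines that share a key, so input that is not already sorted on the group fields must be sorted first (with -s or an external sort); otherwise the same key is reported as several groups. Sorting adds processing time on very large inputs. Datamash covers basic summary statistics and may not be suitable for complex statistical analyses.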
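The effect of sorting on grouping can be seen on a three-line sample. This is a sketch assuming GNU datamash is installed; the input values are made up.

```shell
# Without -s, each run of adjacent lines with the same key is its own group,
# so key "a" is reported twice.
unsorted=$(printf 'a\t1\nb\t2\na\t3\n' | datamash -g 1 sum 2)
echo "$unsorted"

# With -s the input is sorted first, so "a" is summed once.
sorted=$(printf 'a\t1\nb\t2\na\t3\n' | datamash -s -g 1 sum 2)
echo "$sorted"
```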
EXIT STATUS
Datamash returns 0 on success, and non-zero on errors, such as incorrect command-line options, missing input data, or invalid data types.
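In a script, the exit status can be checked to catch bad input. A minimal sketch, assuming GNU datamash is installed; the non-numeric sample value is deliberate.

```shell
# "abc" is not a number, so the numeric operation fails.
# The pipeline's exit status is that of its last command (datamash).
printf 'abc\n' | datamash sum 1 2>/dev/null
status=$?

if [ "$status" -eq 0 ]; then
    echo "ok"
else
    echo "datamash failed with exit status $status" >&2
fi
```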
FIELD NUMBERS
Field numbers are 1-based, not 0-based. The same field can be given to several operations to run multiple calculations on it (for example, sum 2 mean 2). Field ranges such as 2-4 are also supported.
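Both forms can be tried on a small two-row table. This is a sketch assuming a reasonably recent GNU datamash (field ranges are not available in very old releases); the numbers are illustrative.

```shell
# Two operations on the same field: sum and mean of field 1.
both=$(printf '1\t2\t3\n3\t4\t5\n' | datamash sum 1 mean 1)
echo "$both"

# A field range: one sum per field, for fields 1 through 3.
range=$(printf '1\t2\t3\n3\t4\t5\n' | datamash sum 1-3)
echo "$range"
```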