datamash
Perform basic numeric/string grouping calculations
TLDR
Get max, min, mean, and median of a single column of numbers
Get the mean of a single column of float numbers (the decimal separator follows the locale, so in locales that use "," floats must be written with "," and not ".")
Get the mean of a single column of numbers with a given decimal precision
Get the mean of a single column of numbers ignoring "Na" and "NaN" (literal) strings
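The tasks listed above might look like the following in practice. This is a sketch using inline sample data generated with seq and printf rather than a real file; the numbers are made up for illustration. The float example forces LC_ALL=C so that "." is accepted as the decimal separator regardless of the system locale.

```shell
# Max, min, mean, and median of the numbers 1..3 (one per line).
seq 3 | datamash max 1 min 1 mean 1 median 1

# Mean of float values; LC_ALL=C makes "." the decimal separator.
printf '1.5\n2.5\n' | LC_ALL=C datamash mean 1

# Mean with numeric output rounded to 2 decimal places.
printf '1\n2\n3\n4\n5\n5\n' | LC_ALL=C datamash -R 2 mean 1

# Mean that skips the literal NA and NaN entries instead of failing on them.
printf '1\n2\nNA\n3\nNaN\n' | datamash --narm mean 1
```

Each command reads from standard input, so the same operations can be applied to a file with input redirection.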
SYNOPSIS
datamash [OPTION...] OPERATION FIELD [OPERATION FIELD...]
PARAMETERS
-t, --field-separator=CHAR
Use CHAR as the field delimiter instead of TAB (the default)
-g, --group=X[,Y,Z]
Group by field X, or by several comma-separated fields (for example -g 1,2)
--groups[=K]
Alias for --group
-s, --sort
Sort input before grouping, so unsorted input does not have to be piped through sort first
--narm
Skip NA and NaN values instead of treating them as invalid input
--header-in
Treat the first input line as column headers rather than data
--header-out
Print column headers as the first line of output
-H, --headers
Same as --header-in --header-out
-R, --round=N
Round numeric output to N decimal places
-f, --full
Print the entire input line before the operation results (default: print only the group keys)
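-W, --whitespace
Use whitespace (one or more spaces and/or tabs) as the field delimiter
-i, --ignore-case
Ignore upper/lower case when comparing text, which affects grouping and string operations
--output-delimiter=X
Use X as the output field delimiter (default: same as the input delimiter)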
-h, --help
Display help and exit
-V, --version
Output version info and exit
DESCRIPTION
Datamash is a command-line utility for performing statistical computations on tabular data. It reads from standard input, treating each line as a record split into fields by TAB (the default), whitespace (-W), or a custom delimiter (-t). Users specify operations such as sum, mean, min, max, median, and standard deviation, optionally grouping results by one or more key fields.
It aggregates data without requiring a scripting language such as awk or Perl. For example, datamash -g 1 mean 2 computes the average of column 2 for each distinct value in column 1. It can also sort input before grouping, read and write header lines, and round numeric output, which suits data analysis pipelines, TSV and simple CSV processing, and quick statistics on logs or datasets.
Unlike general-purpose statistics packages, datamash focuses on speed and simplicity for common aggregations, processing its input one group at a time.
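As an illustration of grouped aggregation, the sketch below averages column 2 for each key in column 1. The tab-separated sample data is hypothetical and is already sorted by key, so no -s is needed.

```shell
# Column 1 is the group key, column 2 the value (TAB-separated).
printf 'a\t1\na\t3\nb\t10\nb\t20\n' | datamash -g 1 mean 2
```

The output has one line per key: the key followed by the mean of its values.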
CAVEATS
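Does not parse quoted CSV fields; every occurrence of the delimiter splits a field. Grouping treats only consecutive lines with the same key as one group, so input must already be sorted by the group fields unless -s is used. Limited to predefined operations; no custom functions.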
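The sorting caveat can be seen with a small unsorted sample (hypothetical data, TAB-separated). This sketch runs the same grouped sum with and without -s.

```shell
# Without -s: each run of equal keys is a separate group, so "a" appears twice.
printf 'a\t1\nb\t2\na\t3\n' | datamash -g 1 sum 2

# With -s: the input is sorted first, so each key yields exactly one group.
printf 'a\t1\nb\t2\na\t3\n' | datamash -s -g 1 sum 2
```

The first command reports "a" on two separate lines (sums 1 and 3), while the second reports a single "a" line with the combined sum of 4.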
COMMON OPERATIONS
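sum FIELD, mean FIELD, min FIELD, max FIELD, median FIELD, count FIELD, unique FIELD, sstdev FIELD (sample standard deviation), pstdev FIELD (population standard deviation), first FIELD, last FIELD. FIELD is a column number, or a column name when the input has a header line (--header-in or -H).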
Example: datamash -g 1 mean 2
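Several operations can be chained in one invocation, and with a header line the fields can be referred to by name. The following is a sketch on hypothetical data; the column names "name" and "score" are invented for the example.

```shell
# -H reads the header line and also prints a header in the output.
# Each group line carries: key, count, sum, and mean of the "score" column.
printf 'name\tscore\nann\t4\nann\t6\nbob\t5\n' |
  datamash -H -g name count score sum score mean score
```

The results appear in the order the operations were given, after the group key.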
EXAMPLE USAGE
datamash sum 2 < data.txt sums column 2.
datamash -t, -g 1 -s count 1 < csvfile.csv counts the rows in each group of column 1 in a comma-separated file.
HISTORY
Developed by Assaf Gordon; first released in 2014 as a GNU project to provide efficient, standalone stats without scripting dependencies. Integrated into many distros post-2015.


