csvstat

Summarize and analyze CSV data

TLDR

Show all stats for all columns

$ csvstat [data.csv]

Show all stats for columns 2 and 4
$ csvstat -c [2,4] [data.csv]

Show sums for all columns
$ csvstat --sum [data.csv]

Show the max value length for column 3
$ csvstat -c [3] --len [data.csv]

Show the number of unique values in the "name" column
$ csvstat -c [name] --unique [data.csv]

SYNOPSIS

csvstat [OPTIONS] [FILE]

PARAMETERS

-c, --columns
    Specify the columns to analyze by name, 1-based index, or range (e.g., 1,3-5). Multiple columns are comma-separated; all columns are analyzed by default.

-n, --names
    Print column names and their 1-based indices, then exit.

--type
    Display the inferred data type for each column (e.g., Text, Number, Date).

--nulls
    Report whether each column contains null values.

--unique
    Show the number of unique values in each column.

--freq
    Display the most frequent values in each column together with their counts.

--count
    Show the total number of rows.

--min, --max
    Display the minimum and maximum values for numeric or date/time columns.

--sum, --mean, --median, --stdev
    Compute the sum, arithmetic mean, median, and standard deviation for numeric columns.

--q1, --q3, --qrange
    Calculate the first quartile (25th percentile), third quartile (75th percentile), and interquartile range (Q3 - Q1).

--len
    Show the length of the longest value in each column.

--all
    Display all available statistics for each column.

-H, --no-header-row
    Treat the first row of the input file as data rather than as a header row.

--encoding
    Specify the character encoding of the input CSV file (e.g., utf-8, latin-1).

--skip-lines
    Skip the given number of lines at the start of the input (for example, a preamble before the header row) before processing.
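
As a rough illustration of how these options combine, the sketch below writes two small sample files and runs a few single-statistic queries. The file names (`sales.csv`, `report.csv`), column names, and values are invented for this example, and the `command -v` guard only skips the calls when csvkit is not installed.

```shell
#!/bin/sh
# Hypothetical sample data, made up for this example.
workdir=$(mktemp -d)
csv="$workdir/sales.csv"
cat > "$csv" <<'EOF'
region,units,price
north,10,2.50
south,4,3.00
east,7,2.75
EOF

# A second hypothetical file with one preamble line before the header row.
report="$workdir/report.csv"
printf 'Exported report\nregion,units\nnorth,10\nsouth,4\n' > "$report"

if command -v csvstat >/dev/null 2>&1; then
    # List column names with their indices, then exit.
    csvstat -n "$csv"
    # Sum of one column, selected by name.
    csvstat -c units --sum "$csv"
    # Mean of two columns, selected by position.
    csvstat -c 2,3 --mean "$csv"
    # Skip the preamble line so that the second line is read as the header.
    csvstat --skip-lines 1 -c units --max "$report"
else
    echo "csvstat not found; install csvkit to run these examples" >&2
fi
```

Selecting a column with -c and adding one statistic option keeps the output to a single value per column, which is convenient in scripts.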

DESCRIPTION

csvstat is a command-line utility from the csvkit suite that prints summary statistics for each column of a CSV file. It helps users understand the structure and content of tabular data without importing it into a spreadsheet or database.

csvstat infers a data type for each column and reports statistics suited to that type, including the number of rows, null values, unique values, common values and their frequencies, minimum and maximum values, sum, mean, median, and standard deviation. This makes it useful for data exploration, quality assurance, and initial data profiling: outliers, missing data, and distribution characteristics can be identified directly from the terminal.
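
A minimal sketch of the default behavior described above, assuming csvkit is installed. The file `people.csv` and its contents are hypothetical and exist only to give csvstat a text, a numeric, and a date-like column to report on.

```shell
#!/bin/sh
# Hypothetical sample data: one text column, one numeric column with a
# missing value, and one date-like column.
workdir=$(mktemp -d)
people="$workdir/people.csv"
cat > "$people" <<'EOF'
name,age,joined
Ada,36,2020-01-15
Linus,,2019-06-01
Grace,45,2021-03-20
EOF

if command -v csvstat >/dev/null 2>&1; then
    # With no statistic option, csvstat prints a full report per column.
    csvstat "$people"
    # The same report, reading the data from standard input.
    csvstat < "$people"
else
    echo "csvstat not found; install csvkit to run these examples" >&2
fi
```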

CAVEATS

Data type inference can be imperfect, especially with mixed data types or ambiguous strings such as identifiers with leading zeros. Performance can degrade on very large datasets, particularly when calculating medians or value frequencies, because these operations may require loading significant data into memory. csvstat is designed for well-formed CSV files; malformed input may produce unexpected results or errors.
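
One way to work around these caveats, sketched under the assumption that the data has no line breaks inside quoted fields: check the inferred types before trusting numeric statistics, and profile only the first rows of a large file. The file `orders.csv` and its `zip` column are hypothetical; a code column with leading zeros is a typical case where the inferred type is worth checking.

```shell
#!/bin/sh
# Hypothetical sample data with an ambiguous column (zip codes).
workdir=$(mktemp -d)
orders="$workdir/orders.csv"
cat > "$orders" <<'EOF'
id,zip,amount
1,02134,10.50
2,90210,7.25
3,10001,3.00
EOF

if command -v csvstat >/dev/null 2>&1; then
    # Show only the inferred type of each column.
    csvstat --type "$orders"
    # Profile a sample: the header row plus at most the first 1000 data rows.
    # This assumes no field contains an embedded newline.
    head -n 1001 "$orders" | csvstat
else
    echo "csvstat not found; install csvkit to run these examples" >&2
fi
```

Statistics computed on a sample describe only that sample, so they are best treated as a first look rather than a summary of the whole file.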

DATA PROFILING UTILITY

csvstat is well suited to a first pass over an unfamiliar dataset, giving a quick view of data quality, completeness, and distribution. Running it early in an analysis workflow helps identify missing values, inconsistent formats, or unexpected ranges before more complex processing begins.
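
The kind of quick quality check described here might look like the following sketch. The file `contacts.csv`, its columns, and the planted problems (a missing email, a repeated email, inconsistent country casing) are all invented for illustration.

```shell
#!/bin/sh
# Hypothetical sample data with deliberate quality problems.
workdir=$(mktemp -d)
contacts="$workdir/contacts.csv"
cat > "$contacts" <<'EOF'
id,country,email
1,US,a@example.com
2,us,
3,DE,c@example.com
4,US,a@example.com
EOF

if command -v csvstat >/dev/null 2>&1; then
    # Completeness: does the email column contain nulls?
    csvstat -c email --nulls "$contacts"
    # Uniqueness: compare the unique count of id with the number of rows.
    csvstat -c id --unique "$contacts"
    # Consistency: frequent values expose variants such as "US" and "us".
    csvstat -c country --freq "$contacts"
else
    echo "csvstat not found; install csvkit to run these examples" >&2
fi
```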

HISTORY

csvkit, which includes csvstat, was created by Christopher Groskopf and first released in 2012. It emerged as a suite of command-line tools built on Python to simplify common CSV data manipulation tasks, providing a more robust and CSV-aware alternative to generic text processing tools for structured data. Its development has focused on ease of use, data type inference, and extensibility, becoming a popular choice for data journalists, analysts, and developers working with tabular data in a command-line environment.

SEE ALSO

csvlook(1), csvcut(1), csvgrep(1), csvjoin(1), datamash(1), awk(1)
