LinuxCommandLibrary

parquet

TLDR

Show file schema

$ parquet-tools schema [file.parquet]
Show metadata
$ parquet-tools meta [file.parquet]
Show first rows
$ parquet-tools head [file.parquet]
Convert to JSON
$ parquet-tools cat --json [file.parquet]
Show row count
$ parquet-tools rowcount [file.parquet]
Merge files
$ parquet-tools merge [file1.parquet] [file2.parquet] [output.parquet]

SYNOPSIS

parquet-tools command [options] file

DESCRIPTION

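Apache Parquet is a columnar storage format designed for analytics on large datasets. parquet-tools (and the similar parquet-cli) is a command-line utility that inspects and manipulates Parquet files: it prints their schema, metadata, and contents.
Because the values of each column are stored together, Parquet can compress and encode data efficiently for analytics workloads.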

PARAMETERS

schema

Show schema.
meta
Show metadata.
head
Show first rows.
cat
Output all rows.
rowcount
Count rows.
merge
Merge files.
--json
JSON output.
-n num
Number of rows to show (used with head).
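
As a rough sketch of how the synopsis and the options above fit together, the following Python helper assembles a parquet-tools invocation and runs it with the standard library. The function names build_command and run are hypothetical and exist only for this illustration; the only flags used are --json and -n as listed above, and whether a given subcommand accepts them depends on the installed parquet-tools version.

```python
import shutil
import subprocess


def build_command(command, path, json_output=False, num_rows=None):
    """Build the argument list: parquet-tools command [options] file.

    Hypothetical helper; it only arranges the options listed above.
    """
    args = ["parquet-tools", command]
    if json_output:
        args.append("--json")            # JSON output
    if num_rows is not None:
        args += ["-n", str(num_rows)]    # number of rows
    args.append(path)
    return args


def run(command, path, **options):
    """Run parquet-tools and return its standard output as text."""
    if shutil.which("parquet-tools") is None:
        raise FileNotFoundError("parquet-tools was not found on PATH")
    result = subprocess.run(
        build_command(command, path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

For example, run("head", "file.parquet", num_rows=5) would execute parquet-tools head -n 5 file.parquet, assuming the tool is installed.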

PARQUET FEATURES

- Columnar storage
- Schema embedded
- Compression (Snappy, GZIP, etc.)
- Predicate pushdown
- Nested data support

PYTHON ALTERNATIVE

import pyarrow.parquet as pq
table = pq.read_table('file.parquet')
print(table.schema)

CAVEATS

The Java-based tools require a JVM. For Python workflows, consider pyarrow instead. Processing large files can require a lot of memory.

HISTORY

Apache Parquet was created in 2013 as a collaboration between Twitter and Cloudera to provide efficient storage for big data.

SEE ALSO
