parquet-tools
Inspect and manipulate Parquet files
TLDR
Display the content of a Parquet file:
`parquet-tools cat path/to/parquet`
Display the first few lines of a Parquet file:
`parquet-tools head path/to/parquet`
Print the schema of a Parquet file:
`parquet-tools schema path/to/parquet`
Print the metadata of a Parquet file:
`parquet-tools meta path/to/parquet`
Print the content and metadata of a Parquet file:
`parquet-tools dump path/to/parquet`
Concatenate several Parquet files into the target one:
`parquet-tools merge path/to/parquet1 path/to/parquet2 path/to/target_parquet`
Print the count of rows in a Parquet file:
`parquet-tools rowcount path/to/parquet`
Print the column and offset indexes of a Parquet file:
`parquet-tools column-index path/to/parquet`
SYNOPSIS
`parquet-tools <subcommand> [options] <path/to/parquet>`
PARAMETERS
cat
Reads and prints the content of a Parquet file to standard output. Useful for viewing the data rows.
schema
Displays the schema of a Parquet file. Shows the column names, types, and nullability.
meta
Prints detailed metadata of a Parquet file, including information about row groups, columns, and data pages. Essential for debugging.
head
Displays the first N records of a Parquet file. Similar to the Unix `head` command. Default is 10 records.
show
Displays records from a Parquet file, often with more structured output options (e.g., JSON).
dump
Dumps the raw, internal structure of a Parquet file for deep-level debugging.
merge
Merges multiple Parquet files into a single output file.
DESCRIPTION
`parquet-tools` is a command-line utility designed for interacting with Apache Parquet files. It provides various subcommands to inspect, debug, and understand the structure and content of these columnar data files. Users can view schemas, metadata, read data, and perform basic operations, making it an invaluable tool for developers and data engineers working with big data ecosystems that utilize Parquet as a storage format, such as Apache Spark, Hive, or Impala. It helps in validating data, troubleshooting schema evolution issues, and quickly peeking into file contents without needing to load them into a full data processing framework.
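As a sketch of a typical first look at an unfamiliar file (the file name `events.parquet` below is a placeholder), a few subcommands are usually combined:
`parquet-tools schema events.parquet` to check column names and types
`parquet-tools head events.parquet` to sample a handful of records
`parquet-tools meta events.parquet` to review row groups and compression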
CAVEATS
Java Dependency: `parquet-tools` typically requires a Java Runtime Environment (JRE) to be installed on the system, as it's often distributed as a Java JAR application (e.g., invoked via `java -jar parquet-tools-<version>.jar <subcommand>`).
Memory Usage: For very large Parquet files, especially when using subcommands like `cat` or `show` without limits, the tool might consume significant memory or take a long time to process.
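To keep output and processing bounded on large files, a common approach is to prefer `head` over `cat`, or to truncate the stream in the shell (file paths are placeholders); the second form stops once `head` has printed 100 lines:
`parquet-tools head -n 5 path/to/big.parquet`
`parquet-tools cat path/to/big.parquet | head -n 100`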
No Modification: The primary `parquet-tools` suite is designed for inspection and debugging; it does not offer robust capabilities for modifying or writing Parquet files in place. For such operations, a data processing framework like Apache Spark or PyArrow/Pandas is usually required.
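As a minimal sketch of such a rewrite with PyArrow (this assumes the `pyarrow` Python package is installed; both file names are placeholders), a file can be read and written back in one shell command:
`python -c "import pyarrow.parquet as pq; pq.write_table(pq.read_table('in.parquet'), 'out.parquet')"`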
COMMON SUBCOMMAND OPTIONS
Many subcommands support options to control their behavior. For example:
- `parquet-tools head -n 20 path/to/parquet` displays the first 20 records instead of the default.
- `parquet-tools cat --json path/to/parquet` prints each record as JSON.
- `parquet-tools schema --json path/to/parquet` prints the schema in JSON form.
- `parquet-tools show -r path/to/parquet` displays records from the file; note that the exact flags available vary between `parquet-tools` distributions.
FILE PATHS
`parquet-tools` can often read files directly from the local filesystem or, if configured, from distributed file systems like HDFS by providing the full HDFS path (e.g., `hdfs:///user/data/file.parquet`).
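For example, with a Hadoop client configuration available to the tool, the same subcommands accept an HDFS URI directly:
`parquet-tools meta hdfs:///user/data/file.parquet`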
INSTALLATION
`parquet-tools` is typically downloaded as a standalone JAR from Apache mirrors or can be built from source. Some distributions might package a wrapper script for easier invocation.
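As a sketch, assuming the JAR was saved as `~/tools/parquet-tools.jar` (the path is a placeholder), it can be invoked directly or wrapped in a shell alias:
`java -jar ~/tools/parquet-tools.jar schema path/to/parquet`
`alias parquet-tools='java -jar ~/tools/parquet-tools.jar'`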
HISTORY
`parquet-tools` emerged as a part of the Apache Parquet project, which started in 2013 as a columnar storage format optimized for analytical queries. As Parquet gained widespread adoption in the big data ecosystem (especially with Apache Spark, Hive, and Impala), the need for a simple command-line utility to inspect and debug these files became evident. `parquet-tools` fills this gap, providing a quick way to examine file structure and content without requiring a full-fledged data processing framework to be spun up. Its development has been driven by the community around the Apache Parquet project to support its growing usage.