parquet-tools
Inspect and manipulate Parquet files
TLDR
Display the content of a Parquet file:
`parquet-tools cat path/to/parquet`
Display the first few lines of a Parquet file:
`parquet-tools head path/to/parquet`
Print the schema of a Parquet file:
`parquet-tools schema path/to/parquet`
Print the metadata of a Parquet file:
`parquet-tools meta path/to/parquet`
Print the content and metadata of a Parquet file:
`parquet-tools dump path/to/parquet`
Concatenate several Parquet files into the target one:
`parquet-tools merge path/to/parquet1 path/to/parquet2 path/to/target_parquet`
Print the count of rows in a Parquet file:
`parquet-tools rowcount path/to/parquet`
Print the column and offset indexes of a Parquet file:
`parquet-tools column-index path/to/parquet`
SYNOPSIS
`parquet-tools <subcommand> [options] <path/to/parquet>`
PARAMETERS
cat
Reads and prints the content of a Parquet file to standard output. Useful for viewing the data rows.
schema
Displays the schema of a Parquet file. Shows the column names, types, and nullability.
meta
Prints detailed metadata of a Parquet file, including information about row groups, columns, and data pages. Essential for debugging.
head
Displays the first N records of a Parquet file. Similar to the Unix `head` command. Default is 10 records.
show
Displays records from a Parquet file, often with more structured output options (e.g., JSON).
dump
Dumps the raw, internal structure of a Parquet file for deep-level debugging.
merge
Merges multiple Parquet files into a single output file.
DESCRIPTION
`parquet-tools` is a command-line utility designed for interacting with Apache Parquet files. It provides various subcommands to inspect, debug, and understand the structure and content of these columnar data files. Users can view schemas, metadata, read data, and perform basic operations, making it an invaluable tool for developers and data engineers working with big data ecosystems that utilize Parquet as a storage format, such as Apache Spark, Hive, or Impala. It helps in validating data, troubleshooting schema evolution issues, and quickly peeking into file contents without needing to load them into a full data processing framework.
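As a sketch of a typical first look at an unfamiliar file (the file name `events.parquet` below is a placeholder), a few subcommands are usually combined:
`parquet-tools schema events.parquet` to check column names and types
`parquet-tools head events.parquet` to sample a handful of records
`parquet-tools meta events.parquet` to review row groups and compression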
CAVEATS
Java Dependency: `parquet-tools` typically requires a Java Runtime Environment (JRE) to be installed on the system, as it's often distributed as a Java JAR application (e.g., invoked via `java -jar parquet-tools-<version>.jar <subcommand>`).
Memory Usage: For very large Parquet files, especially when using subcommands like `cat` or `show` without limits, the tool might consume significant memory or take a long time to process.
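To keep output and processing bounded on large files, a common approach is to prefer `head` over `cat`, or to truncate the stream in the shell (file paths are placeholders); the second form stops once `head` has printed 100 lines:
`parquet-tools head -n 5 path/to/big.parquet`
`parquet-tools cat path/to/big.parquet | head -n 100`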
No Modification: The primary `parquet-tools` suite is designed for inspection and debugging; it does not offer robust capabilities for modifying or writing Parquet files in place. For such operations, a data processing framework like Apache Spark or PyArrow/Pandas is usually required.
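As a minimal sketch of such a rewrite with PyArrow (this assumes the `pyarrow` Python package is installed; both file names are placeholders), a file can be read and written back in one shell command:
`python -c "import pyarrow.parquet as pq; pq.write_table(pq.read_table('in.parquet'), 'out.parquet')"`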
COMMON SUBCOMMAND OPTIONS
Many subcommands support options to control their behavior. For example:
- `parquet-tools head -n 20 path/to/parquet` displays the first 20 records instead of the default.
- `parquet-tools cat --json path/to/parquet` prints each record as JSON.
- `parquet-tools schema --json path/to/parquet` prints the schema in JSON form.
- `parquet-tools show -r path/to/parquet` displays records from the file; note that the exact flags available vary between `parquet-tools` distributions.
FILE PATHS
`parquet-tools` can often read files directly from the local filesystem or, if configured, from distributed file systems like HDFS by providing the full HDFS path (e.g., `hdfs:///user/data/file.parquet`).
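For example, with a Hadoop client configuration available to the tool, the same subcommands accept an HDFS URI directly:
`parquet-tools meta hdfs:///user/data/file.parquet`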
INSTALLATION
`parquet-tools` is typically downloaded as a standalone JAR from Apache mirrors or can be built from source. Some distributions might package a wrapper script for easier invocation.
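As a sketch, assuming the JAR was saved as `~/tools/parquet-tools.jar` (the path is a placeholder), it can be invoked directly or wrapped in a shell alias:
`java -jar ~/tools/parquet-tools.jar schema path/to/parquet`
`alias parquet-tools='java -jar ~/tools/parquet-tools.jar'`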
HISTORY
`parquet-tools` emerged as a part of the Apache Parquet project, which started in 2013 as a columnar storage format optimized for analytical queries. As Parquet gained widespread adoption in the big data ecosystem (especially with Apache Spark, Hive, and Impala), the need for a simple command-line utility to inspect and debug these files became evident. `parquet-tools` fills this gap, providing a quick way to examine file structure and content without requiring a full-fledged data processing framework to be spun up. Its development has been driven by the community around the Apache Parquet project to support its growing usage.