parquet-tools
Inspect and manipulate Parquet files
TLDR
Display the content of a Parquet file
Display the first few rows of a Parquet file
Print the schema of a Parquet file
Print the metadata of a Parquet file
Print the content and metadata of a Parquet file
Merge several Parquet files into a single target file
Print the count of rows in a Parquet file
Print the column and offset indexes of a Parquet file
SYNOPSIS
parquet-tools <command> [options] <file>
PARAMETERS
cat
Displays the data within the Parquet file in a textual format (e.g., JSON). Often used with options like `--max-rows`.
head
Displays the first few rows of the Parquet file. Useful for quickly sampling the data.
schema
Prints the schema of the Parquet file, including column names, data types, and nullability information.
meta
Displays the Parquet file's metadata, including statistics, compression algorithms, and other relevant details.
rowcount
Prints the number of rows present in the Parquet file.
dump
Dumps detailed internal information about the Parquet file.
--json
Outputs data or metadata in JSON format when used with `cat` or `meta`.
--max-rows
Limits the number of rows printed by the `cat` or `head` command.
--disable-color
Disables colored output in the terminal.
--encoding
Specifies the output encoding.
DESCRIPTION
parquet-tools is a command-line utility designed for inspecting and manipulating Parquet files. It provides functionalities such as schema viewing, data content inspection, metadata retrieval, and data conversion. It's a valuable tool for developers, data engineers, and data scientists working with Parquet datasets.
The tool lets users understand the structure of Parquet files, verify data integrity, and perform basic data exploration without relying on heavyweight data processing frameworks such as Spark or Hadoop. Common uses include confirming the schema of a file before ingesting it into a database, or quickly extracting a sample of data for testing purposes.
parquet-tools is typically distributed as part of the Apache Parquet project. It supports the data types and compression codecs commonly used in Parquet files. With parquet-tools you can view the contents of specific columns, filter data to a limited extent (by converting it to another format first), and get a high-level overview of the statistics stored within the Parquet file's metadata.
CAVEATS
parquet-tools is designed for relatively simple operations on Parquet files. It is not a replacement for full-fledged data processing engines such as Spark or Hadoop when it comes to complex data transformations or analysis. Beyond merging existing files, writing Parquet data is not supported.
Ensure that Java is installed and configured correctly in your environment, as parquet-tools often relies on the Java runtime.
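A quick way to confirm that a Java runtime is reachable before invoking parquet-tools is to check the PATH; a sketch using only the Python standard library (running `java -version` in a shell works equally well):

```python
import shutil

# Look up the `java` executable on the PATH; returns None if absent.
java_path = shutil.which("java")
if java_path is None:
    print("No 'java' executable found on PATH; install a JRE/JDK first.")
else:
    print(f"Found Java runtime at {java_path}")
```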
ERROR HANDLING
Pay close attention to error messages. Common issues include incorrect file paths, corrupted Parquet files, and missing Java dependencies. Verbose output often provides clues for troubleshooting.
USAGE EXAMPLES
Common tasks include extracting the schema:
parquet-tools schema mydata.parquet
Printing the first 10 rows as JSON:
parquet-tools cat --json --max-rows 10 mydata.parquet
HISTORY
parquet-tools emerged as part of the Apache Parquet project. As Parquet gained popularity as a storage format, the need for a simple command-line tool for inspection and basic manipulation became apparent. It has evolved alongside the Parquet format itself, adding support for new features and data types as they were introduced. The tool has proven invaluable for debugging, validation, and quick exploration of Parquet datasets across various environments.
SEE ALSO
hadoop(1), spark-submit(1)