parquet-tools

Inspect and manipulate Parquet files

TLDR

Display the content of a Parquet file
$ parquet-tools cat [path/to/parquet]

Display the first few rows of a Parquet file
$ parquet-tools head [path/to/parquet]

Print the schema of a Parquet file
$ parquet-tools schema [path/to/parquet]

Print the metadata of a Parquet file
$ parquet-tools meta [path/to/parquet]

Print the content and metadata of a Parquet file
$ parquet-tools dump [path/to/parquet]

Concatenate several Parquet files into a target file
$ parquet-tools merge [path/to/parquet1] [path/to/parquet2] [path/to/target_parquet]

Print the count of rows in a Parquet file
$ parquet-tools rowcount [path/to/parquet]

Print the column and offset indexes of a Parquet file
$ parquet-tools column-index [path/to/parquet]

SYNOPSIS

parquet-tools [command] [options] [path/to/parquet...]

PARAMETERS

cat
    Displays the data within the Parquet file in a textual format (e.g., JSON). Often used with options like `--max-rows`.
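    For example, to print only the first 20 records (assuming the build supports `--max-rows`, as listed below):
    $ parquet-tools cat --max-rows 20 [path/to/parquet]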

head
    Displays the first few rows of the Parquet file. Useful for quickly sampling the data.

schema
    Prints the schema of the Parquet file, including column names, data types, and nullability information.
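    Illustrative output for a hypothetical two-column file (exact type annotations vary by writer):
    message schema {
      optional int64 id;
      optional binary name (UTF8);
    }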

meta
    Displays the Parquet file's metadata, including statistics, compression algorithms, and other relevant details.
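    For machine-readable output, combine with the `--json` flag described below:
    $ parquet-tools meta --json [path/to/parquet]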

rowcount
    Prints the number of rows present in the Parquet file.

dump
    Dumps detailed internal information about the Parquet file.
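
merge
    Concatenates several input Parquet files into a single target file.

column-index
    Prints the column and offset indexes of the Parquet file.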

--json
    Outputs data or metadata in JSON format when used with `cat` or `meta`.

--max-rows
    Limits the number of rows printed by the `cat` or `head` command.

--disable-color
    Disables colored output in the terminal.

--encoding
    Specifies the output encoding of the printed data.

DESCRIPTION

parquet-tools is a command-line utility for inspecting and manipulating Parquet files. It provides schema viewing, data inspection, metadata retrieval, and basic data conversion, making it a valuable tool for developers, data engineers, and data scientists working with Parquet datasets.

The tool lets users understand the structure of Parquet files, verify data integrity, and perform basic data exploration without relying on heavyweight data processing frameworks like Spark or Hadoop. Common uses include confirming a file's schema before ingesting it into a database, or quickly extracting a sample of data for testing.
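
For example, a quick sample can be captured as JSON for testing (file names are placeholders; the flags are those listed under PARAMETERS):
$ parquet-tools cat --json --max-rows 100 events.parquet > sample.json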

parquet-tools is typically distributed as part of the Apache Parquet project and supports the data types and compression algorithms commonly used in Parquet files. With it, you can view the contents of specific columns, filter data to a limited extent by converting it to text and post-processing it with standard utilities, and get a high-level overview of the statistics stored in the Parquet file's metadata.
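
As a sketch of that conversion-based filtering, assuming `jq` is installed, that records are emitted one JSON object per row, and using a hypothetical status field:
$ parquet-tools cat --json events.parquet | jq 'select(.status == "error")'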

CAVEATS

parquet-tools is designed for relatively simple operations on Parquet files. It is not a replacement for full-fledged data processing engines like Spark or Hadoop when complex transformations or analysis are required, and beyond merging existing files it does not write Parquet data.
Ensure that Java is installed and configured correctly in your environment, as parquet-tools typically relies on the Java runtime.
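
A quick check that a Java runtime is available:
$ java -version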

ERROR HANDLING

Pay close attention to error messages. Common issues include incorrect file paths, corrupted Parquet files, and missing Java dependencies. Verbose output often provides clues for troubleshooting.

USAGE EXAMPLES

Common tasks include extracting the schema:
$ parquet-tools schema mydata.parquet

Printing the first 10 rows as JSON:
$ parquet-tools cat --json --max-rows 10 mydata.parquet
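
Merging two files and verifying the result (file names are placeholders):
$ parquet-tools merge part1.parquet part2.parquet merged.parquet
$ parquet-tools rowcount merged.parquet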

HISTORY

parquet-tools emerged as part of the Apache Parquet project. As Parquet gained popularity as a storage format, the need for a simple command-line tool for inspection and basic manipulation became apparent. It has evolved alongside the Parquet format itself, adding support for new features and data types as they were introduced. The tool has proven invaluable for debugging, validation, and quick exploration of Parquet datasets across various environments.

SEE ALSO

hadoop(1), spark-submit(1)
