pig
Analyze large datasets using Hadoop
SYNOPSIS
pig [-D property=value]... [-x execution_mode] [-p name=value]... [-f script_file] [arguments...]
To launch the Grunt shell:
pig [-D property=value]... [-x execution_mode]
PARAMETERS
-x local|mapreduce|tez|spark
Specifies the execution mode for Pig. `local` runs on the local filesystem, `mapreduce` uses Hadoop MapReduce, `tez` uses Apache Tez, and `spark` uses Apache Spark as the execution engine.
-f script_file
Executes the Pig Latin script specified by `script_file`.
-p name=value
Sets a parameter for substitution in the Pig script; the value can be referenced in the script as `$name`. Can be specified multiple times.
-D property=value
Sets a Java system property for the Pig execution, overriding default configuration. `-D` options must appear before any Pig-specific command-line options. Can be specified multiple times; a combined example follows this parameter list.
-l log_file
Specifies the path for the Pig log file.
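For example, the following invocation runs a script in MapReduce mode with a substituted parameter, an overridden Hadoop property, and a custom log path. The script name, paths, and queue name are placeholders for illustration:

    pig -Dmapreduce.job.queuename=analytics \
        -x mapreduce \
        -p INPUT=/data/raw/2024-01-01 \
        -l /var/log/pig/etl.log \
        -f etl.pig

Inside `etl.pig`, the substituted parameter is available as `$INPUT`.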
DESCRIPTION
The pig command is the primary interface to Apache Pig, a high-level platform for analyzing large datasets on Hadoop. It provides a scripting language called Pig Latin, which abstracts away the complexities of writing MapReduce programs by hand. Users can run Pig Latin scripts in batch mode, or interactively explore and process data in the Grunt shell.

Pig is particularly useful for ETL (Extract, Transform, Load) operations, data warehousing, and general data analysis on massive datasets stored in HDFS or other compatible file systems. It compiles Pig Latin scripts into a series of jobs for the selected execution engine (MapReduce, Tez, or Spark), enabling scalable, fault-tolerant data processing across a Hadoop cluster. Local mode is intended for development and testing; the distributed modes are intended for production workloads.
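As a minimal batch-mode sketch, the following script (the input path, field layout, and file names are assumed for illustration) filters server-error lines out of a space-delimited access log and stores the result:

    -- contents of errors.pig (hypothetical)
    logs   = LOAD '$INPUT' USING PigStorage(' ')
             AS (ip:chararray, ts:chararray, url:chararray, status:int);
    errors = FILTER logs BY status >= 500;
    STORE errors INTO '/data/errors';

It could then be run with, e.g., `pig -x mapreduce -p INPUT=/data/access_log -f errors.pig`.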
CAVEATS
The pig command requires a Java Runtime Environment (JRE) and typically a Hadoop installation to function correctly, especially for distributed execution modes. Its performance is heavily dependent on the underlying Hadoop cluster configuration and resource availability. Debugging Pig scripts can be challenging because runtime errors surface in the generated MapReduce (or Tez/Spark) jobs rather than in the script itself.
GRUNT SHELL
When the pig command is invoked without a script file (e.g., `pig -x local`), it launches the Grunt shell. This interactive command-line interface allows users to execute Pig Latin commands one by one, test queries, and inspect data. It's an invaluable tool for prototyping and debugging Pig scripts.
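A short interactive session might look like the following; `input.txt` is a placeholder file in the current directory:

    $ pig -x local
    grunt> lines = LOAD 'input.txt' AS (line:chararray);
    grunt> DESCRIBE lines;
    lines: {line: chararray}
    grunt> DUMP lines;
    grunt> quit;

`DESCRIBE` prints a relation's schema and `DUMP` materializes it to the console, which makes the shell well suited to step-by-step debugging.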
PIG LATIN
Pig Latin is the high-level data flow language used by Apache Pig. It provides relational-style operators like JOIN, GROUP, ORDER BY, and FILTER, as well as functions for data manipulation. Pig Latin scripts are compiled into MapReduce jobs, or jobs for other execution engines, by the Pig framework.
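A sketch combining these operators (relation names, paths, and schemas are assumptions for illustration):

    users   = LOAD 'users'  AS (uid:int, name:chararray);
    orders  = LOAD 'orders' AS (uid:int, amount:double);
    joined  = JOIN users BY uid, orders BY uid;
    grouped = GROUP joined BY users::name;
    totals  = FOREACH grouped GENERATE group AS name,
                                       SUM(joined.orders::amount) AS total;
    top     = ORDER totals BY total DESC;
    DUMP top;

Nothing is computed until a `DUMP` or `STORE` is reached, which lets Pig rearrange and combine the operators into an optimized job plan.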
HISTORY
Apache Pig was developed at Yahoo! in 2006 to provide a higher-level abstraction for writing and optimizing MapReduce programs on Hadoop. It was open-sourced through the Apache Software Foundation in 2007 and became a top-level Apache project in 2010. The initial motivation was to enable researchers and analysts to process large datasets quickly without writing complex Java MapReduce code, focusing on data flow instead. Its usage became widespread in big data analytics environments before the rise of Spark and other frameworks.