hive
Execute SQL-like queries on Hadoop data
TLDR
Start a Hive interactive shell
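    hive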
Run a HiveQL query from the command line
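    hive -e "<query_string>"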
Run a HiveQL file with a variable substitution
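    hive --hivevar <variable>=<value> -f <filepath>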
Run a HiveQL script with a Hive configuration property set (e.g. mapred.reduce.tasks=32)
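    hive --hiveconf mapred.reduce.tasks=32 -f <filepath>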
SYNOPSIS
hive [options] [-e <query_string> | -f <filepath>]
Common usage involves either interactive mode (running hive without arguments to enter the shell) or executing a single query/script.
PARAMETERS
-e <quoted_query_string>
Execute a single HiveQL command. The query string must be quoted.
-f <filepath>
Execute HiveQL commands from the specified file.
-H, --help
Display help information and exit.
-i <filepath>
Initialization script. Execute HiveQL commands from this file before entering interactive mode or executing the main query/script.
-S, --silent
Run in silent mode, suppressing progress and informational messages to output only query results.
-v, --verbose
Run in verbose mode, displaying detailed execution information.
-p, --print-header
Print column headers in the query output.
--hiveconf <property=value>
Set a Hive configuration property for this session. Can be used multiple times.
--hivevar <variable=value>
Define a Hive variable for substitution within HiveQL scripts (referenced as ${hivevar:myvar}); see the combined example after this list.
--database <database_name>
Specify the database to use upon starting the CLI.
--version
Print the Hive version and exit.
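As an illustrative combination of these options (the script, variable, and database names here are hypothetical), the following runs an initialization file, defines a variable, selects a database, and then executes a script that can reference ${hivevar:run_date}:
    hive -i init.hql --hivevar run_date=2024-01-01 --database sales -f daily_report.hql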
DESCRIPTION
The hive command is the primary command-line interface (CLI) for interacting with Apache Hive. Hive is a data warehouse software built on top of Apache Hadoop that provides a SQL-like query language called HiveQL for querying and managing large datasets stored in distributed storage systems like HDFS. The hive CLI allows users to submit HiveQL queries, manage tables and databases, and execute scripts, effectively abstracting the complexities of underlying MapReduce, Tez, or Spark jobs. It translates HiveQL statements into distributed computations, enabling data analysts and developers to leverage SQL knowledge for big data analysis without needing to write complex Java code for Hadoop. This makes Hive a powerful tool for ETL processes, data warehousing, and ad-hoc queries over massive datasets.
CAVEATS
Performance Overhead: Due to its reliance on Hadoop for execution, hive can incur significant startup and execution overhead, making it less suitable for low-latency queries or small datasets compared to traditional RDBMS.
Resource Intensive: Hive queries consume cluster resources (CPU, memory, disk I/O) proportional to the data size and query complexity.
Schema-on-Read: Hive applies the table schema when data is read rather than when it is loaded. This offers flexibility, but data that doesn't conform to the declared schema surfaces as NULLs or type errors at query time instead of being rejected at load time.
Not OLTP: Hive is designed for Online Analytical Processing (OLAP) and batch processing, not transactional workloads (OLTP). It does not offer the low-latency row-level updates and real-time inserts of a conventional RDBMS.
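To make the schema-on-read caveat concrete, here is a minimal HiveQL sketch (the table name, columns, and HDFS path are hypothetical). The schema below is only checked when the data is queried, so malformed rows surface as NULLs rather than failing at load time:
    -- External table over files that already exist in HDFS;
    -- nothing is validated until a query reads the data.
    CREATE EXTERNAL TABLE logs (
      ts    STRING,
      level STRING,
      msg   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw/logs';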
<B>INTERACTIVE MODE</B>
To enter the interactive Hive shell, simply run hive without any arguments. From the shell, you can type HiveQL statements directly; each statement is terminated with a semicolon and executed when you press Enter.
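A minimal session might look like this (my_table is a hypothetical table name):
    hive> SHOW TABLES;
    hive> SELECT count(*) FROM my_table;
    hive> quit;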
<B>SCRIPT EXECUTION</B>
For batch processing or repeated tasks, it is common to place HiveQL commands in a .hql file and execute it with hive -f my_script.hql. This is well suited to automated jobs such as scheduled ETL runs.
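As a sketch, a hypothetical my_script.hql could contain ordinary HiveQL statements such as:
    -- my_script.hql: plain HiveQL, executed top to bottom by hive -f
    USE sales;
    INSERT OVERWRITE TABLE daily_summary
    SELECT day, count(*) AS n FROM events GROUP BY day;
(The database and table names above are illustrative.)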
<B>CONFIGURATION MANAGEMENT</B>
Hive's behavior can be extensively configured via hive-site.xml or by passing --hiveconf arguments on the command line. These properties control everything from query optimization to resource allocation.
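For example, to print column headers and set the reducer count for a single run (hive.cli.print.header and mapred.reduce.tasks are standard property names; the script name is hypothetical):
    hive --hiveconf hive.cli.print.header=true --hiveconf mapred.reduce.tasks=32 -f my_script.hql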
HISTORY
Apache Hive originated at Facebook in 2007 as a project to provide a SQL-like interface for ad-hoc queries and analysis over their massive datasets stored in Hadoop Distributed File System (HDFS). It was open-sourced in 2008 and quickly became an Apache Top-Level Project. Initially, Hive relied solely on MapReduce as its execution engine. Over time, it evolved to support more efficient engines like Apache Tez and Apache Spark, significantly improving query performance. Its development focused on making big data accessible to analysts familiar with SQL, abstracting away the complexities of distributed programming.
SEE ALSO
hadoop(1), hdfs(1), beeline(1), spark-sql(1)