hive
Execute SQL-like queries on Hadoop data
TLDR
Start a Hive interactive shell
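    hive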
Run a HiveQL query from the command line
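    hive -e "<query_string>"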
Run a HiveQL file with a variable substitution
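    hive --hivevar <variable>=<value> -f <filepath>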
Run a HiveQL script with a Hive configuration property set (e.g. mapred.reduce.tasks=32)
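    hive --hiveconf mapred.reduce.tasks=32 -f <filepath>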
SYNOPSIS
hive [options] [-e <query_string> | -f <filepath>]
Common usage involves either interactive mode (running hive without arguments to enter the shell) or executing a single query/script.
PARAMETERS
-e <quoted_query_string>
Execute a single HiveQL command. The query string must be quoted.
-f <filepath>
Execute HiveQL commands from the specified file.
-H, --help
Display help information and exit.
-i <filepath>
Initialization script. Execute HiveQL commands from this file before entering interactive mode or executing the main query/script.
-S, --silent
Run in silent mode, suppressing progress and informational messages to output only query results.
-v, --verbose
Run in verbose mode, displaying detailed execution information.
-p, --print-header
Print column headers in the query output.
--hiveconf <property=value>
Set a Hive configuration property for this session. Can be used multiple times.
--hivevar <variable=value>
Define a Hive variable for substitution within HiveQL scripts (referenced as ${hivevar:myvar}); see the combined example after this list.
--database <database_name>
Specify the database to use upon starting the CLI.
--version
Print the Hive version and exit.
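As an illustrative combination of these options (the script, variable, and database names here are hypothetical), the following runs an initialization file, defines a variable, selects a database, and then executes a script that can reference ${hivevar:run_date}:
    hive -i init.hql --hivevar run_date=2024-01-01 --database sales -f daily_report.hql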
DESCRIPTION
The hive command is the primary command-line interface (CLI) for interacting with Apache Hive. Hive is a data warehouse software built on top of Apache Hadoop that provides a SQL-like query language called HiveQL for querying and managing large datasets stored in distributed storage systems like HDFS. The hive CLI allows users to submit HiveQL queries, manage tables and databases, and execute scripts, effectively abstracting the complexities of underlying MapReduce, Tez, or Spark jobs. It translates HiveQL statements into distributed computations, enabling data analysts and developers to leverage SQL knowledge for big data analysis without needing to write complex Java code for Hadoop. This makes Hive a powerful tool for ETL processes, data warehousing, and ad-hoc queries over massive datasets.
CAVEATS
Performance Overhead: Due to its reliance on Hadoop for execution, hive can incur significant startup and execution overhead, making it less suitable for low-latency queries or small datasets compared to traditional RDBMS.
Resource Intensive: Hive queries consume cluster resources (CPU, memory, disk I/O) proportional to the data size and query complexity.
Schema-on-Read: Hive applies the table schema when data is read rather than when it is loaded. This offers flexibility, but data that doesn't conform to the declared schema surfaces as NULLs or type errors at query time instead of being rejected at load time.
Not OLTP: Hive is designed for Online Analytical Processing (OLAP) and batch processing, not transactional workloads (OLTP). It does not offer the low-latency row-level updates and real-time inserts of a conventional RDBMS.
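To make the schema-on-read caveat concrete, here is a minimal HiveQL sketch (the table name, columns, and HDFS path are hypothetical). The schema below is only checked when the data is queried, so malformed rows surface as NULLs rather than failing at load time:
    -- External table over files that already exist in HDFS;
    -- nothing is validated until a query reads the data.
    CREATE EXTERNAL TABLE logs (
      ts    STRING,
      level STRING,
      msg   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw/logs';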
<B>INTERACTIVE MODE</B>
To enter the interactive Hive shell, simply run hive without any arguments. From the shell, you can type HiveQL statements directly; each statement is terminated with a semicolon and executed when you press Enter.
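A minimal session might look like this (my_table is a hypothetical table name):
    hive> SHOW TABLES;
    hive> SELECT count(*) FROM my_table;
    hive> quit;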
<B>SCRIPT EXECUTION</B>
For batch processing or repeated tasks, it is common to place HiveQL commands in a .hql file and execute it with hive -f my_script.hql. This is well suited to automated jobs such as scheduled ETL runs.
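As a sketch, a hypothetical my_script.hql could contain ordinary HiveQL statements such as:
    -- my_script.hql: plain HiveQL, executed top to bottom by hive -f
    USE sales;
    INSERT OVERWRITE TABLE daily_summary
    SELECT day, count(*) AS n FROM events GROUP BY day;
(The database and table names above are illustrative.)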
<B>CONFIGURATION MANAGEMENT</B>
Hive's behavior can be extensively configured via hive-site.xml or by passing --hiveconf arguments on the command line. These properties control everything from query optimization to resource allocation.
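For example, to print column headers and set the reducer count for a single run (hive.cli.print.header and mapred.reduce.tasks are standard property names; the script name is hypothetical):
    hive --hiveconf hive.cli.print.header=true --hiveconf mapred.reduce.tasks=32 -f my_script.hql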
HISTORY
Apache Hive originated at Facebook in 2007 as a project to provide a SQL-like interface for ad-hoc queries and analysis over their massive datasets stored in Hadoop Distributed File System (HDFS). It was open-sourced in 2008 and quickly became an Apache Top-Level Project. Initially, Hive relied solely on MapReduce as its execution engine. Over time, it evolved to support more efficient engines like Apache Tez and Apache Spark, significantly improving query performance. Its development focused on making big data accessible to analysts familiar with SQL, abstracting away the complexities of distributed programming.
SEE ALSO
hadoop(1), hdfs(1), beeline(1), spark-sql(1)