Apache Hive – The Complete Advanced Guide

Apache Hive is an essential tool in the Hadoop ecosystem, providing SQL-like access to data stored in HDFS, HBase, and other Hadoop storage layers. In this comprehensive guide, we will cover Hive in depth, including its architecture, features, use cases, performance optimization, and integration with other frameworks.

Overview

Hive provides a SQL processing layer on top of data stored in HDFS so that analysts and developers familiar with SQL can query the data. It translates queries written in Hive Query Language (HQL) into underlying MapReduce, Tez, or Spark jobs for execution on the cluster.

Key capabilities:

  • Familiar SQL interface for data in Hadoop & NoSQL stores
  • Metadata management and schema on read
  • Highly optimized for large scale data processing
  • Very cost effective and scalable

We will explore all key areas of Hive in this guide from an advanced user perspective.

Diving Deep into Hive Architecture

The Hive architecture consists of various loosely coupled components for the end-to-end execution of SQL queries:

1. CLI/JDBC/ODBC Clients

Different types of client drivers allow submission of HiveQL queries to the Hive cluster, including Beeline, JDBC, and ODBC clients.

2. Driver

The Driver manages the lifecycle of a query: compilation, optimization, and coordinated execution. It also maintains session handles and logs.

3. Compiler

Performs semantic analysis, query rewriting, and plan generation. It converts queries into a directed acyclic graph (DAG) of MapReduce or Spark jobs, applying extensive rule-based and cost-based optimizations.

4. Metastore

The metastore is one of Hive's most important components. It stores metadata about schema definitions, table definitions, partitions defined on those tables, and the location of table data on HDFS, along with other configuration information.

It is typically hosted on a relational database such as MySQL or PostgreSQL for persistence. Having a central metadata repository enables sharing across multiple data processing engines like Spark and Impala.

5. Execution Engine

Responsible for executing the jobs generated by the compiler on the Hadoop cluster using the MapReduce, Tez, or Spark execution engine. It interacts with YARN (or the JobTracker in classic MapReduce) to orchestrate and monitor job execution.

Tez and Spark add DAG-based execution that avoids MapReduce overhead, resulting in faster query execution. Tez also supports container reuse, which lowers latency.

6. HCatalog

HCatalog provides storage management, a relational view, and shared metadata access across engines such as Pig, MapReduce, and custom programs. This enables interoperability through a common metadata store.

This simplified diagram shows the interaction between these components:

[Figure: Hive architecture diagram]

Hive is designed to separate the compute layer (query execution) from the storage layer (data). This loose coupling avoids vendor lock-in and provides the flexibility to plug in different execution frameworks.

Next, we take a look at some of the advanced features provided by Hive.

Key Features

Hive incorporates many advanced features that extend its analytics capabilities on a Hadoop cluster. Let's explore some of the popular ones:

SQL Access with HiveQL

Hive allows regular SQL developers to access and analyze data on HDFS using a SQL-style syntax called HiveQL. It hides the complexity of Java MapReduce programming and execution.

HiveQL covers typical SQL operations like SELECT, joins, and aggregations; however, Hive is optimized for read-heavy data warehousing workloads rather than transactional ones.

JDBC/ODBC connectivity also allows existing BI tools and dashboards to operate over data accessed through Hive without much change.

Indexing

Older versions of Hive (up to 2.x) supported creating indexes on tables to speed up lookups on the indexed columns, making common queries faster. Note, however, that this feature saw limited adoption and was removed in Hive 3.0; ORC's built-in indexes and bloom filters, together with materialized views, are the recommended alternatives on current versions.

Partitioning

Partitioning breaks large datasets into more manageable parts along partition key columns such as date or city. Smaller partitions allow Hive to prune the data read during query execution, improving performance.

Partitioning works well when query filters contain partition column values, reducing IO. Hive maps each partition to a subdirectory of the table's directory on HDFS.
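
As a sketch, here is how a date-partitioned table might be declared and queried (the table and column names are hypothetical):

    -- Each distinct dt value becomes a subdirectory under the table's HDFS path.
    CREATE TABLE sales (
      order_id BIGINT,
      amount   DOUBLE
    )
    PARTITIONED BY (dt STRING);

    -- The filter on the partition column lets Hive read only one partition.
    SELECT SUM(amount) FROM sales WHERE dt = '2023-01-15';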

Bucketing

Data in a Hive table can also be organized into buckets based on the hash of a column. This clusters related data together, improving compression and enabling parallel scans. Bucketing provides additional structure that helps query optimization and planning.
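
A minimal sketch of a bucketed table (hypothetical names; on Hive versions before 2.0 you may also need SET hive.enforce.bucketing=true; when loading data):

    -- Rows are assigned to one of 32 files per partition by hash(user_id).
    CREATE TABLE users (
      user_id BIGINT,
      name    STRING
    )
    CLUSTERED BY (user_id) INTO 32 BUCKETS;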

ACID Support

Hive has supported ACID (Atomicity, Consistency, Isolation, Durability) transactions since Hive 0.13, with major improvements in Hive 3. This brings Hive closer to a traditional RDBMS, although transactional tables carry restrictions, such as requiring the ORC format.

ACID allows concurrent readers and writers access to tables simultaneously without impacting data reliability and consistency critical for many enterprise use cases.

In early releases, ACID tables were not compatible with vectorized query execution (which improves performance through batch processing); later releases have lifted this restriction.
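
A sketch of a transactional table (hypothetical names; transactional tables must be stored as ORC, and versions before Hive 3 also required bucketing):

    CREATE TABLE accounts (
      id      BIGINT,
      balance DOUBLE
    )
    CLUSTERED BY (id) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    -- Row-level mutations become possible on transactional tables.
    UPDATE accounts SET balance = balance - 100 WHERE id = 42;
    DELETE FROM accounts WHERE balance < 0;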

Columnar Storage

Hive works very efficiently with columnar storage file formats like ORC or Parquet compared to row-oriented formats like Avro or plain text. Storing data column-wise yields large IO savings by reading only the columns required by the query instead of entire rows.

In some published benchmarks, ORC has shown several-fold improvements over row-oriented formats. ORC also supports built-in indexes, bloom filters, and lightweight data compression.
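
As a sketch, converting a text-format table to compressed ORC might look like this (the table names are hypothetical):

    CREATE TABLE events_orc (
      event_id BIGINT,
      payload  STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY');

    -- Rewrite existing rows into the columnar layout.
    INSERT OVERWRITE TABLE events_orc
    SELECT event_id, payload FROM events_text;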

Cost Based Optimization

The Cost Based Optimizer (CBO) uses statistics collected for tables and partitions to choose the cheapest among candidate query plans, for example different join orderings and join algorithms.

The CBO is built on the Apache Calcite framework.
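
The CBO only helps if statistics exist. A typical sequence looks like this (assuming the hypothetical sales table from the partitioning example above):

    SET hive.cbo.enable = true;
    SET hive.compute.query.using.stats = true;

    -- Collect table/partition-level and column-level statistics.
    ANALYZE TABLE sales PARTITION (dt) COMPUTE STATISTICS;
    ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;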

Low Latency Analytical Processing

LLAP (Low Latency Analytical Processing) is a set of enhancements that enable interactive SQL analytics on Hadoop by pre-caching data in memory. This avoids the overhead of starting JVMs and new containers, improving concurrency and reducing query times to sub-second in some cases.

LLAP runs persistent daemons that execute query fragments and cache data in memory, with additional tooling for monitoring and management.
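
A sketch of the session settings typically involved once an LLAP daemon cluster is running (exact configuration varies by distribution and version):

    SET hive.execution.engine = tez;     -- LLAP runs on top of Tez
    SET hive.execution.mode = llap;      -- route query fragments to LLAP daemons
    SET hive.llap.execution.mode = all;  -- run all operators inside LLAP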

Use Cases

Let's look at some popular use cases where Hive brings significant productivity and performance improvements:

Data Warehousing

Hive is commonly used to enable data warehouse type workloads on Hadoop for business intelligence and analytics use cases. This includes:

  • Analyzing weblogs for visitor clickstream analysis
  • Processing financial trade transactions for reporting
  • Building enterprise data warehouses and data lakes

Hive lowers the barrier to adoption by letting users analyze big data sets with SQL instead of writing complex MapReduce programs.

Companies like Nasdaq use Hive for risk management analytics on trade data stored in Hadoop replicated across data centers.

ETL Processing

Hive is a go-to tool for Hadoop-based ETL (extract, transform, load) processing, building on its popularity for data warehousing use cases.

Cleaning, transforming, and loading data into analytic databases involves many data types and sources, which Hive can process before loading the results into HDFS or Hive tables.

Key ETL capabilities provided by Hive:

  • Schemaless ETL processing through schema on read
  • Powerful SQL capabilities via HiveQL
  • Custom logic through UDFs
  • Partitioning for incremental ETL
  • Built in Input and Output formats
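
A minimal sketch of an incremental ETL step using partitioning (all table and variable names are hypothetical; ${hiveconf:load_date} would be supplied by the scheduler via --hiveconf):

    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- Rewrite only the partition for the day being processed.
    INSERT OVERWRITE TABLE clean_events PARTITION (dt)
    SELECT event_id,
           LOWER(TRIM(country)) AS country,
           dt
    FROM raw_events
    WHERE dt = '${hiveconf:load_date}';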

Companies like Spotify have used Hive in their large-scale analytics pipelines.

Machine Learning

Hive is commonly used by data scientists and engineers for preparing data for machine learning. Key ways Hive facilitates machine learning:

  • Handling missing data and feature engineering at scale
  • Performing grid search hyperparameter tuning using Hive UDFs
  • Building machine learning feature pipelines
  • Scoring models and making predictions on large datasets
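
As a sketch, feature preparation in Hive often amounts to aggregation plus missing-value handling (the tables and columns here are hypothetical):

    -- Build one feature row per user; COALESCE fills missing amounts with 0.
    CREATE TABLE user_features AS
    SELECT user_id,
           COUNT(*)                 AS num_orders,
           AVG(COALESCE(amount, 0)) AS avg_amount,
           MAX(dt)                  AS last_order_dt
    FROM orders
    GROUP BY user_id;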

Query Federation

Hive queries can federate across different underlying storage engines. For example, HDFS- and HBase-backed tables registered in the metastore can be queried through the same SQL interface.

This avoids having to work directly with the lower-level APIs of each engine and instead provides SQL access.
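
Once both sources are registered in the metastore, federation is just a query. A sketch, where orders is a hypothetical HDFS-backed table and customers_hbase is an HBase-backed table defined as in the HBase integration section below:

    -- One SQL statement spanning two storage engines.
    SELECT o.order_id, c.name, o.amount
    FROM orders o
    JOIN customers_hbase c
      ON o.customer_id = c.customer_id;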

Performance & Optimization

Hive offers multiple configuration, storage, execution, and query-level techniques to significantly boost performance and throughput.

Let's take a look at some popular techniques:

File Format

  • ORC vs Parquet – Use ORC or Parquet columnar formats for storage instead of row-oriented Avro or text, for compression and performance gains through encoding, predicate pushdown, and indexing.

Execution Engine

  • Tez vs Spark – Use Tez or Spark as the execution engine instead of MapReduce to improve latency through DAG-based execution that avoids MapReduce overhead.

Indexing

  • On Hive versions that still support indexes (2.x and earlier), create them for common query patterns; on Hive 3+, rely on ORC's built-in indexes and bloom filters instead.

Caching

  • Enable caching – use LLAP's in-memory data cache or, on Hive 3+, the query results cache so repeated queries avoid recomputation; Hive invalidates cached results when the underlying table data changes.

Partitioning

  • Partition tables on columns like date, region, etc. so that query filters reduce the amount of data scanned, improving response times.

Bucketing

  • Hash-bucket data to cluster related rows, enabling optimizations such as bucketed map joins, sampling, and parallel scans.

Compression

  • Leverage ORC compression capabilities – ZLIB, SNAPPY and more for reduced IO and network traffic.

Cost Based Optimization

  • Let the CBO choose the optimal plan based on statistics rather than heuristics alone. Turn it on with SET hive.cbo.enable=true; and keep table statistics up to date.
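
As a hedged starting point, several of the techniques above can be enabled per session; the right values depend on your workload and Hive version:

    SET hive.execution.engine = tez;               -- DAG execution instead of MapReduce
    SET hive.cbo.enable = true;                    -- cost-based plan selection
    SET hive.vectorized.execution.enabled = true;  -- process rows in batches
    SET hive.exec.compress.output = true;          -- compress job output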

Integration

Hive integrates closely with other parts of the ecosystem, such as Spark and NoSQL databases:

Spark

Hive can use Spark as its execution engine instead of MapReduce. This avoids MapReduce overhead and utilizes Spark's DAG engine for faster processing.

Spark SQL can also use the Hive metastore as its catalog, eliminating metadata duplication. Enabling Hive support in Spark (formerly via HiveContext) gives access to Hive UDFs and file formats like ORC.
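
If Hive on Spark is configured on the cluster, switching engines is a one-line session setting:

    -- Requires a cluster with Hive-on-Spark set up.
    SET hive.execution.engine = spark;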

HBase

Hive queries can be run directly on data stored in the HBase NoSQL store by defining an external table over existing column families, with the HBase storage handler and SerDe performing the translation.

HBase brings low-latency, random access that complements Hive's batch processing.
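
A sketch of mapping an existing HBase table into Hive (the HBase table name, column family, and columns are hypothetical):

    CREATE EXTERNAL TABLE customers_hbase (
      customer_id STRING,
      name        STRING,
      city        STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      -- :key maps to the HBase row key; the rest map to column family 'info'.
      'hbase.columns.mapping' = ':key,info:name,info:city'
    )
    TBLPROPERTIES ('hbase.table.name' = 'customers');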

Druid

Druid is a high-performance real-time analytics database designed for fast aggregations on time series data. The Druid storage handler allows querying data stored in Druid clusters directly from Hive.
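
A sketch of exposing an existing Druid datasource to Hive, letting Hive infer the columns (the datasource name is hypothetical):

    CREATE EXTERNAL TABLE druid_metrics
    STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
    TBLPROPERTIES ('druid.datasource' = 'metrics');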

Presto

Presto, the SQL engine originally developed at Facebook, can use the Hive metastore for schema information and then query the underlying data directly at low latency with fast interactive response times.

Tools

Some useful tools that help improve productivity with Hive:

Beeline CLI

The Beeline JDBC client enables remote access to HiveServer2 to execute queries, monitor status, and view associated logs.

Hue

Hue provides an easy-to-use, browser-based SQL editor for querying Hive, with query history and job diagnostics. It integrates with HiveServer2.

Apache Zeppelin

Zeppelin notebooks allow creating interactive, shareable data analytics applications, enabling visualization and exploration of Hive data on a Hadoop cluster.

Ambari Hive View

Part of the Hortonworks Ambari platform for managing Hadoop clusters. Provides a centralized view for querying Hive and browsing its metadata: databases, tables, partitions, etc.

Conclusion

Hive has clearly evolved from its early days as a batch query engine to become a comprehensive data warehousing solution for huge datasets stored in HDFS and HBase.

With advancements like ACID support, LLAP caching, and tight integration with Spark and Tez, Hive delivers the scalable, performant analytical and ETL capabilities crucial to an enterprise big data strategy.

I hope this guide gave you a solid expert-level overview of the Apache Hive ecosystem and serves as a reference for your own Hadoop environments. Let me know in the comments if you have any other questions!