Hive: A Warehousing Tool

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It is used primarily to summarize and manage big data, and it makes it easier to query and analyze large datasets that reside in distributed storage.

Hive is a powerful tool for ETL. It is, however, relatively slow compared with traditional databases, and it doesn’t offer all the same SQL features or even the same database features. But it does support SQL, it does function as a database, and it gives more people access to Hadoop technology (even those who are not programmers). It offers a way to transform unstructured and semi-structured data into usable schema-based data.

History of Hive

A little history about Apache Hive will help you understand why it came into existence. When Facebook started gathering data and ingesting it into Hadoop, the data was coming in at a rate of tens of GB per day back in 2006. Then, in 2007, it grew to 1 TB/day and within a few years increased to around 15 TB/day. Initially, Python scripts were written to ingest the data into Oracle databases, but with the increasing data rate and the growing diversity of sources and types of incoming data, this was becoming difficult. The Oracle instances were filling up fast, and it was time to develop a new kind of system that could handle large amounts of data. Facebook built Hive so that most people who had SQL skills could use the new system with minimal changes compared to what other RDBMSs required.

The main features of Hive are:

  • Hive makes data summarization, querying, and analysis much easier.
  • Hive supports external tables, which make it possible to process data where it already lives, without moving it into Hive's own warehouse directory.
  • Apache Hive sits on top of Hadoop's low-level interfaces, so users do not have to work with them directly.
  • It also supports partitioning of data at the level of tables to improve performance.
  • Hive has a rule-based optimizer for optimizing logical plans.
  • It is scalable, familiar, and extensible.
  • Using HiveQL doesn’t require knowledge of any programming language. Knowledge of basic SQL queries is enough.
  • We can easily process structured data in Hadoop using Hive.
  • Querying in Hive is very simple as it is similar to SQL.
  • We can also run ad hoc queries for data analysis using Hive.
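
To make the partitioning point above concrete, here is a minimal Python sketch. It is a toy model, not Hive code: the class `PartitionedTable`, its methods, and the column name `dt` are all hypothetical. The one thing it borrows from Hive is the idea of keeping each partition value in its own directory (named like `dt=2020-01-01`), so that a query filtering on the partition column only has to read the matching directory.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy model of a table partitioned on one column (hypothetical, not Hive's API)."""

    def __init__(self, partition_column):
        self.partition_column = partition_column
        # One "directory" per partition value, keyed like dt=2020-01-01.
        self.partitions = defaultdict(list)

    def insert(self, row):
        key = f"{self.partition_column}={row[self.partition_column]}"
        self.partitions[key].append(row)

    def scan(self, partition_value=None):
        """Return (rows, number_of_partitions_read)."""
        if partition_value is None:
            # No filter on the partition column: every partition is read.
            keys = list(self.partitions)
        else:
            # Filter on the partition column: only the matching partition is read.
            key = f"{self.partition_column}={partition_value}"
            keys = [key] if key in self.partitions else []
        rows = [row for k in keys for row in self.partitions[k]]
        return rows, len(keys)


table = PartitionedTable("dt")
for dt, user in [("2020-01-01", "a"), ("2020-01-01", "b"), ("2020-01-02", "c")]:
    table.insert({"dt": dt, "user": user})

full_rows, full_read = table.scan()
day_rows, day_read = table.scan("2020-01-02")
print(full_read, day_read)  # prints: 2 1
```

The filtered scan touches one partition instead of two. On real tables with thousands of partitions, skipping the non-matching ones is where the performance gain comes from.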

 

The importance of Hive in Hadoop

Apache Hive lets you work with Hadoop in a very efficient manner. It is a complete data warehouse infrastructure built on top of the Hadoop framework. Hive is well placed to query data and to perform powerful analysis and data summarisation on large volumes of data. An integral part of Hive is HiveQL, an SQL-like query language that is used extensively to query the data stored in Hive databases.

Thanks to its SQL-like features, Hive can read and write large data sets that are distributed across multiple locations with high throughput, even though individual queries have high latency. It applies a structure (a schema) to data that is already stored. Users can connect to Hive using a command-line tool or a JDBC driver.

Apache Hive also has some disadvantages, which are important from a learner’s point of view.

Some of them are:

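  • It does not offer real-time queries.
  • It does not offer row-level updates by default (newer versions, from Hive 0.14, add them only for transactional tables).
  • At best, it provides acceptable, not low, latency for interactive data browsing.
  • Sub-query support is limited (older versions allow sub-queries only in the FROM clause).
  • Latency for Apache Hive queries is generally very high.
  • It is not designed for Online Transaction Processing (OLTP).
  • It supports overwriting or appending data, but not updates and deletes, on ordinary (non-transactional) tables.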
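
The last limitation is easier to see with a small model. The Python sketch below is only an illustration under stated assumptions, not how Hive is implemented: `Partition` and `update_by_rewrite` are hypothetical names, and the "files" are plain tuples standing in for write-once data files. It shows why a store that can only append or overwrite has to rewrite a whole partition to change a single row.

```python
class Partition:
    """Toy stand-in for one table partition: a list of immutable data files."""

    def __init__(self):
        self.files = []  # each file is a tuple of rows, never edited in place

    def append(self, rows):
        # Append-style insert: add a new file, leave existing files alone.
        self.files.append(tuple(rows))

    def overwrite(self, rows):
        # Overwrite-style insert: drop every old file, write one new file.
        self.files = [tuple(rows)]

    def rows(self):
        return [row for f in self.files for row in f]


def update_by_rewrite(partition, match, change):
    """Emulate a row-level update by rewriting the whole partition."""
    partition.overwrite([change(r) if match(r) else r for r in partition.rows()])


p = Partition()
p.append([("a", 1), ("b", 2)])
p.append([("c", 3)])
update_by_rewrite(p, lambda r: r[0] == "b", lambda r: (r[0], 20))
print(p.rows())      # prints: [('a', 1), ('b', 20), ('c', 3)]
print(len(p.files))  # prints: 1
```

Changing one row meant reading and rewriting all three. That cost is acceptable for batch jobs and is the reason this style of storage suits warehousing better than transaction processing.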

For a more complete picture of Apache Hive, you can refer to the Hive Tutorial by Edureka.

You can also refer to What is Hive?

You should also have an understanding of Hadoop to get a good grip on Hive. Edureka provides a good playlist of Hadoop tutorial videos as well as a Hadoop tutorial blog series.

The following books are useful for learning Apache Hive:

  • Programming Hive by Edward Capriolo, Dean Wampler, and Jason Rutherglen – O’Reilly
  • Apache Hive Essentials by Dayong Du – Packt Publishing