Big Data – Data Mining with Hive – What is Hive? – What is HiveQL (HQL)? – Day 15 of 21

October 21, 2013

Big Data, SQL, SQL Server, SQL Tips and Tricks

In yesterday’s blog post we learned the importance of the operational database in Big Data Story. In this article we will understand what is Hive and HQL in Big Data Story.

Yahoo started working on PIG (we will understand that in the next blog post) for their application deployment on Hadoop. The goal of Yahoo to manage their unstructured data. Similarly Facebook started deploying their warehouse solutions on Hadoop which has resulted in HIVE. The reason for going with HIVE is because the traditional warehousing solutions are getting very expensive.

What is HIVE?

Big Data - Data Mining with Hive - What is Hive? - What is HiveQL (HQL)? - Day 15 of 21 hive_logo Hive is a datawarehouseing infrastructure for Hadoop. The primary responsibility is to provide data summarization, query and analysis. It supports analysis of large datasets stored in Hadoop’s HDFS as well as on the Amazon S3 filesystem. The best part of HIVE is that it supports SQL-Like access to structured data which is known as HiveQL (or HQL) as well as big data analysis with the help of MapReduce. Hive is not built to get a quick response to queries but it it is built for data mining applications. Data mining applications can take from several minutes to several hours to analysis the data and HIVE is primarily used there.

HIVE Organization

The data are organized in three different formats in HIVE.

Tables: They are very similar to RDBMS tables and contains rows and tables. Hive is just layered over the Hadoop File System (HDFS), hence tables are directly mapped to directories of the filesystems. It also supports tables stored in other native file systems.

Partitions: Hive tables can have more than one partition. They are mapped to subdirectories and file systems as well.

Buckets: In Hive data may be divided into buckets. Buckets are stored as files in partition in the underlying file system.

Hive also has metastore which stores all the metadata. It is a relational database containing various information related to Hive Schema (column types, owners, key-value data, statistics etc.). We can use MySQL database over here.

Big Data - Data Mining with Hive - What is Hive? - What is HiveQL (HQL)? - Day 15 of 21 hive-arch

What is HiveSQL (HQL)?

Hive query language provides the basic SQL like operations. Here are few of the tasks which HQL can do easily.

Create and manage tables and partitions
Support various Relational, Arithmetic and Logical Operators
Evaluate functions
Download the contents of a table to a local directory or result of queries to HDFS directory

Here is the example of the HQL Query:

SELECT upper(name), salesprice
FROM sales;
SELECT category, count(1) 
FROM products 
GROUP BY category;

When you look at the above query, you can see they are very similar to SQL like queries.

Tomorrow

In tomorrow’s blog post we will discuss about very important components of the Big Data Ecosystem – Pig.

Reference: Pinal Dave (https://blog.sqlauthority.com)

SQL Scripts

Professional Development – Difference Between Bio, CV and Resume

Big Data – Interacting with Hadoop – What is PIG? – What is PIG Latin? – Day 16 of 21

3 Comments. Leave new

Sachin
May 6, 2015 5:47 pm
Good…i liked the post & also all data is very useful
Reply
- Pinal Dave
  May 8, 2015 1:24 pm
  Nice to hear that Sachin.
  Reply
Neeraj Aggarwal
May 22, 2016 1:23 am
Hi Pinal Dave, I just stumbled on your blog while searching to understand HIVE. I am a software testing professional and aspiring to have career in Big Data. Can you please spare couple for minutes for me to guide what and from where should i start?
Reply