Big Data – Interacting with Hadoop – What is Sqoop? – What is Zookeeper? – Day 17 of 21

October 23, 2013

Big Data, SQL, SQL Server, SQL Tips and Tricks

In yesterday’s blog post we learned the importance of the Pig and Pig Latin in Big Data Story. In this article we will understand what is Sqoop and Zookeeper in Big Data Story.

There are two most important components one should learn when learning about interacting with Hadoop – Sqoop and Zookper.

What is Sqoop?

Most of the business stores their data in RDBMS as well as other data warehouse solutions. They need a way to move data to the Hadoop system to do various processing and return it back to RDBMS from Hadoop system. The data movement can happen in real time or at various intervals in bulk. We need a tool which can help us move this data from SQL to Hadoop and from Hadoop to SQL. Sqoop (SQL to Hadoop) is such a tool which extract data from non-Hadoop data sources and transform them into the format which Hadoop can use it and later it loads them into HDFS. Essentially it is ETL tool where it Extracts, Transform and Load from SQL to Hadoop. The best part is that it also does extract data from Hadoop and loads them to Non-SQL (or RDBMS) data stores. Essentially, Sqoop is a command line tool which does SQL to Hadoop and Hadoop to SQL. It is a command line interpreter. It creates MapReduce job behinds the scene to import data from an external database to HDFS. It is very effective and easy to learn tool for nonprogrammers.

Big Data - Interacting with Hadoop - What is Sqoop? - What is Zookeeper? - Day 17 of 21 sqoop

What is Zookeeper?

Big Data - Interacting with Hadoop - What is Sqoop? - What is Zookeeper? - Day 17 of 21 zookeeper ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In other words Zookeeper is a replicated synchronization service with eventual consistency. In simpler words – in Hadoop cluster there are many different nodes and one node is master. Let us assume that master node fails due to any reason. In this case, the role of the master node has to be transferred to a different node. The main role of the master node is managing the writers as that task requires persistence in order of writing. In this kind of scenario Zookeeper will assign new master node and make sure that Hadoop cluster performs without any glitch. Zookeeper is the Hadoop’s method of coordinating all the elements of these distributed systems. Here are few of the tasks which Zookeepr is responsible for.

Zookeeper manages the entire workflow of starting and stopping various nodes in the Hadoop’s cluster.
In Hadoop cluster when any processes need certain configuration to complete the task. Zookeeper makes sure that certain node gets necessary configuration consistently.
In case of the master node fails, Zookeepr can assign new master node and make sure cluster works as expected.

There many other tasks Zookeeper performance when it is about Hadoop cluster and communication. Basically without the help of Zookeeper it is not possible to design any new fault tolerant distributed application.