Big Data – Interacting with Hadoop – What is Sqoop? – What is Zookeeper? – Day 17 of 21

In yesterday’s blog post we learned the importance of Pig and Pig Latin in the Big Data Story. In this article we will look at what Sqoop and Zookeeper are and where they fit in the Big Data Story.

There are two important components one should learn about when learning to interact with Hadoop: Sqoop and Zookeeper.

What is Sqoop?

Most businesses store their data in an RDBMS as well as in other data warehouse solutions. They need a way to move that data into the Hadoop system for processing and to return the results from Hadoop back to the RDBMS. The data movement can happen in real time or in bulk at various intervals; Sqoop is aimed at the bulk case. We need a tool which can help us move this data from SQL to Hadoop and from Hadoop to SQL. Sqoop (SQL to Hadoop) is such a tool: it extracts data from non-Hadoop data sources, transforms it into a format Hadoop can use, and then loads it into HDFS. Essentially it is an ETL tool that Extracts, Transforms and Loads from SQL to Hadoop. The best part is that it also works in the other direction, extracting data from Hadoop and loading it into non-Hadoop data stores such as an RDBMS.

In short, Sqoop is a command line tool (a command line interpreter) which moves data from SQL to Hadoop and from Hadoop to SQL. It creates MapReduce jobs behind the scenes to import data from an external database into HDFS. It is very effective and easy to learn, even for nonprogrammers.
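As a rough illustration of the two directions described above, the sketch below uses Python only to assemble `sqoop import` and `sqoop export` command lines; it does not run them. The JDBC URL, table names, HDFS paths and user name are made-up placeholders, and the helper function names are hypothetical. Only a handful of common Sqoop options are shown, so treat it as a starting point rather than a complete invocation.

```python
import shlex


def build_sqoop_import(jdbc_url, table, target_dir, username, num_mappers=4):
    """Assemble an argument list for copying one RDBMS table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source database
        "--username", username,
        "--table", table,                   # source table in the RDBMS
        "--target-dir", target_dir,         # destination directory in HDFS
        "--num-mappers", str(num_mappers),  # parallel map tasks for the copy
    ]


def build_sqoop_export(jdbc_url, table, export_dir, username):
    """Assemble an argument list for copying HDFS files back into an RDBMS table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,                   # destination table in the RDBMS
        "--export-dir", export_dir,         # HDFS directory holding the data
    ]


if __name__ == "__main__":
    # Placeholder values, assumed purely for illustration.
    url = "jdbc:mysql://db.example.com/sales"
    print(shlex.join(build_sqoop_import(url, "orders", "/data/orders", "etl_user")))
    print(shlex.join(build_sqoop_export(url, "order_totals", "/data/order_totals", "etl_user")))
```

The printed lines can be pasted into a shell on a machine where Sqoop is installed. Each one is turned into a MapReduce job behind the scenes, as noted above.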

What is Zookeeper?

Zookeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In other words, Zookeeper is a replicated synchronization service with eventual consistency.

In simpler words: in a Hadoop cluster there are many different nodes, and one node is the master. Let us assume that the master node fails for any reason. In this case, the role of the master node has to be transferred to a different node. The main role of the master node is managing the writers, as that task requires the order of writes to be preserved. In this kind of scenario Zookeeper makes it possible to pick a new master node and makes sure that the Hadoop cluster keeps working without any glitch. Zookeeper is Hadoop’s method of coordinating all the elements of these distributed systems. Here are a few of the tasks which Zookeeper is responsible for.

  • Zookeeper helps coordinate the starting and stopping of various nodes in the Hadoop cluster.
  • When a process in the Hadoop cluster needs certain configuration to complete its task, Zookeeper makes sure that the node gets the necessary configuration consistently.
  • If the master node fails, Zookeeper can assign a new master node and make sure the cluster works as expected.
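The master failover in the last point can be sketched with a toy, in-memory model. This is not the Zookeeper API; `ToyCoordinator` and its methods are hypothetical names invented for this illustration. It assumes the commonly used election approach in which every live node registers with an increasing sequence number, the node holding the lowest number acts as master, and a node's registration disappears when the node dies.

```python
import itertools


class ToyCoordinator:
    """In-memory stand-in for a coordination service (not real Zookeeper)."""

    def __init__(self):
        self._counter = itertools.count()
        self._candidates = {}  # sequence number -> node name

    def join(self, node_name):
        """Register a live node and return its sequence number."""
        seq = next(self._counter)
        self._candidates[seq] = node_name
        return seq

    def leave(self, seq):
        """Remove a node's registration, as would happen when the node fails."""
        self._candidates.pop(seq, None)

    def leader(self):
        """Return the live node with the lowest sequence number, or None."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


if __name__ == "__main__":
    zk = ToyCoordinator()
    a = zk.join("node-a")
    zk.join("node-b")
    zk.join("node-c")
    print(zk.leader())  # node-a registered first, so it is the master
    zk.leave(a)         # simulate the master failing
    print(zk.leader())  # node-b takes over
```

In a real cluster the registrations would live in Zookeeper itself and the surviving nodes would be notified of the change, but the rule for choosing the next master can be this simple.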

There are many other tasks Zookeeper performs when it comes to coordination and communication in a Hadoop cluster. Without the help of a service like Zookeeper, it is very hard to design a new fault tolerant distributed application.


In tomorrow’s blog post we will discuss a very important component of the Big Data Ecosystem: Big Data Analytics.

Reference: Pinal Dave (
