Big Data – Buzz Words: What is Hadoop – Day 6 of 21

October 8, 2013


In yesterday’s blog post we learned what NoSQL is. In this article we will take a quick look at one of the four most important buzz words surrounding Big Data – Hadoop.

What is Hadoop?

Apache Hadoop is a free, open-source, Java-based software framework that offers a powerful distributed platform to store and manage Big Data. It is licensed under the Apache License, Version 2.0. It runs applications on large clusters of commodity hardware and can process thousands of terabytes of data across thousands of nodes. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers. The major advantage of the Hadoop framework is that it provides reliability and high availability.


What are the core components of Hadoop?

There are two major components of the Hadoop framework, and each of them performs one of its two most important tasks.

Hadoop MapReduce is the method of splitting a large data problem into smaller chunks and distributing them to many different commodity servers. Each server has its own set of resources and processes its chunk locally. Once the commodity servers have processed the data, they send their results back to the main server, where they are combined. This is how we process large data effectively and efficiently. (We will understand this in tomorrow’s blog post).
Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between HDFS and any other file system. When we move a file onto HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually 3 copies) for fault tolerance and high availability. (We will understand this in the day after tomorrow’s blog post).
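
To make the split-process-combine idea a little more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API: the function names (split_into_chunks, map_chunk, reduce_counts, word_count) are made up for this illustration, and threads stand in for the commodity servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, num_chunks):
    """Split the input into roughly equal chunks, one per simulated server."""
    size = max(1, -(-len(lines) // num_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_chunk(chunk):
    """Map step: a server counts the words in its own chunk locally."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partial_counts):
    """Reduce step: the main server merges the partial results."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def word_count(lines, num_workers=3):
    """Count words by splitting the work, mapping in parallel, then reducing."""
    chunks = split_into_chunks(lines, num_workers)
    # Threads are only a stand-in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


if __name__ == "__main__":
    data = [
        "big data needs hadoop",
        "hadoop stores big data",
        "hadoop processes data",
    ]
    for word, count in sorted(word_count(data).items()):
        print(word, count)
```

Each call to map_chunk sees only its own chunk, which is the point of the model: the work can run wherever the data lives, and only the small partial results travel back to be merged.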

Besides the above two core components, the Hadoop project also contains the following modules.

Hadoop Common: Common utilities for the other Hadoop modules
Hadoop YARN: A framework for job scheduling and cluster resource management

There are a few other projects related to Hadoop as well (like Pig and Hive), which we will gradually explore in later blog posts.

A Multi-node Hadoop Cluster Architecture

Now let us quickly see the architecture of a multi-node Hadoop cluster.

[Figure: multi-node Hadoop cluster architecture]

A small Hadoop cluster includes a single master node and multiple worker (or slave) nodes. As discussed earlier, the entire cluster contains two layers: the MapReduce layer and the HDFS layer. Each of these layers has its own relevant components. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A worker node consists of a DataNode and a TaskTracker. It is also possible for a worker node to be a data-only or compute-only node. This flexibility is, in fact, a key feature of Hadoop.
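
The layout described above can be written down as a small data model. The sketch below is plain Python, not a Hadoop configuration: the Node class and build_small_cluster function are hypothetical names used only for illustration, while the daemon names come from the description above.

```python
from dataclasses import dataclass, field

# Daemon names as used in this post, grouped by the layer they belong to.
MAPREDUCE_LAYER = {"JobTracker", "TaskTracker"}
HDFS_LAYER = {"NameNode", "DataNode"}


@dataclass
class Node:
    """A machine in the cluster and the daemons it runs (illustrative only)."""
    name: str
    daemons: set = field(default_factory=set)

    def layers(self):
        """Return which of the two layers this node takes part in."""
        found = []
        if self.daemons & MAPREDUCE_LAYER:
            found.append("MapReduce")
        if self.daemons & HDFS_LAYER:
            found.append("HDFS")
        return found


def build_small_cluster(num_workers):
    """One master running all four daemons, plus identical worker nodes."""
    master = Node("master", {"JobTracker", "TaskTracker", "NameNode", "DataNode"})
    workers = [
        Node(f"worker{i}", {"TaskTracker", "DataNode"})
        for i in range(1, num_workers + 1)
    ]
    return [master] + workers


if __name__ == "__main__":
    cluster = build_small_cluster(2)
    cluster.append(Node("data-only", {"DataNode"}))        # storage only
    cluster.append(Node("compute-only", {"TaskTracker"}))  # processing only
    for node in cluster:
        print(node.name, sorted(node.daemons), node.layers())
```

A data-only node shows up in the HDFS layer alone and a compute-only node in the MapReduce layer alone, which mirrors the flexibility mentioned above.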

In this introductory blog post we will stop here with the architecture of Hadoop. In a future blog post of this 21-day series we will explore the various components of the Hadoop architecture in detail.

Why Use Hadoop?

There are many advantages of using Hadoop. Let me quickly list them here:

Robust and Scalable – We can add new nodes as needed, as well as modify them.
Affordable and Cost Effective – We do not need any special hardware for running Hadoop. We can just use commodity servers.
Adaptive and Flexible – Hadoop is built to handle both structured and unstructured data.
Highly Available and Fault Tolerant – When a node fails, the Hadoop framework automatically fails over to another node.
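
The fault tolerance point follows from the replication described earlier: if every piece of a file lives on three servers, losing one or two of them does not lose the file. Here is a minimal sketch of that idea in plain Python. It is not how HDFS is implemented: the tiny block size, the round-robin placement and all function names are assumptions made for this demo (real HDFS uses far larger blocks and a smarter placement policy).

```python
BLOCK_SIZE = 8    # bytes; deliberately tiny for the demo (assumption)
REPLICATION = 3   # the "usually 3" copies mentioned earlier in this post


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the file into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Simplified round-robin placement onto `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def store_file(data, nodes):
    """Split the file and copy each block to the nodes chosen for it."""
    blocks = split_into_blocks(data)
    placement = place_replicas(len(blocks), nodes)
    storage = {node: {} for node in nodes}
    for block_id, holders in placement.items():
        for node in holders:
            storage[node][block_id] = blocks[block_id]
    return placement, storage


def read_file(placement, storage, failed=frozenset()):
    """Rebuild the file, reading each block from any replica still alive."""
    parts = []
    for block_id in sorted(placement):
        live = [node for node in placement[block_id] if node not in failed]
        if not live:
            raise IOError(f"block {block_id} has no live replica")
        parts.append(storage[live[0]][block_id])
    return b"".join(parts)


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]
    original = b"hadoop splits files into blocks"
    placement, storage = store_file(original, nodes)
    # Two of the four nodes fail, yet every block still has a live copy.
    recovered = read_file(placement, storage, failed={"node1", "node2"})
    print(recovered == original)
```

With three copies of every block, any two nodes can fail and the file can still be read; only when all holders of some block are down does the read fail.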

Why is Hadoop named Hadoop?

Hadoop was created in 2005 by Doug Cutting and Mike Cafarella. Doug Cutting, who went on to develop it further at Yahoo, named Hadoop after his son’s toy elephant.

Tomorrow

In tomorrow’s blog post we will discuss Buzz Word – MapReduce.

Reference: Pinal Dave (https://blog.sqlauthority.com)


15 Comments

Rushik
October 8, 2013 11:46 am
Hello Sir,
Its a nice introduction of hadoop to start with…
vidya vrat agarwal
October 8, 2013 12:40 pm
Oops! I guess you wanted to type HDFS instead of HDFC in the image.
Raza Syed
October 8, 2013 12:57 pm
Hard to digest for RDBMS guys, but informative!
Arun Ramachandran
October 8, 2013 2:50 pm
Nice & clear explanation of Hadoop. After reading this article, I understood that you are confident even of next one week’s posts :)
Sushil
October 8, 2013 8:01 pm
Too good Pinal..
i want to know one thing. As everyone saying Big data is the next big thing in data management.. Does it have potential to replace RDBMS? If yes ,how ?and if not, how these two(RDBMS and BIG DATA) are different?
Thanks in advance for your answers.
k hari krishna
October 8, 2013 10:21 pm
can we have any practical demo kind of thing
Selva
October 9, 2013 12:21 am
Very nice blog. will big data fully supported by MSSQL since Hadoop is java based, any one implemented bigdata using MSSQL?
Rajaraman
October 9, 2013 1:44 am
Good one Pinal, Are you planning to cover the testing side of Big Data as well?
– Rajaraman R
Dale Vivian Ross
October 9, 2013 12:26 pm
Thank you again. I wish I could fast forward and get to the end now, but maybe this is just what I need, slow and steady, then I’ll be ready.