In yesterday’s blog post we learned what Hadoop is. In this article we will take a quick look at one of the four most important buzz words around Big Data – MapReduce.
What is MapReduce?
MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Though MapReduce was originally proprietary Google technology, it has since become a generalized term.
MapReduce comprises Map() and Reduce() procedures. The Map() procedure performs filtering and sorting operations on the data, whereas the Reduce() procedure performs a summary operation on the data. This model is based on modified concepts of the map and reduce functions commonly available in functional programming. Libraries implementing the Map() and Reduce() procedures have been written in many different languages. The most popular free implementation of MapReduce is Apache Hadoop, which we introduced yesterday and whose components we will explore over the coming days.
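To make the two procedures concrete, here is a minimal word-count sketch in plain Python. It is not tied to any particular MapReduce library; the function names and sample documents are purely illustrative.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    """Map(): emit a (key, value) pair for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce(): summarize all the values emitted for one key."""
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: flatten the pairs emitted for every document.
pairs = [pair for doc in documents for pair in map_fn(doc)]

# Shuffle/sort phase: group pairs by key, as the framework would.
pairs.sort(key=itemgetter(0))
grouped = groupby(pairs, key=itemgetter(0))

# Reduce phase: one summary value per key.
result = [reduce_fn(word, (count for _, count in group))
          for word, group in grouped]
print(result)  # [('brown', 1), ('dog', 1), ('fox', 2), ...]
```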
Advantages of MapReduce Procedures
The MapReduce Framework usually consists of distributed servers and runs various tasks in parallel. Various components manage the communication between the different data nodes and provide high availability and fault tolerance. Programs written in the MapReduce functional style are automatically parallelized and executed on commodity machines. The MapReduce Framework takes care of the details of partitioning the data and executing the processes on the distributed servers at run time. If any node fails during this process, the framework maintains high availability by having the other available nodes take over the responsibility of the failed node.
As you can see, the MapReduce Framework provides much more than just the Map() and Reduce() procedures; it provides scalability and fault tolerance as well. A typical implementation of the MapReduce Framework processes many petabytes of data across thousands of processing machines.
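The failover behavior can be sketched in a few lines of Python. This is a simplified illustration of the retry idea only, not how any particular framework actually implements it; the task name, worker IDs, and failure rate are made up.

```python
import random

def run_task(task, worker_id):
    """Pretend to run a task on a worker; workers fail at random."""
    if random.random() < 0.3:
        raise RuntimeError(f"worker {worker_id} failed")
    return f"result of {task}"

def run_with_retry(task, workers, max_attempts=10):
    """Re-schedule the task on another worker whenever one fails."""
    for attempt in range(max_attempts):
        worker_id = workers[attempt % len(workers)]
        try:
            return run_task(task, worker_id)
        except RuntimeError:
            continue  # the framework masks the failure and retries elsewhere
    raise RuntimeError("task failed on all attempts")

print(run_with_retry("map-split-7", workers=[1, 2, 3]))
```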
How Does the MapReduce Framework Work?
A typical MapReduce deployment handles petabytes of data across thousands of nodes. Here is a basic explanation of how the MapReduce procedures use this massive cluster of commodity servers.
There is always a master node in this infrastructure which takes the input. The master node divides that input into smaller sub-inputs, or sub-problems, and distributes these sub-problems to worker nodes. Each worker node processes its sub-problem, performs the necessary analysis, and returns the result to the master node.
Once all the worker nodes have returned the answers to the sub-problems assigned to them, the master node collects those answers and aggregates them into the answer to the original big problem.
The framework runs the Map() and Reduce() procedures in parallel and independently of each other. All the Map() procedures can run in parallel, and once each worker node has completed its task, it sends the result back to the master node to be compiled into a single answer. This approach can be very effective when it is applied to a very large amount of data (Big Data).
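This master/worker flow maps naturally onto a process pool. Here is a rough single-machine simulation in Python, assuming a simple word-count task and an arbitrary chunk size; the chunking scheme is an illustration, not how a real framework splits its input.

```python
from multiprocessing import Pool
from collections import Counter

def worker(sub_input):
    """Worker node: solve one sub-problem and return a partial answer."""
    return Counter(sub_input.split())

if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog the fox"
    words = text.split()

    # Master node: divide the input into smaller sub-inputs.
    chunk = 4
    sub_inputs = [" ".join(words[i:i + chunk])
                  for i in range(0, len(words), chunk)]

    # Worker nodes: process the sub-problems in parallel.
    with Pool(processes=3) as pool:
        partial_answers = pool.map(worker, sub_inputs)

    # Master node: aggregate the partial answers into the final answer.
    final = sum(partial_answers, Counter())
    print(final.most_common(3))  # e.g. [('the', 3), ('fox', 2), ...]
```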
This framework has five different steps (a code sketch follows the list):
- Preparing Map() Input
- Executing User Provided Map() Code
- Shuffle Map Output to Reduce Processor
- Executing User Provided Reduce() Code
- Producing the Final Output
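Here is how those five steps might look stitched together in Python. This is a single-process sketch for illustration only; the `map_reduce` helper and its record numbering are assumptions, and a real framework distributes every one of these steps across the cluster.

```python
def map_reduce(records, map_fn, reduce_fn):
    """A single-process walk through the five MapReduce steps."""
    # 1. Prepare the Map() input: turn raw records into (key, value) pairs.
    map_inputs = [(i, record) for i, record in enumerate(records)]

    # 2. Execute the user-provided Map() code.
    intermediate = []
    for key, value in map_inputs:
        intermediate.extend(map_fn(key, value))

    # 3. Shuffle: route every intermediate key to its reduce processor.
    shuffled = {}
    for key, value in intermediate:
        shuffled.setdefault(key, []).append(value)

    # 4. Execute the user-provided Reduce() code, one call per key.
    reduced = [reduce_fn(key, values)
               for key, values in sorted(shuffled.items())]

    # 5. Produce the final output.
    return reduced

# Word count expressed against this skeleton.
out = map_reduce(
    ["the quick fox", "the fox"],
    map_fn=lambda _, line: [(w, 1) for w in line.split()],
    reduce_fn=lambda word, counts: (word, sum(counts)),
)
print(out)  # [('fox', 2), ('quick', 1), ('the', 2)]
```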
Here is the dataflow of the MapReduce Framework (the partition step is sketched after the list):
- Input Reader
- Map Function
- Partition Function
- Compare Function
- Reduce Function
- Output Writer
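Of these components, the partition function is worth a quick illustration: it decides which reduce processor receives each intermediate key, so that all values for one key end up on the same reducer. A hash-modulo scheme is a common choice; the sketch below is an assumption for illustration, not any specific framework's implementation.

```python
def partition(key, num_reducers):
    """Partition function: route a key to one of the reduce processors.
    Hash-mod is a typical choice; any deterministic scheme works, since
    the same key must always land on the same reducer within a job."""
    return hash(key) % num_reducers

# Every occurrence of the same key goes to the same reducer.
for key in ["fox", "dog", "fox"]:
    print(key, "->", "reducer", partition(key, num_reducers=4))
```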
In a future blog post of this 31 day series we will explore the various components of this subject in detail.
For a very large database, MapReduce is roughly the equivalent of SELECT with GROUP BY in a relational database.
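That analogy is easy to verify: the word count our Map() and Reduce() sketches produce is exactly what SELECT ... GROUP BY returns over the same rows. Here is a quick check with Python's built-in sqlite3 module (the table and column names are made up for this example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("the",), ("fox",), ("the",), ("dog",)],
)

# The relational equivalent of the map/shuffle/reduce pipeline:
# GROUP BY plays the role of the shuffle, COUNT(*) the Reduce().
rows = conn.execute(
    "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
).fetchall()
print(rows)  # [('dog', 1), ('fox', 2), ('the', 2)]
```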
In tomorrow’s blog post we will discuss Buzz Word – HDFS. Stay tuned for the same.
Reference: Pinal Dave (http://blog.sqlauthority.com)