In yesterday’s blog post we learned what Hadoop is. In this article we will take a quick look at one of the four most important buzzwords in Big Data – MapReduce.
What is MapReduce?
MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Though MapReduce was originally proprietary Google technology, it has become quite a generic term in recent years.
MapReduce comprises two procedures: Map() and Reduce(). The Map() procedure performs filtering and sorting operations on the data, whereas the Reduce() procedure performs a summary operation on it. This model is based on modified versions of the map and reduce functions commonly found in functional programming. Libraries implementing the Map() and Reduce() procedures have been written in many different languages. The most popular free implementation of MapReduce is Apache Hadoop, which we introduced in yesterday’s blog post.
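To make the two procedures concrete, here is a minimal word-count sketch in plain Python. The names `map_phase` and `reduce_phase` are illustrative only and are not part of any Hadoop API; this simply shows Map() emitting key–value pairs and Reduce() summarizing them.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    """Map(): emit a (word, 1) pair for every word in one input record."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce(): summarize all the counts emitted for one word."""
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map step: run Map() over every document and collect the pairs.
pairs = [pair for doc in documents for pair in map_phase(doc)]

# Shuffle step: group the pairs by key so each Reduce() call
# sees all the values for exactly one word.
pairs.sort(key=itemgetter(0))
result = dict(
    reduce_phase(word, (count for _, count in group))
    for word, group in groupby(pairs, key=itemgetter(0))
)
# result["the"] == 3 and result["fox"] == 2
```

In a real cluster the Map and Reduce calls run on different machines and the shuffle happens over the network, but the logical flow is the same.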
Advantages of MapReduce Procedures
The MapReduce framework usually runs on a set of distributed servers and executes various tasks in parallel. Several components manage the communication between the various data nodes and provide high availability and fault tolerance. Programs written in the MapReduce functional style are automatically parallelized and executed on commodity machines. The framework takes care of the details of partitioning the data and executing the processes on the distributed servers at run time. If any node fails during this process, the framework maintains high availability and the remaining nodes take over the responsibility of the failed node.
As you can clearly see, the MapReduce framework provides much more than just the Map() and Reduce() procedures; it provides scalability and fault tolerance as well. A typical implementation of the MapReduce framework processes many petabytes of data across thousands of processing machines.
How Does the MapReduce Framework Work?
A typical MapReduce deployment contains petabytes of data and thousands of nodes. Here is a basic explanation of the MapReduce procedures that run on this massive collection of commodity servers.
There is always a master node in this infrastructure which takes the input. Right after taking the input, the master node divides it into smaller sub-inputs, or sub-problems. These sub-problems are distributed to worker nodes. A worker node then processes its sub-problem and performs the necessary analysis. Once the worker node completes the work on its sub-problem, it returns the result to the master node.
All the worker nodes return the answers to the sub-problems assigned to them to the master node. The master node collects those answers and aggregates them into the answer to the original big problem it was given.
The framework runs the Map() and Reduce() procedures above in parallel and independently of each other. All the Map() procedures can run in parallel, and once each worker node has completed its task it sends the result back to the master node, which compiles everything into a single answer. This approach can be very effective when it is applied to a very large amount of data (Big Data).
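The master/worker pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Hadoop is implemented: the "master" splits the input, "workers" run in parallel on sub-problems (here, partial sums), and the master aggregates the partial answers.

```python
from concurrent.futures import ThreadPoolExecutor

def split_input(numbers, n_workers):
    """Master: divide the big input into smaller sub-problems."""
    chunk = max(1, len(numbers) // n_workers)
    return [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]

def worker(sub_problem):
    """Worker: process one sub-problem independently (a partial sum)."""
    return sum(sub_problem)

numbers = list(range(1, 101))           # the "big" input: 1..100
sub_problems = split_input(numbers, 4)

# All workers run in parallel, independently of one another.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_answers = list(pool.map(worker, sub_problems))

# Master: aggregate the partial answers into the final answer.
total = sum(partial_answers)            # 5050
```

On a real cluster the workers would be separate machines and a node failure would cause its sub-problem to be rescheduled elsewhere, which is the fault tolerance discussed above.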
This Framework has five different steps:
- Preparing the Map() Input
- Executing the User-Provided Map() Code
- Shuffling the Map Output to the Reduce Processors
- Executing the User-Provided Reduce() Code
- Producing the Final Output
Here is the dataflow of the MapReduce framework:
- Input Reader
- Map Function
- Partition Function
- Compare Function
- Reduce Function
- Output Writer
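The six dataflow stages above can be sketched end to end in plain Python. Every function name below mirrors one stage but is purely illustrative (none of these are real Hadoop APIs); the partition function shows how each key is routed to exactly one reducer.

```python
def input_reader(source):
    """Input reader: split the raw input into records for the mappers."""
    return source.splitlines()

def map_function(record):
    """Map function: emit (key, value) pairs for one record."""
    return [(word.lower(), 1) for word in record.split()]

def partition_function(key, n_reducers):
    """Partition function: route each key to exactly one reducer."""
    return hash(key) % n_reducers

def compare_function(pair):
    """Compare function: the ordering used to sort a reducer's input."""
    return pair[0]

def reduce_function(key, values):
    """Reduce function: summarize all the values for one key."""
    return (key, sum(values))

def output_writer(results):
    """Output writer: persist the final results (here, just a dict)."""
    return dict(results)

n_reducers = 2
buckets = [[] for _ in range(n_reducers)]

# Read, map, and partition: each (key, value) pair lands in the
# bucket of the reducer responsible for that key.
for record in input_reader("red green red\nblue green red"):
    for key, value in map_function(record):
        buckets[partition_function(key, n_reducers)].append((key, value))

# Sort each reducer's input, then reduce one key at a time.
results = []
for bucket in buckets:
    bucket.sort(key=compare_function)
    current, values = None, []
    for key, value in bucket:
        if current is not None and key != current:
            results.append(reduce_function(current, values))
            values = []
        current = key
        values.append(value)
    if current is not None:
        results.append(reduce_function(current, values))

output = output_writer(results)   # {'red': 3, 'green': 2, 'blue': 1}
```

Partitioning by a hash of the key is also how real MapReduce implementations guarantee that every occurrence of a key ends up at the same reducer.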
In a future blog post in this 21-day series we will explore the various components of this subject in detail.
MapReduce is equivalent to the SELECT and GROUP BY of a relational database, applied to a very large database.
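That analogy can be made concrete. A query such as `SELECT color, COUNT(*) FROM items GROUP BY color` corresponds to a Map() that emits the grouping column as the key and a Reduce() that counts per key. A small sketch of that correspondence (the table and column names are made up for illustration):

```python
from collections import defaultdict

# The "table": one tuple per row, with a single 'color' column.
rows = [("red",), ("green",), ("red",), ("blue",)]

# Map(): emit (color, 1) -- like selecting the GROUP BY column.
mapped = [(color, 1) for (color,) in rows]

# Reduce(): sum the values per key -- like COUNT(*) per group.
counts = defaultdict(int)
for color, one in mapped:
    counts[color] += one

# dict(counts) == {"red": 2, "green": 1, "blue": 1}
```

The difference is scale: the database runs the GROUP BY on one server, while MapReduce spreads the same grouping and counting across thousands of nodes.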
In tomorrow’s blog post we will discuss the buzzword HDFS. Stay tuned.
Reference: Pinal Dave (https://blog.sqlauthority.com)
Pinal, glad to see your hold on new terms and technologies. Keep it up, I am encouraged!
Very well presented. I might have a very good use case coming close to the end of the 21 days. Pinal D. for President!
Your last line was very impressive: “MapReduce is equivalent to SELECT and GROUP BY of a relational database for a very large database.”
Exactly, that is where I actually understood what this was all about.
Great work, awesome article… Kindly add a few examples (real-life scenarios) for MapReduce.
Really impressive lines “MapReduce is equivalent to SELECT and GROUP BY of a relational database for a very large database.”
“MapReduce is equivalent to SELECT and GROUP BY of a relational database for a very large database.” – this line makes the process easier to visualize, as we are always in touch with these keywords.
Tip: it would be easier to read if you linked each post to the previous and next ones.
Hi Pinal, in transactional replication, what value do I need to give for @schema_option in order to create the stored procedures for insert, update, and delete at the subscriber?
Thank you, THANK YOU; you explain it very easily and clearly. The last line makes everything clear about MapReduce.
Thanks for your comment, huda.
Pinal, I am a fresher and I have read multiple articles on Big Data. None was so clear and easy to understand. This 21-day series encourages me to read a lot about the technology. Thank you so much for this article. You rock. Add me to your fan list :P
After browsing through loads of Big Data tutorials, and signing up for an expensive Big Data course, I stumbled onto this post, which is so easy for a beginner to understand. Great work.