Big Data – Buzz Words: What is MapReduce – Day 7 of 21

In yesterday’s blog post we learned what Hadoop is. In this article we will take a quick look at one of the four most important buzz words around Big Data: MapReduce.

What is MapReduce?

MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Although MapReduce was originally a proprietary Google technology, it has become a fairly generic term in recent times.

MapReduce comprises two procedures, Map() and Reduce(). The Map() procedure performs filtering and sorting operations on the data, whereas the Reduce() procedure performs a summary operation on the data. This model is based on modified concepts of the map and reduce functions commonly available in functional programming. Libraries that provide the Map() and Reduce() procedures have been written in many different languages. The most popular free implementation of MapReduce is Apache Hadoop, which we explored yesterday.
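To make the two procedures concrete, here is a minimal single-machine sketch in Python. It is not taken from any particular MapReduce library; the names map_words, reduce_counts and word_count are made up for this illustration. The map step emits a (word, 1) pair for every word, and the reduce step summarizes all the values that share a key.

```python
from collections import defaultdict

def map_words(line):
    """Map step (hypothetical name): emit a (word, 1) pair for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(word, counts):
    """Reduce step (hypothetical name): summarize all values emitted for one key."""
    return word, sum(counts)

def word_count(lines):
    """Run the map step on every line, group the pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            grouped[word].append(one)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())

result = word_count(["big data is big", "data is everywhere"])
print(result)
```

In a real framework the grouping in the middle is done for you, and the map and reduce calls run on different machines. Only the two small functions are written by the user.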

Advantages of MapReduce Procedures

The MapReduce Framework usually runs on distributed servers and executes various tasks in parallel. Several components manage the communication between the nodes and provide high availability and fault tolerance. Programs written in the MapReduce functional style are automatically parallelized and executed on commodity machines. The MapReduce Framework takes care of the details of partitioning the data and executing the processes on the distributed servers at run time. If a node fails during this process, the framework keeps the job available and other healthy nodes take over the work of the failed node.
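The failover idea can be sketched in a few lines of Python. This is only a simulation under stated assumptions: the node names, the failed_nodes set and the helper functions run_on_node and run_with_failover are all invented for this example, and a real framework detects failures through heartbeats and timeouts rather than a lookup in a set.

```python
def run_on_node(node, task, failed_nodes):
    """Pretend to run a task on a node; raise if that node is down (simulated)."""
    if node in failed_nodes:
        raise ConnectionError(f"{node} is unreachable")
    return sum(task)  # the "work" here is just summing one partition

def run_with_failover(tasks, nodes, failed_nodes):
    """Assign each task to a node and move it to the next node if the first one fails."""
    results = {}
    for task_id, task in enumerate(tasks):
        for attempt in range(len(nodes)):
            node = nodes[(task_id + attempt) % len(nodes)]
            try:
                results[task_id] = (node, run_on_node(node, task, failed_nodes))
                break
            except ConnectionError:
                continue  # reschedule the same task on the next node
        else:
            raise RuntimeError(f"no healthy node left for task {task_id}")
    return results

tasks = [[1, 2, 3], [4, 5], [6]]
nodes = ["node-a", "node-b", "node-c"]
outcome = run_with_failover(tasks, nodes, failed_nodes={"node-b"})
print(outcome)
```

The task that was first sent to the failed node still completes, because it is simply run again elsewhere. This works because each task depends only on its own partition of the data.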

As you can clearly see, the MapReduce Framework provides much more than just the Map() and Reduce() procedures; it provides scalability and fault tolerance as well. A typical implementation of the MapReduce Framework processes many petabytes of data across thousands of processing machines.

How Does the MapReduce Framework Work?

A typical MapReduce Framework deployment handles petabytes of data on thousands of nodes. Here is a basic explanation of the MapReduce procedures that make use of this massive pool of commodity servers.

Map() Procedure

There is always a master node in this infrastructure which takes the input. Right after taking the input, the master node divides it into smaller sub-inputs or sub-problems. These sub-problems are distributed to worker nodes. Each worker node processes its sub-problem and performs the necessary analysis. Once a worker node has finished its sub-problem, it returns the result to the master node.

Reduce() Procedure

All the worker nodes return the answers to the sub-problems assigned to them to the master node. The master node collects the answers and aggregates them into the answer to the original big problem that was assigned to the master node.

The MapReduce Framework runs the above Map() and Reduce() procedures in parallel and independently of each other. All the Map() procedures can run in parallel, and once each worker node has completed its task it sends the result back to the master node, which compiles the results into a single answer. This procedure can be very effective when it is applied to a very large amount of data (Big Data).
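The master and worker roles described above can be imitated on one machine with a thread pool from the Python standard library. This is a rough sketch, not a distributed system: threads stand in for worker nodes, and the function names split, worker and master are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, parts):
    """Master: divide the input into roughly equal sub-problems."""
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]

def worker(chunk):
    """Worker: solve one sub-problem (here, a partial sum of squares)."""
    return sum(x * x for x in chunk)

def master(data, workers=4):
    """Master: hand out the sub-problems, then aggregate the partial answers."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker, chunks))
    return sum(partials)

total = master(list(range(1, 11)))
print(total)
```

The workers never talk to each other, which is exactly what lets them run in parallel. Only the master sees all the partial answers.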

The MapReduce Framework has five different steps:

  • Preparing Map() Input
  • Executing User Provided Map() Code
  • Shuffle Map Output to Reduce Processor
  • Executing User Provided Reduce Code
  • Producing the Final Output
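The five steps above can be walked through with a small Python sketch that finds the highest temperature per city. The sample readings and every function name here (prepare_input, map_reading, shuffle, reduce_max, run_job) are made up for illustration, and everything runs in one process instead of on a cluster.

```python
from collections import defaultdict

def prepare_input(raw):
    """Step 1: prepare the Map() input by turning raw text into records."""
    return [line for line in raw.strip().splitlines() if line.strip()]

def map_reading(record):
    """Step 2: user-provided map code; emit a (city, temperature) pair."""
    city, temp = record.split(",")
    return [(city.strip(), float(temp))]

def shuffle(pairs):
    """Step 3: shuffle the map output so each key's values end up together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_max(city, temps):
    """Step 4: user-provided reduce code; keep the highest reading per city."""
    return city, max(temps)

def run_job(raw):
    """Step 5: run all the steps and produce the final output."""
    records = prepare_input(raw)
    mapped = [pair for record in records for pair in map_reading(record)]
    grouped = shuffle(mapped)
    reduced = [reduce_max(city, temps) for city, temps in grouped.items()]
    return dict(sorted(reduced))

raw = """
Pune, 31.5
Delhi, 38.0
Pune, 34.0
Delhi, 36.5
"""
hottest = run_job(raw)
print(hottest)
```

Note that only steps 2 and 4 contain logic specific to this problem. The other three steps would look the same for almost any job, which is why a framework can take them over.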

Here is the Dataflow of MapReduce Framework:

  • Input Reader
  • Map Function
  • Partition Function
  • Compare Function
  • Reduce Function
  • Output Writer

In a future blog post of this 21 day series we will explore various components of MapReduce in detail.

MapReduce in a Single Statement

MapReduce is equivalent to SELECT and GROUP BY of a relational database for a very large database.
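This comparison is an analogy rather than a strict equivalence, but it is easy to check on a toy data set. The sketch below uses Python's built-in sqlite3 module with a made-up sales table, and computes the same per-region totals twice: once with SELECT and GROUP BY, and once in the map and reduce style.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount) rows.
rows = [("east", 100), ("west", 250), ("east", 50), ("west", 25), ("north", 70)]

# Relational version: SELECT ... GROUP BY on an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
conn.close()

# MapReduce-style version: map emits (region, amount), reduce sums per region.
grouped = defaultdict(list)
for region, amount in rows:          # map + shuffle
    grouped[region].append(amount)
mr_totals = {region: sum(amounts) for region, amounts in grouped.items()}  # reduce

print(sql_totals == mr_totals)
```

The choice of key in the map step plays the role of the GROUP BY column, and the reduce step plays the role of the aggregate function such as SUM.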

Tomorrow

In tomorrow’s blog post we will discuss Buzz Word – HDFS.

Reference: Pinal Dave (http://blog.sqlauthority.com)

8 thoughts on “Big Data – Buzz Words: What is MapReduce – Day 7 of 21

  1. Very well presented. I might have a very good use case coming close to the end of the 21 days. Pinal D. for President!

  2. Really impressive lines “MapReduce is equivalent to SELECT and GROUP BY of a relational database for a very large database.”

  3. “MapReduce is equivalent to SELECT and GROUP BY of a relational database for a very large database.” – this line makes the process more visualize as always we are in touch with these key word.
    Tip: it would better to read if you link each post with previous and next.

    Thanks again,
    Suman

  4. Hi Pinal,in transactional replication whats the value i need to give for @schema_option in order to create the stored procedures for insert,update,delete at subscriber.
