


MapReduce is a programming model for processing and generating large data sets in the form of (key, value) pairs. Two kinds of functions are used in MapReduce. A user-specified map function processes input (key, value) pairs and produces a set of intermediate (key, value) pairs. The job of the reduce function is to combine all the intermediate values associated with the same intermediate key once the map function has finished (Dean & Ghemawat, 2008; Lammel, 2008).

A MapReduce program consists of a map() function and a reduce() function. The map() function performs filtering and sorting over the data, for example, sorting the names of students into queues, one queue per row. The reduce() function carries out a summary operation over the resulting data, for example, counting the number of students present in each row (Lammel, 2008). The MapReduce system performs best over a distributed network, running several tasks in parallel, managing data and control transfer between the different units of the system, and providing fault tolerance.
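The row-counting example above can be sketched in a few lines of Python. The records, the map_fn/reduce_fn names, and the in-memory grouping loop are all illustrative assumptions; a real system would distribute this across nodes.

```python
from collections import defaultdict

# Hypothetical input: (student_name, row_number) records.
records = [("Ava", 1), ("Ben", 2), ("Cho", 1), ("Dev", 3), ("Eli", 2)]

def map_fn(name, row):
    # Filter/sort phase: emit one intermediate (key, value) pair per student,
    # keyed by the row the student sits in.
    return [(row, 1)]

def reduce_fn(row, counts):
    # Summary phase: total number of students seen in this row.
    return (row, sum(counts))

# Group intermediate pairs by key (the shuffle), then reduce each group.
groups = defaultdict(list)
for name, row in records:
    for k, v in map_fn(name, row):
        groups[k].append(v)

result = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(result)  # students per row: {1: 2, 2: 2, 3: 1}
```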

Many languages provide libraries for the MapReduce framework. Apache Hadoop provides an open-source implementation of MapReduce over a distributed network. "MapReduce" was originally a proprietary Google technology, but the term has since been genericized. MapReduce is a framework capable of processing parallelizable problems over huge data sets with the help of a large number of computers, referred to as a cluster or grid. MapReduce can process data in any form, whether structured (database) or unstructured (file system), and it takes advantage of data locality (Lammel, 2008).

  • Map Step: Local data is processed by the map() function, which is applied on each worker node. If the same input data is present in two instances, that data is processed only once.
  • Shuffle Step: Data belonging to one key is redistributed to one worker node, so that each worker node holds all the data related to one key.
  • Reduce Step: Each worker node now processes the data belonging to its key, with the nodes working in parallel.
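The three steps above can be simulated in memory; this is only a sketch in which the chunks list stands in for per-node local data, and sorting plus groupby stands in for the network shuffle.

```python
from itertools import groupby
from operator import itemgetter

# Input split across hypothetical worker nodes (one chunk of lines per node).
chunks = [["the cat sat"], ["the hat", "the cat"]]

# Map step: each node emits (word, 1) pairs from its local chunk.
mapped = [(w, 1) for chunk in chunks for line in chunk for w in line.split()]

# Shuffle step: redistribute so all pairs sharing a key end up together.
mapped.sort(key=itemgetter(0))
shuffled = {k: [v for _, v in grp] for k, grp in groupby(mapped, key=itemgetter(0))}

# Reduce step: each node sums the counts for the keys it was assigned.
counts = {word: sum(vals) for word, vals in shuffled.items()}
print(counts)  # {'cat': 2, 'hat': 1, 'sat': 1, 'the': 3}
```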

Map and reduce operations are performed in a distributed environment. In the map() construct, the mapping operations have no relation to each other, i.e., they are mutually independent. Being independent, they can be parallelized, so map() operations can be performed simultaneously. The degree of parallelism, however, depends on the number of independent data sources and on the number of CPUs available for processing. Similarly, reducers perform their function efficiently only if all of the mapper output belonging to the same intermediate key is provided to the same reducer node. MapReduce can be applied to very large amounts of data; for example, the framework can sort terabytes or petabytes of data in ascending or descending order within a few hours. The parallelism of the framework also provides fault tolerance and reliability in processing: if one mapper or reducer stops functioning properly, processing can still continue by rescheduling its work.
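Because the mappers are mutually independent, they can be dispatched concurrently. A minimal sketch using Python's standard concurrent.futures (threads here merely stand in for worker nodes; the lines list and map_line function are assumed names):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input: one line of text per independent data source.
lines = ["big data", "map reduce", "big map"]

def map_line(line):
    # Independent mapper: touches no shared state, so it is safe to run
    # concurrently with every other mapper.
    return Counter(line.split())

# Run all mappers in parallel; order of completion does not matter.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_line, lines))

# Reduce: merge partial counts that share the same key.
total = Counter()
for partial in partials:
    total.update(partial)
print(dict(total))  # {'big': 2, 'data': 1, 'map': 2, 'reduce': 1}
```

If a mapper failed here, re-running map_line on its line would reproduce its partial result exactly, which is the property the framework's rescheduling relies on.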

MapReduce can also be understood as a 5-step parallel and distributed computation (Lammel, 2008):

  • 1. Prepare the map() Input: Each map processor is assigned an input key value K1, together with all the data associated with that key value, which that map processor will work on.
  • 2. Run the User-Provided map() Code: The user provides the code for the map() function depending on the problem statement. map() executes exactly once for each key value K1 and generates intermediate (key, value) pairs, which are organized by key value K2 for the reduce processors.
  • 3. Shuffle the map() Output to the Reduce Processors: The output data produced by the map processors is redistributed to the reducers so that each reducer holds the data belonging to one key.
  • 4. Run the User-Provided reduce() Code: The user provides the code for the reduce() function based on the problem statement. reduce() runs exactly once for each key value K2.
  • 5. Produce the Final Output: The last step is to collect the output of each reduce processor and sort it by the key value K2.
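The five steps can be written as a small generic driver; this is an in-memory sketch (the mapreduce function name and the word-count example are illustrative, not part of any particular framework's API):

```python
from collections import defaultdict

def mapreduce(inputs, map_fn, reduce_fn):
    """Run the 5-step model over a list of (K1, v1) inputs."""
    # Step 1: prepare the map() input -- one (K1, v1) pair per map processor.
    intermediate = defaultdict(list)
    for k1, v1 in inputs:
        # Step 2: run user map() exactly once per K1, emitting (K2, v2) pairs.
        for k2, v2 in map_fn(k1, v1):
            # Step 3: shuffle -- group every (K2, v2) pair by K2.
            intermediate[k2].append(v2)
    # Step 4: run user reduce() exactly once per K2.
    results = {k2: reduce_fn(k2, vs) for k2, vs in intermediate.items()}
    # Step 5: collect the outputs, sorted by K2.
    return dict(sorted(results.items()))

out = mapreduce(
    [(0, "a b a"), (1, "b c")],                      # K1 = chunk id
    lambda k1, line: [(w, 1) for w in line.split()], # K2 = word
    lambda k2, vals: sum(vals),
)
print(out)  # {'a': 2, 'b': 2, 'c': 1}
```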

The five steps described above are performed in sequence, i.e., the next step cannot start until the previous step completes. However, some steps can be interleaved, as long as the final result remains the same.

In some cases the input data is already partitioned among different servers, so step 1 is sped up: there is no need to prepare the input, since each map processor can process the data available at its own site. Similarly, step 3 can be performed more efficiently by assigning the reduce function to the node closest to the data it needs.

The map and reduce functions of the MapReduce framework are defined in terms of (key, value) pairs. map accepts a pair of data belonging to one domain as input and returns a list of intermediate pairs belonging to an entirely different domain.

A mapper function performs its task simultaneously with the other mapper functions. It processes an input (key, value) pair and produces a list of intermediate (key, value) pairs as its result. Once all mapper functions complete their tasks, the reducer functions combine all the values belonging to the same key, producing one group per key that represents all the values for that key.

The groups output by the map phase are now processed simultaneously by different reducer functions. Each reduce function produces a set of values belonging to the same domain (key).

In this system each map call can return more than one value, whereas each reduce function returns either a value v3 or the empty set. Finally, the results of all the reduce functions are combined to produce the overall result. The MapReduce framework therefore generates a list of (key, value) pairs from a list of values.
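These domain signatures can be made concrete with Python type aliases; the domains chosen here (filenames to contents, words to counts) and the function names are assumptions for illustration. Note how the reduce function may return the empty list to drop a key entirely.

```python
from typing import Callable, List, Tuple

# map:    (K1, v1)        -> list of (K2, v2)   -- different domain out
# reduce: (K2, list(v2))  -> list of v3 (possibly empty)
MapFn = Callable[[str, str], List[Tuple[str, int]]]
ReduceFn = Callable[[str, List[int]], List[int]]

word_map: MapFn = lambda fname, text: [(w, 1) for w in text.split()]
# reduce returns one value v3, or the empty set to discard a key
# (here, a hypothetical stop word).
count_reduce: ReduceFn = lambda word, vals: [sum(vals)] if word != "the" else []

pairs = word_map("doc1", "the big the data")
print(pairs)                       # [('the', 1), ('big', 1), ('the', 1), ('data', 1)]
print(count_reduce("big", [1, 1]))  # [2]
print(count_reduce("the", [1, 1]))  # [] -- empty set is a valid result
```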
