Home Computer Science Applied Big Data Analytics in Operations Management
MapReduce is a programming model which is used for processing and generating large data sets in form of
MapReduce is a program that consists of map() and reduce() function. map() function performs filtering and sorting over data, for example, sorting the names of students into queues. reduce() function carries the summary operation over the resultant data, for example, counting the number of students that are present in each row (Lammel, 2008). The “MapReduce” system gives its best performance over a distributed network, performing several tasks in parallel, managing data and control transfer between different units of system and providing fault tolerance.
There are many languages which provides libraries for mapreduce framework. Apache Hadoop provides an open source support for the implementation of MapReduce program over distributed network. “MapReduce” is a term which was originally a proprietary of Google technology, but later it has been generacized. MapReduce is a framework which is capable of processing parallelizable problems which deals with huge datasets, with the help of large number of computers, referred to as cluster or grid. MapReduce can process data in any form, whether structured (database) or unstructured (file system). It takes the advantage of locality of data (Lammel, 2008).
Map and reduce operations are performed in distributed environment. In map() construct, mapping operations do not have any relation with each other, i.e., they are mutually independent. Being independent of each other, they can be parallelized. Hence, map() operations can be performed simultaneously. However parallel computation depends on the number of independent data sources or the number of CPUs available for processing. Similarly, reducers perform their function efficiently, only if output of the mapper function that belongs to the same intermediate key is provided to the same reducer node. MapReduce can be applied to a significantly large amount of data, for example, MapReduce framework can be used to sort terabyte/petabyte of data in ascending/descending order in few hours. The parallelism of framework also provides fault tolerance and reliability in processing: if one mapper or reducer goes out of proper functioning, processing can still be continued by rescheduling the work.
MapReduce can also be understood as a 5-step parallel and distributed computation (Lammel, 2008):
The five steps described above are performed in sequence. i.e., next step cannot be started until and unless previous step completes. However, some intermediate steps can be interspersed only if final result remains the same.
In some cases, it is often found that input data is already partitioned among different servers, so step 1 is sped up because it is not needed to prepare the input, as map processor can process the data which is available on its own site. Similarly step 3 can also be performed efficiently by allocating reduce function to the data closely available to it.
Map and reduce function of MapReduce framework are defined in terms of
A Mapper function performs its task simultaneously with other mapper functions. It processes input
Groups are now processed by different reducer function, which was given as output by map function, simultaneously. Reduce function produces a set of values that belong to same domain (key).
In the above system all the calls can return more than one value, whereas reduce functions return either value v3 or empty set. Finally, the result of all the reduce functions are combined to produce overall result. Therefore the MapReduce framework generates a list of
|< Prev||CONTENTS||Next >|