Wherever Big Data is mentioned, it is definitely followed by the word MapReduce As if by mentioning the two keywords, the companies have adapted into the new level mysterious Data Science or Big Data aspect. I am gonna try and get down to the understandable list of frequently asked questions on this topic.
MapReduce is programming model for computation of Big Data in distributed parallel servers. MapReduce can be seen as a simplified access and process of our distributed databases.
MapReduce is basically a software programming model / software framework, which allows us to process data in parallel across multiple computers in a cluster, often running on commodity hardware, in a reliable and fault-tolerant fashion.
What is the main functionality of MapReduce?
In a nutshell, MapReduce consists of two main functionality: to Map() and to Reduce().
- Map functionality takes the data from the local computer and distributes it to many servers for computation and storage. The Map Task in MapReduce is performed using the Map() function. This part of the MapReduce is responsible for processing one or more chunks of data and producing the output results.
- The next part / component / stage of the MapReduce programming model is the Reduce() function. This part of the MapReduce is responsible for consolidating the results produced by each of the Map() functions/tasks.Reduce functionality aggregates the result of computation from many servers back into the local computer.
- A Job in the context of Hadoop MapReduce is the unit of work to be performed as requested by the client / user. The information associated with the Job includes the data to be processed (input data), MapReduce logic / program / algorithm, and any other relevant configuration information necessary to execute the Job.
- Hadoop MapReduce divides a Job into multiple sub-jobs known as Tasks. These tasks can be run independently of each other on various nodes across the cluster. There are primarily two types of Tasks – Map Tasks and Reduce Tasks.
- Just like the storage (HDFS), the computation (MapReduce) also works in a master-slave / master-worker fashion. A JobTracker node acts as the Master and is responsible for scheduling / executing Tasks on appropriate nodes, coordinating the execution of tasks, sending the information for the execution of tasks, getting the results back after the execution of each task, re-executing the failed Tasks, and monitors / maintains the overall progress of the Job. Since a Job consists of multiple Tasks, a Job’s progress depends on the status / progress of Tasks associated with it. There is only one JobTracker node per Hadoop Cluster.
- A TaskTracker node acts as the Slave and is responsible for executing a Task assigned to it by the JobTracker. There is no restriction on the number of TaskTracker nodes that can exist in a Hadoop Cluster. TaskTracker receives the information necessary for the execution of a Task from JobTracker, Executes the Task, and Sends the Results back to JobTracker.
- MapReduce tries to place the data and the compute as close as possible. First, it tries to put the compute on the same node where data resides, if that cannot be done (due to reasons like compute on that node is down, compute on that node is performing some other computation, etc.), then it tries to put the compute on the node nearest to the respective data node(s) which contains the data to be processed. This feature of MapReduce is “Data Locality“.
What makes MapReduce better than the traditional distributed computing?
MapReduce provides a higher level of abstraction. Programmers do not need to handle system level details such as synchronization, shared memory, deadlock and race and many others that usually exists in the traditional distributed computing such as OpenMP, MPI etc.
What are the most common programming languages to implement MapReduce?
Java, Python, C++.
Who developed MapReduce?
Programmer team in Google.
What is the connection between MapReduce and Hadoop?
To implement MapReduce model, you can use Apache Hadoop. Use Pig (for dataflow style) and Hive (for SQL style) for processing in Hadoop ecosystem. You can also use Mahout for machine learning / data mining algorithm.
Where to get MapReduce?
See this website maybe… Apache Hadoop.
What is the basic data structure accepted in MapReduce?
You can use the primitives such as integers, floating point, strings, bytes, or more complex data structure such as lists, tuples, associative arrays, or you can also build your own custom data type. For the simplest way of thinking, it is easier if you think the basic data structure in term of a list of the key-value pair (key, value).
How to use MapReduce for my specific applications?
The following diagram shows the logical flow of a MapReduce programming model.
- Input: This is the input data / file to be processed.
- Split: Hadoop splits the incoming data into smaller pieces called “splits”.
- Map: In this step, MapReduce processes each split according to the logic defined in the map() function. Each mapper works on each split at a time. Each mapper is treated as a task and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
- Combine: This is an optional step and is used to improve the performance by reducing the amount of data transferred across the network. The combiner is the same as the reduce step and is used for aggregating the output of the map() function before it is passed to the subsequent steps.
- Shuffle & Sort: In this step, outputs from all the mappers is shuffled, sorted to put them in order, and grouped before sending them to the next step.
- Reduce: This step is used to aggregate the outputs of mappers using the reduce() function. The output of reducer is sent to the next and final step. Each reducer is treated as a task and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
- Output: Finally the output of reducing step is written to a file in HDFS.
Can you give me an example?
Assume that we have a file which contains the following four lines of text.
In this file, we need to count the number of occurrences of each word. For instance, DW appears twice, BI appears once, SSRS appears twice, and so on. Let us see how this counting operation is performed when this file is input to MapReduce.
Below is a simplified representation of the data flow for Word Count Example.
- Input: In this step, the sample file is input to MapReduce.
- Split: In this step, Hadoop splits / divides our sample input file into four parts, each part made up of one line from the input file. Note that, for the purpose of this example, we are considering one line as each split. However, this is not necessarily true in a real-time scenario.
- Map: In this step, each split is fed to a mapper which is the map() function containing the logic on how to process the input data, which in our case is the line of text present in the split. For our scenario, the map() function would contain the logic to count the occurrence of each word and each occurrence is captured / arranged as a (key, value) pair, which in our case is like (SQL, 1), (DW, 1), (SQL, 1), and so on.
- Combine: This is an optional step and is often used to improve the performance by reducing the amount of data transferred across the network. This is essentially the same as the reducer (reduce() function) and acts on output from each mapper. In our example, the key value pairs from the first mapper “(SQL, 1), (DW, 1), (SQL, 1)” are combined and the output of the corresponding combiner becomes “(SQL, 2), (DW, 1)”.
- Shuffle and Sort: In this step, the output of all the mappers is collected, shuffled, and sorted and arranged to be sent to the reducer.
- Reduce: In this step, the collective data from various mappers, after being shuffled and sorted, is combined / aggregated and the word counts are produced as (key, value) pairs like (BI, 1), (DW, 2), (SQL, 5), and so on.
- Output: In this step, the output of the reducer is written to a file on HDFS. The following image is the output of our word count example.
Why opt for Hadoop MapReduce?
Here are few highlights of MapReduce programming model in Hadoop:
- MapReduce works in a master-slave / master-worker fashion. JobTracker acts as the master and TaskTrackers act as the slaves.
- MapReduce has two major phases – A Map phase and a Reduce phase. Map phase processes parts of input data using mappers based on the logic defined in the map() function. The Reduce phase aggregates the data using a reducer based on the logic defined in the reduce() function.
- Depending upon the problem at hand, we can have One Reduce Task, Multiple Reduce Tasks or No Reduce Tasks.
- MapReduce has built-in fault tolerance and hence can run on commodity hardware.
- MapReduce takes care of distributing the data across various nodes, assigning the tasks to each of the nodes, getting the results back from each node, re-running the task in case of any node failures, consolidation of results, etc.
- MapReduce processes the data in the form of (Key, Value) pairs. Hence, we need to fit out the business problem in this Key-Value arrangement.