Friday, February 23, 2018

Introduction to MapReduce

MapReduce is the processing layer of Hadoop. The MapReduce programming model is designed to process large volumes of data in parallel by dividing the work into a set of independent tasks. You only need to supply the business logic in the form MapReduce expects; the framework takes care of the rest. The work (the complete job) submitted by the user to the master is divided into small pieces of work (tasks) and assigned to the slaves.
It carries out the data processing and distributes the individual tasks across the nodes of the cluster. It consists of two phases –
  • Map
  • Reduce
Map converts a given dataset into another set of data in which individual elements are broken down into key/value pairs.
The Reduce task takes the output files of a map as its input and combines those data tuples into a smaller set of tuples. It is always executed after the map job is done.

Features of the MapReduce system
The features of MapReduce are as follows:
  • Provides a framework for MapReduce execution.
  • Abstracts the developer from the complexity of distributed programming.
  • Partial failure of the processing cluster is expected and tolerated.
  • Built-in redundancy and fault tolerance.
  • The MapReduce programming model is language independent.
  • Automatic parallelization and distribution of work.
  • Enables data-local processing (the computation moves to where the data is stored).
  • Shared-nothing architectural model.
  • Manages all inter-process communication.
  • Manages, in parallel, the distributed servers that run the various tasks.
  • Manages all communication and data transfers between the parts of the system.
  • Provides redundancy and failure handling for the overall process.

MapReduce follows these simple steps (a small worked illustration appears after the list):

  1. Execute the map function on each input record received.
  2. The map function emits key/value pairs.
  3. Shuffle, sort, and group the map outputs by key.
  4. Execute the reduce function on each group.
  5. Emit the final results, one output per group.
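As a small worked illustration (the input lines and words here are made up for the example), a word-count job over two input records would flow through the steps like this:

Input records:            (0, "hello world"), (12, "hello hadoop")
After map:                ("hello", 1), ("world", 1), ("hello", 1), ("hadoop", 1)
After shuffle/sort/group: ("hadoop", [1]), ("hello", [1, 1]), ("world", [1])
After reduce:             ("hadoop", 1), ("hello", 2), ("world", 1)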

Map Function

The map function operates on each key/value pair of the data and transforms it according to the transformation logic provided in the map function. The map function always produces key/value pairs as its output.
Map (key1, value1) ->List (key2, value2)
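For example, applied to the first record of the illustration above (with the byte offset of the line as the input key), a word-count map call would produce:
Map (0, "hello world") -> List (("hello", 1), ("world", 1))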

Reduce Function

It takes the list of values for each key and transforms the data according to the (aggregation) logic provided in the reduce function.
Reduce (key2, List (value2)) ->List (key3, value3)
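Continuing the same illustration, the reduce call for the key "hello" receives the grouped list of values and sums them:
Reduce ("hello", List (1, 1)) -> List (("hello", 2))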
Map Function for Word Count
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);      // emit (word, 1) for every token in the line
    }
}

Reduce Function for Word Count

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();              // add up the 1s emitted for this word
    }
    context.write(key, new IntWritable(sum));
}
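In a real Hadoop program, the map and reduce methods above live inside Mapper and Reducer subclasses and are wired together by a driver that configures and submits the job. Below is a minimal driver sketch; the class names WordCountMapper, WordCountReducer and WordCountDriver are assumed for this example and are not part of the original listing.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // assumed Mapper subclass holding the map() above
        job.setReducerClass(WordCountReducer.class);   // assumed Reducer subclass holding the reduce() above
        job.setOutputKeyClass(Text.class);             // output key type: the word
        job.setOutputValueClass(IntWritable.class);    // output value type: the count
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit the job and wait for it to finish
    }
}

Such a driver would typically be packaged into a jar and launched with hadoop jar wordcount.jar WordCountDriver <input> <output>.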

MapReduce is the framework used for processing large amounts of data on commodity hardware across a large cluster. MapReduce is a powerful method of processing data when there are many nodes connected to the cluster. The two important tasks of the MapReduce algorithm are Map and Reduce.
The purpose of the Map task is to take a large set of data and convert it into another set of data that is broken down into tuples (rows) or key/value pairs. The Reduce task then takes the output of the Map task as its input and converts those data tuples into a much smaller set of tuples. The Reduce task always follows the Map task.
The biggest strength of the MapReduce framework is its scalability. Once a MapReduce program is written, it can easily be extended to work over a cluster that has hundreds or even thousands of nodes. In this framework, the computation is sent to where the data resides.

Terminology

PayLoad – The applications that implement the Map and Reduce functions.
Mapper – Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode – The node that manages the HDFS (Hadoop Distributed File System).
DataNode – The node where the data resides before any processing takes place.
MasterNode – The node where the JobTracker runs and which receives job requests from clients.
SlaveNode – The node where the Map and Reduce programs run.
JobTracker – Schedules jobs and assigns them to Task Trackers, tracking their progress.
Task Tracker – Tracks the task and reports its status to the JobTracker.
Job – An execution of a Mapper and Reducer across a dataset.
Task – An execution of a Mapper or a Reducer on a slice of data.
Task Attempt – A particular instance of an attempt to execute a task on a SlaveNode.