Friday, November 2, 2018

Creating A GraphQL Server With Node.js And Express


GraphQL is a query language that enables you to provide a complete and understandable description of the data in your API. Furthermore, it gives clients the power to ask for exactly what they need and nothing more. The project’s website can be found at http://graphql.org/.
There are several advantages of GraphQL.
GraphQL is declarative: Query responses are decided by the client rather than the server. A GraphQL query returns exactly what a client asks for and no more.
GraphQL is compositional: A GraphQL query itself is a hierarchical set of fields. The query is shaped just like the data it returns. It is a natural way for product engineers to describe data requirements.
GraphQL is strongly-typed: A GraphQL query can be ensured to be valid within a GraphQL type system at development time, allowing the server to make guarantees about the response. This makes it easier to build high-quality client tools.
In this tutorial you’ll learn how to set up a GraphQL server with Node.js and Express. We’ll be using the Express middleware express-graphql in our example. Furthermore, you’ll learn how to use GraphQL on the client side to send queries and mutations to the server.
Let’s get started …
Setting Up The Project
To set up a GraphQL Node.js server, let’s start by creating a new empty project folder:
$ mkdir gql-server
Change into that directory and initialize a new package.json file by executing the following NPM command:
$ npm init
Furthermore create a new server.js file in the project directory. That will be the file where the code required to implement the Node.js GraphQL server will be inserted in the next section:
$ touch server.js
Finally make sure that the NPM packages graphql, express, and express-graphql are added to the project:
$ npm install graphql express express-graphql --save
Having installed these packages successfully we’re now ready to implement a first GraphQL server.
Creating A Basic GraphQL Server With Express
Now that the project setup is ready, let’s create a first server implementation by inserting the following JS code in server.js:
var express = require('express');
var express_graphql = require('express-graphql');
var { buildSchema } = require('graphql');
// GraphQL schema
var schema = buildSchema(`
    type Query {
        message: String
    }
`);
// Root resolver
var root = {
    message: () => 'Hello World!'
};
// Create an express server and a GraphQL endpoint
var app = express();
app.use('/graphql', express_graphql({
    schema: schema,
    rootValue: root,
    graphiql: true
}));
app.listen(4000, () => console.log('Express GraphQL Server Now Running On localhost:4000/graphql'));
First we’re making sure that express, express-graphql, and the buildSchema function from the graphql package are imported. Next we’re creating a simple GraphQL schema by using the buildSchema function.
To create the schema we’re calling the function and passing in a string that contains the IDL (GraphQL Interface Definition Language) code which is used to describe the schema. A GraphQL schema describes the complete type system of the API. It includes the complete set of data and defines how a client can access that data. Each time the client makes an API call, the call is validated against the schema. Only if the validation is successful is the action executed; otherwise an error is returned.
Next a root resolver is created. A resolver contains the mapping of actions to functions. In our example from above the root resolver contains only one action: message. To keep things easy the assigned function just returns the string Hello World!. Later on you’ll learn how to include multiple actions and assign different resolver functions.
Finally the Express server is created with a GraphQL endpoint: /graphql. To create the GraphQL endpoint first a new express instance is stored in app. Next the app.use method is called and two parameters are provided:
·         First the URL endpoint as string
·         Second the result of the express_graphql function is handed over. A configuration object is passed into the call of express_graphql containing three properties
The three configuration properties which are used for the Express GraphQL middleware are the following:
·         schema: The GraphQL schema which should be attached to the specific endpoint
·         rootValue: The root resolver object
·         graphiql: Must be set to true to enable the GraphiQL tool when accessing the endpoint in the browser. GraphiQL is a graphical interactive in-browser GraphQL IDE. By using this tool you can directly write your queries in the browser and try out the endpoint.
Finally app.listen is called to start the server process on port 4000.
The Node.js server can be started by executing the following command in the project directory:
$ node server.js
Having started the server process you should be able to see the output
Express GraphQL Server Now Running On localhost:4000/graphql
on the command line. If you access localhost:4000/graphql in the browser you should be able to see the following result:
In the query editor type in the following code:
{
    message
}
Next hit the Execute Query button and you should be able to see the following result:
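The response should contain a data object with the message field set to Hello World!.
GraphiQL is only a convenience; the same query can also be sent as a plain HTTP POST. Here is a minimal sketch using curl (the JSON body shape shown is what express-graphql accepts):
$ curl -X POST http://localhost:4000/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ message }"}'
The server should answer with {"data":{"message":"Hello World!"}}.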
Implementing A More Sophisticated Example
Now that you have a basic understanding of how to implement a GraphQL server with Node.js and Express, let’s continue with a more sophisticated example. Add a new JS file to the project:
$ touch server2.js
Next let’s add the following implementation:
var express = require('express');
var express_graphql = require('express-graphql');
var { buildSchema } = require('graphql');
// GraphQL schema
var schema = buildSchema(`
    type Query {
        course(id: Int!): Course
        courses(topic: String): [Course]
    },
    type Course {
        id: Int
        title: String
        author: String
        description: String
        topic: String
        url: String
    }
`);
var coursesData = [
    {
        id: 1,
        title: 'The Complete Node.js Developer Course',
        author: 'Andrew Mead, Rob Percival',
        description: 'Learn Node.js by building real-world applications with Node, Express, MongoDB, Mocha, and more!',
        topic: 'Node.js',
        url: 'https://codingthesmartway.com/courses/nodejs/'
    },
    {
        id: 2,
        title: 'Node.js, Express & MongoDB Dev to Deployment',
        author: 'Brad Traversy',
        description: 'Learn by example building & deploying real-world Node.js applications from absolute scratch',
        topic: 'Node.js',
        url: 'https://codingthesmartway.com/courses/nodejs-express-mongodb/'
    },
    {
        id: 3,
        title: 'JavaScript: Understanding The Weird Parts',
        author: 'Anthony Alicea',
        description: 'An advanced JavaScript course for everyone! Scope, closures, prototypes, this, build your own framework, and more.',
        topic: 'JavaScript',
        url: 'https://codingthesmartway.com/courses/understand-javascript/'
    }
]
var getCourse = function(args) {
    var id = args.id;
    return coursesData.filter(course => {
        return course.id == id;
    })[0];
}
var getCourses = function(args) {
    if (args.topic) {
        var topic = args.topic;
        return coursesData.filter(course => course.topic === topic);
    } else {
        return coursesData;
    }
}
var root = {
    course: getCourse,
    courses: getCourses
};
// Create an express server and a GraphQL endpoint
var app = express();
app.use('/graphql', express_graphql({
    schema: schema,
    rootValue: root,
    graphiql: true
}));
app.listen(4000, () => console.log('Express GraphQL Server Now Running On localhost:4000/graphql'));
Ok, let’s examine the code step by step. First we’re defining a schema which now consists of a custom type Course and two query actions.
The Course object type consists of six properties in total. The defined query actions enable the user to retrieve a single course by ID or to retrieve an array of Course objects by course topic.
To be able to return data without the need to connect to a database we’re defining the coursesData array with some dummy course data inside.
In the root resolver we’re connecting the course query action to the getCourse function and the courses query action to the getCourses function.
Accessing The GraphQL API
Now let’s start the Node.js server process again and execute the code from file server2.js with the following command:
$ node server2.js
If you’re opening up URL localhost:4000/graphql in the browser you should be able to see the GraphiQL web interface, so that you can start typing in queries. First let’s retrieve one single course from our GraphQL endpoint. Insert the following query code:
query getSingleCourse($courseID: Int!) {
    course(id: $courseID) {
        title
        author
        description
        topic
        url
    }
}
The getSingleCourse query operation expects one parameter: $courseID of type Int. By using the exclamation mark we’re specifying that this parameter must be provided.
Within getSingleCourse we’re executing the course query for this specific ID. We’re specifying that we’d like to retrieve the title, author, description, topic, and url of that specific course.
Because the getSingleCourse query operation uses a dynamic parameter we need to supply the value of this parameter in the Query Variables input field as well:
{
    "courseID":1
}
Click on the execute button and you should be able to see the following result:
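Query variables are not limited to GraphiQL. When you call the endpoint over plain HTTP, the variables travel in the same JSON body as the query. A minimal sketch with curl (again, the body shape is what express-graphql should accept):
$ curl -X POST http://localhost:4000/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "query getSingleCourse($courseID: Int!) { course(id: $courseID) { title author topic url } }", "variables": {"courseID": 1}}'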
Using Aliases & Fragments
You’re able to include multiple queries in one query operation. In the following example the getCourseWithFragments query operation contains two queries for single courses. To distinguish between both queries we’re assigning aliases: course1 and course2.
query getCourseWithFragments($courseID1: Int!, $courseID2: Int!) {
      course1: course(id: $courseID1) {
             ...courseFields
      },
      course2: course(id: $courseID2) {
            ...courseFields
      }
}
fragment courseFields on Course {
  title
  author
  description
  topic
  url
}
As you can see, the query operation requires two parameters: courseID1 and courseID2. The first ID is used for the first query and the second ID is used for the second query.
Another feature which is used here is a fragment. By using a fragment we’re able to avoid repeating the same set of fields in multiple queries. Instead we’re defining a reusable fragment named courseFields and specifying which fields are relevant for both queries in one place.
Before executing the query operation we need to assign values to the parameters:
{
    "courseID1":1,
    "courseID2":2
}
The result should look like the following:
Creating And Using Mutations
So far we’ve only seen examples which fetch data from our GraphQL server. With GraphQL we’re also able to modify data by using mutations. To be able to use a mutation with our GraphQL server we first need to add code to our server implementation in server2.js:
// GraphQL schema
var schema = buildSchema(`
    type Query {
        course(id: Int!): Course
        courses(topic: String): [Course]
    },
    type Mutation {
        updateCourseTopic(id: Int!, topic: String!): Course
    }
    type Course {
        id: Int
        title: String
        author: String
        description: String
        topic: String
        url: String
    }
`);
Here you can see that the schema now contains a Mutation type as well. The mutation which is defined is named updateCourseTopic and takes two mandatory parameters: id and topic. The return type of that mutation is Course.
Using that mutation it is possible to change the topic of a specific course. In the same way as we did before for queries, we’re now assigning a function to the mutation in the root resolver. The function is implemented with the corresponding update logic:
var updateCourseTopic = function({id, topic}) {
    // Update the topic of the matching course in the in-memory data set
    coursesData.forEach(course => {
        if (course.id === id) {
            course.topic = topic;
        }
    });
    // Return the updated course
    return coursesData.filter(course => course.id === id)[0];
}
var root = {
    course: getCourse,
    courses: getCourses,
    updateCourseTopic: updateCourseTopic
};
Now the server is able to handle mutations as well, so let’s try it out in the GraphiQL browser interface again.
A mutation operation is defined by using the mutation keyword followed by the name of the mutation operation. In the following example the updateCourseTopic mutation is included in the operation and again we’re making use of the courseFields fragment.
mutation updateCourseTopic($id: Int!, $topic: String!) {
  updateCourseTopic(id: $id, topic: $topic) {
    ... courseFields
  }
}
The mutation operation is using two dynamic variables so we need to assign the values in the query variables input field as follows:
{
  "id": 1,
  "topic": "JavaScript"
}
By executing this mutation we’re changing the value of the topic property for the course data set with ID 1 from Node.js to JavaScript. As a result we’re getting back the changed course:
Conclusion
GraphQL provides a complete and understandable description of the data in your API and gives clients the power to ask for exactly what they need and nothing more.
In this tutorial you’ve learned how to implement your own GraphQL server with Node.js and Express. By using the Express middleware express-graphql, setting up a GraphQL server is really easy and requires only a few lines of code.


Saturday, September 15, 2018

Apache Kafka

Introduction to Kafka using NodeJs


This is a small article intended for Node.js developers who intend to start implementing a distributed messaging system using Kafka.

I am planning to write a series of articles demonstrating the usage of Kafka and Storm. This article is the first in that series. So let's begin.

1.1 What is Kafka ?

Kafka is a distributed messaging system providing fast, highly scalable and redundant messaging through a pub-sub model. Kafka’s distributed design gives it several advantages. First, Kafka allows a large number of permanent or ad-hoc consumers. Second, Kafka is highly available and resilient to node failures and supports automatic recovery. In real world data systems, these characteristics make Kafka an ideal fit for communication and integration between components of large scale data systems.

The Kafka Documentation has done an excellent job in explaining the entire architecture.

Before moving ahead, I would suggest that the reader go through the following link. It is very important to understand the architecture.

https://kafka.apache.org/intro

1.2 Installing & Running Zookeeper and Kafka

Kafka can be downloaded from the following link. I am using the current stable release i.e. 0.10.1.1.

https://kafka.apache.org/downloads

Download the tar. Un-tar it and then follow the steps below:

Kafka uses ZooKeeper so you need to first start a ZooKeeper server if you don't already have one. Run the following command to start ZooKeeper:

$ bin/zookeeper-server-start.sh config/zookeeper.properties

Now to start kafka run the following command:

$ bin/kafka-server-start.sh config/server.properties

1.3 Creating Kafka Topic and playing with it

Let's create one topic and play with it. Below is the command to create a topic

$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic Posts

Once you create the topic, you can see the available topics with below command:

$ bin/kafka-topics.sh --list --zookeeper localhost:2181

For testing Kafka, we can use the kafka-console-producer to send a message:

$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic Posts

We can consume all the messages of the same topic by creating a consumer as below:

$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic Posts --from-beginning


1.4 Integrating Kafka with NodeJS

Let's create an API in NodeJS which will act as a producer to Kafka. We will then create a consumer in NodeJS which will consume the topic we created above.

We will be using the kafka-node and express modules for our producer.

var express = require('express');
var kafka = require('kafka-node');
var app = express();

Let's add the code to handle JSON in our API.

var bodyParser = require('body-parser')
app.use( bodyParser.json() );       // to support JSON-encoded bodies
app.use(bodyParser.urlencoded({     // to support URL-encoded bodies
  extended: true
}));

Now, in order to create a Kafka producer with a non-keyed partition, you can simply add the following code:

var Producer = kafka.Producer,
    client = new kafka.Client(),
    producer = new Producer(client);
Now let's add some event handlers for our producer. These will help us know the state of the producer.

producer.on('ready', function () {
    console.log('Producer is ready');
});

producer.on('error', function (err) {
    console.log('Producer is in error state');
    console.log(err);
})
Now, before going into producing a message to a Kafka topic, let us first create a simple route and test our API. Add the code below:

app.get('/',function(req,res){
    res.json({greeting:'Kafka Producer'})
});

app.listen(5001,function(){
    console.log('Kafka producer running at 5001')
});
So, now the entire code looks like below:

var express = require('express');
var kafka = require('kafka-node');
var app = express();

var bodyParser = require('body-parser')
app.use( bodyParser.json() );       // to support JSON-encoded bodies
app.use(bodyParser.urlencoded({     // to support URL-encoded bodies
  extended: true
}));

var Producer = kafka.Producer,
    client = new kafka.Client(),
    producer = new Producer(client);

producer.on('ready', function () {
    console.log('Producer is ready');
});

producer.on('error', function (err) {
    console.log('Producer is in error state');
    console.log(err);
})


app.get('/',function(req,res){
    res.json({greeting:'Kafka Producer'})
});

app.listen(5001,function(){
    console.log('Kafka producer running at 5001')
})
So let's run the code and test our API in Postman.






Now let's create a route which can post a message to the topic.

For the Node.js client, kafka-node has a producer.send() method which takes two arguments, the first being "payloads", an array of ProduceRequest objects. A ProduceRequest is a JSON object like:

{
   topic: 'topicName',
   messages: ['message body'], // multi messages should be a array, single message can be just a string or a KeyedMessage instance
   key: 'theKey', // only needed when using keyed partitioner (optional)
   partition: 0, // default 0 (optional)
   attributes: 2 // default: 0 used for compression (optional)
}
Add the code below to get the topic and the message to be sent.

app.post('/sendMsg',function(req,res){
    var sentMessage = JSON.stringify(req.body.message);
    // Build the ProduceRequest array for the topic passed in the request body
    var payloads = [
        { topic: req.body.topic, messages: sentMessage, partition: 0 }
    ];
    producer.send(payloads, function (err, data) {
        if (err) {
            return res.status(500).json(err);
        }
        res.json(data);
    });
})
Now let's run the code and hit our API with a payload. Once the producer pushes the message to the topic, we can see the message being consumed in the shell consumer we created earlier.
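For reference, a request like the following should do the trick (the topic Posts is the one created earlier; the message text is just an example):

$ curl -X POST http://localhost:5001/sendMsg \
  -H "Content-Type: application/json" \
  -d '{"topic": "Posts", "message": "Hello from the Node.js producer"}'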

Now let's create a simple consumer for this in NodeJS.

In NodeJS, Kafka consumers can be created in multiple ways. The following is the simplest one:

Consumer(client, payloads, options)
It takes 3 arguments, as shown above. "client" keeps the connection with the Kafka server, and "payloads" is an array of FetchRequest objects. A FetchRequest is a JSON object like:

{
   topic: 'topicName',
   offset: 0, //default 0
}
All the possible options for the consumer are as below:

{
    groupId: 'kafka-node-group',//consumer group id, default `kafka-node-group`
    // Auto commit config
    autoCommit: true,
    autoCommitIntervalMs: 5000,
    // The max wait time is the maximum amount of time in milliseconds to block waiting if insufficient data is available at the time the request is issued, default 100ms
    fetchMaxWaitMs: 100,
    // This is the minimum number of bytes of messages that must be available to give a response, default 1 byte
    fetchMinBytes: 1,
    // The maximum bytes to include in the message set for this partition. This helps bound the size of the response.
    fetchMaxBytes: 1024 * 1024,
    // If set true, consumer will fetch message from the given offset in the payloads
    fromOffset: false,
    // If set to 'buffer', values will be returned as raw buffer objects.
    encoding: 'utf8'
}
So let's add the code below to create a simple consumer.

var kafka = require('kafka-node'),
    Consumer = kafka.Consumer,
    client = new kafka.Client(),
    consumer = new Consumer(client,
        [{ topic: 'Posts', offset: 0}],
        {
            autoCommit: false
        }
    );
Let us add some simple event handlers, one of which notifies us when a message is consumed. For the simplicity of this article, let us just use console.log:

consumer.on('message', function (message) {
    console.log(message);
});

consumer.on('error', function (err) {
    console.log('Error:',err);
})

consumer.on('offsetOutOfRange', function (err) {
    console.log('offsetOutOfRange:',err);
})
The entire code of the consumer looks like below:

var kafka = require('kafka-node'),
    Consumer = kafka.Consumer,
    client = new kafka.Client(),
    consumer = new Consumer(client,
        [{ topic: 'Posts', offset: 0}],
        {
            autoCommit: false
        }
    );

consumer.on('message', function (message) {
    console.log(message);
});

consumer.on('error', function (err) {
    console.log('Error:',err);
})

consumer.on('offsetOutOfRange', function (err) {
    console.log('offsetOutOfRange:',err);
})
Before testing this consumer, let us first kill the shell consumer, then hit our producer API.


This is the end of this article. In future articles I am planning to showcase more complicated usages of Kafka.

Hope this article helps!




Monday, September 10, 2018

Describe Node.js

Introduction to Node.js

The modern web application has come a long way over the years with the introduction of many popular frameworks such as Bootstrap, AngularJS, etc. All of these frameworks are based on JavaScript.
But when it came to developing server-based applications there was just a void, and this is where Node.js came into the picture.
Node.js is also based on JavaScript, and it is used for developing server-based applications. In this tutorial, we will look into Node.js in detail and how we can use it to develop server-based applications.



What is Node.js?

Node.js is an open-source, cross-platform runtime environment used for development of server-side web applications. Node.js applications are written in JavaScript and can be run on a wide variety of operating systems.
Node.js is based on an event-driven architecture and a non-blocking Input/Output API that is designed to optimize an application's throughput and scalability for real-time web applications.
For a long period of time, the frameworks available for web development were all based on a stateless model. A stateless model is one where the data generated in one session (such as information about user settings and events that occurred) is not maintained for usage in the next session with that user.
A lot of work had to be done to maintain the session information between requests for a user. But with Node.js there is finally a way for web applications to have real-time, two-way connections, where both the client and the server can initiate communication, allowing them to exchange data freely.


Why use Node.js?

We will have a look at the real worth of Node.js in the coming chapters, but what is it that makes this framework so popular? Over the years, most applications were based on a stateless request-response framework. In this sort of application, it was up to the developer to put the right code in place so that the state of the web session was maintained while the user was working with the system.
But with Node.js web applications, you can now work in real time and have two-way communication. The state is maintained, and either the client or the server can start the communication.


Features of Node.js

Let's look at some of the key features of Node.js
1. Asynchronous event-driven IO helps concurrent request handling – This is probably the biggest selling point of Node.js. It basically means that if Node receives a request for some Input/Output operation, it will execute the operation in the background and continue processing other requests.
This is quite different from other programming languages. A simple example of this is given in the code below.

var fs = require('fs');

// Start reading the file; execution continues while the read happens in the background
fs.readFile("Sample.txt", function(error, data) {
    console.log("Reading Data completed");
});

The above code snippet looks at reading a file called Sample.txt. In other programming languages, the next line of processing would only happen once the entire file is read.
But in the case of Node.js the important part of the code to notice is the declaration of the function ('function(error,data)'). This is known as a callback function.
So what happens here is that the file reading operation will start in the background. And other processing can happen simultaneously while the file is being read. Once the file read operation is completed, this anonymous function will be called and the text "Reading Data completed" will be written to the console log.
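To make the contrast concrete, here is a small sketch (assuming a local Sample.txt file exists) that compares the blocking and the non-blocking style of reading the same file:

var fs = require('fs');

// Blocking: nothing else runs until the whole file has been read
var data = fs.readFileSync("Sample.txt");
console.log("Sync read finished, length: " + data.length);

// Non-blocking: the callback fires later, while other work continues
fs.readFile("Sample.txt", function (error, data) {
    if (error) {
        return console.log("Read failed: " + error);
    }
    console.log("Async read finished, length: " + data.length);
});

console.log("This line prints before the async read completes");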

2. Node uses the V8 JavaScript runtime engine, the same one used by Google Chrome. Node provides bindings around this engine, and since V8 compiles JavaScript to native machine code, processing of requests within Node is very fast.
3. Handling of concurrent requests – Another key functionality of Node is the ability to handle concurrent connections with very minimal overhead on a single process (a small illustrative sketch follows after this list).
4. The Node.js library uses JavaScript – This is another important aspect of development in Node.js. A major part of the development community is already well versed in JavaScript, and hence development in Node.js becomes easier for a developer who knows JavaScript.
5. There is an active and vibrant community around the Node.js framework. Because of this active community, key updates are always being made available, which helps keep the framework up to date with the latest trends in web development.
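As a small illustration of point 3, the sketch below runs a single-process HTTP server; the port and the delay are arbitrary choices for the example:

var http = require('http');

// One process handles every connection; slow I/O (simulated here with
// setTimeout) does not block the handling of other incoming requests.
http.createServer(function (req, res) {
    setTimeout(function () {
        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end('Handled without blocking other connections\n');
    }, 100);
}).listen(3000, function () {
    console.log('Server listening on port 3000');
});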


Who uses Node.js

Node.js is used by a variety of large companies. Below is a list of a few of them.
PayPal – A lot of sites within PayPal have already started the transition to Node.js.
LinkedIn – LinkedIn is using Node.js to power their mobile servers, which power the iPhone, Android, and Mobile Web products.
Mozilla has implemented Node.js to support browser APIs, which have half a billion installs.
eBay hosts its HTTP API service in Node.js.

When to Use Node.js

Node.js is best for usage in streaming or event-based real-time applications like
1. Chat applications
2. Game servers – For fast, high-performance servers that need to process thousands of requests at a time, this is an ideal framework.
3. Good for collaborative environments – This is good for environments that manage documents. In a document management environment you will have multiple people who post their documents and make constant changes by checking documents out and in. Node.js is a good fit for these environments because its event loop can be triggered whenever documents are changed.
4. Advertisement servers – Again, here you could have thousands of requests to pull advertisements from the central server, and Node.js can be an ideal framework to handle this.
5. Streaming servers – Another ideal scenario for Node is multimedia streaming servers, where clients request different multimedia content from the server.
Node.js is good when you need high levels of concurrency but little dedicated CPU time.
Best of all, since Node.js is built on JavaScript, it is best suited when you build client-side applications based on the same JavaScript stack.


When to not use Node.js

Node.js can be used for a lot of applications with various purposes. The only scenario where it should not be used is when the application requires long-running, CPU-heavy processing.
Node is structured to be single-threaded, so if an application needs to carry out long-running calculations, the server will be busy doing that calculation and won't be able to process any other requests. As discussed above, Node.js is best when processing needs less dedicated CPU time.
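As a rough sketch of the problem (the loop bound is arbitrary), consider a server where one route performs a CPU-heavy calculation:

var http = require('http');

http.createServer(function (req, res) {
    if (req.url === '/block') {
        // CPU-bound work: this loop monopolizes the single JavaScript
        // thread, so every other request waits until it finishes.
        var total = 0;
        for (var i = 0; i < 1e9; i++) {
            total += i;
        }
        res.end('Done: ' + total + '\n');
    } else {
        res.end('Fast response\n');
    }
}).listen(3001);

While /block is being computed, even the fast route cannot respond, which is exactly why long-running calculations are a poor fit for a single Node.js process.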

Friday, February 23, 2018

Introduction to MapReduce

MapReduce is the processing layer of Hadoop. The MapReduce programming model is designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You need to express your business logic the way MapReduce works, and the rest is taken care of by the framework. The work (complete job) submitted by the user to the master is divided into small units of work (tasks) and assigned to the slaves.
It takes care of the data processing and distributes the individual tasks across the nodes. It consists of two phases –
  • Map
  • Reduce
Map converts an input dataset into another set of data, where individual elements are broken down into key/value pairs.
The Reduce task takes the output of the map as its input and combines those data tuples into a smaller set of tuples. It is always executed after the map job is done.

Features of the MapReduce system
Features of MapReduce are as follows:
  • A framework is provided for MapReduce execution
  • Abstracts the developer from the complexity of distributed programming
  • Partial failure of the processing cluster is expected and tolerated
  • Built-in redundancy and fault tolerance are available
  • The MapReduce programming model is language independent
  • Automatic parallelization and distribution are handled by the framework
  • Fault tolerance
  • Enables data-local processing
  • Shared-nothing architectural model
  • Manages all the inter-process communication
  • Manages the distributed servers running the various tasks in parallel
  • Manages all communications and data transfers between the various parts of the system
  • Provides redundancy and failure handling for the overall management of the whole process

MapReduce follows these simple steps:

  1. Executes the map function on each input record received
  2. The map function emits key/value pairs
  3. Shuffles, sorts, and groups the outputs
  4. Executes the reduce function on each group
  5. Emits the output results on a per-group basis

Map Function

The map function operates on each key/value pair of the data and transforms it based on the transformation logic provided in the map function. The map function always produces key/value pairs as its output.
Map(key1, value1) -> List(key2, value2)

Reduce Function

It takes the list of values for each key and transforms the data based on the (aggregation) logic provided in the reduce function.
Reduce(key2, List(value2)) -> List(key3, value3)
Map Function for Word Count

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}

Reduce Function for Word Count

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}
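For readers coming from the Node.js articles above, here is a plain JavaScript sketch of the same word-count idea. It is only an illustration of the map, shuffle/group, and reduce phases, not Hadoop code:

// Map phase: emit a (word, 1) pair for every word in a line
function map(line) {
    return line.split(/\s+/).filter(Boolean).map(function (word) {
        return { key: word, value: 1 };
    });
}

// Reduce phase: sum the values collected for one key
function reduce(key, values) {
    var sum = values.reduce(function (a, b) { return a + b; }, 0);
    return { key: key, value: sum };
}

// Shuffle/group step: collect the emitted values per key, then reduce each group
var lines = ["the quick brown fox", "the lazy dog"];
var grouped = {};
lines.forEach(function (line) {
    map(line).forEach(function (pair) {
        (grouped[pair.key] = grouped[pair.key] || []).push(pair.value);
    });
});
Object.keys(grouped).forEach(function (word) {
    console.log(reduce(word, grouped[word])); // e.g. { key: 'the', value: 2 }
});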

MapReduce is the framework used for processing large amounts of data on commodity hardware in a huge cluster ecosystem. MapReduce is a powerful method of processing data when there is a large number of nodes connected to the cluster. The two important tasks of the MapReduce algorithm are Map and Reduce.
The main goal of the Map task is to take a large set of data and convert it into another set of data that is broken down into tuples (rows) or key/value pairs. The Reduce task then takes the output of the Map task as its input and converts those data tuples into a much smaller set of tuples. The Reduce task always follows the Map task.
The biggest strength of the MapReduce framework is its scalability. Once a MapReduce program is written, it can easily be scaled to work over a cluster which has hundreds or even thousands of nodes in it. In this framework, the computation is actually sent to where the data resides.

Terminology

PayLoad– These are the applications that are implemented for the Map and Reduce functions.
Mapper– This application maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode– This node manages the HDFS.
DataNode– The node where the data resides before any processing takes place.
MasterNode– The node where the JobTracker runs and which receives job requests from clients.
SlaveNode– The node where the Map and Reduce programs run.
JobTracker– Schedules jobs and tracks the jobs assigned to the Task Tracker.
Task Tracker– Tracks the tasks and reports their status to the JobTracker.
Job– An execution of a Mapper and a Reducer across a dataset.
Task– An execution of a Mapper or a Reducer on a slice of data.
Task Attempt– This is an attempt to execute a task on a SlaveNode.