by Anoop Agarwal
In the last couple of articles, I discussed Hadoop/Big Data and its benefits. We know how Big Data is evolving and the Hadoop ecosystem can be useful to do analytics on a massive amount of un/semi/structured data.
We'll discuss various tools that are the key drivers to achieve the objective, mean data analysis on Big Data. After reading this article, you will know the fundamental ways to use some of the important Big Data tools that belong to different developer communities.
Big Data/Hadoop Tools
In the Big Data world within the Hadoop ecosystem, there are many tools available to process data laid on HDFS. As with the Hadoop framework, these tools also are part of open source like Hive, Pig, writing Map-Reduce program using Java, HBase, Phoenix, and many more. All these tools cover different parts of the developer community; for example: Java developers, SQL Programmers or Non Jan & SQL programmers, and so forth.
In the last article, we discussed various tools of the Hadoop ecosystem. In this series of articles we'll discuss three important tools: Writing Map-Reduce in Java, Hive, and Pig. These three tools belong to different technology stacks:
- JAVA Programmer/Developer: People who belong to this technology can process data by developing Map-Reduce programs in Java.
- SQL Programmer/Developer: People who belong to this technology can process data by using HIVE-QL Script and other Structured Query Languages.
- Non JAVA or SQL Programmer/Developer: People whobelong to this technology can process data by using a PIG script.
These three tools belong to different developer communities that widen the opportunity to most of the developer community to leverage the Hadoop framework for data processing in the Big Data world.
Understanding the MAP-Reduce Framework
Map-Reduce, or MR, is a framework written in Java to solve a problem that will be considered as a Job in the Hadoop environment. There are other components involved to run a MR program and give the desired results.
Anatomy of MR or Working of MR program: MR framework has set of trackers JobTracker & TaskTracker who help to run MR jobs on all available nodes of Hadoop cluster separately, track the processing and combine the output of MR jobs to produce the desired results.
Figure 1 can help you nderstand the flow of MR Job execution.
Figure 1: The flow of MR Job execution
Here is the step-by-step understanding of MR job processing and JobTracker & TaskTracker responsibilities:
Step 1: The client submits an MR job.
Step 2: JobTracker receives the MR job from the client and follows these steps:
Step-2.1: JobTracker assigns JobID and place job into an internal queue.
Step 2.2: The job scheduler picks the job from a queue and initializes it by creating an object encapsulated with its tasks and other information, such as the progress of each task.
Step 2.3: JobTracker uses an internally scheduling algorithm that determines which job's task should be processed first.
Step 2.4: JobTracker assigns tasks to TaskTracker, which is ready to available to run the task.
Step 3: TaskTracker has a fixed number of slots for its map and reduce task. By default, there are two slots for mapping and two slots for reducing tasks.
Step 3.1: TaskTracker first copies the jar from the shared file system into the local task tracker file system.
Step 3.2: TaskTracker creates a local working directory for task and unjars the content.
Step 3.3: TaskTracker creates an instance of TaskRunner to run each task. For each task, TaskRunner uses a new JVM so that the user-defined map and reduce functions don't affect other tasks.
Step 3.3.1: Clean up, The cleanup action commits the task where the output of tasks must be written; in other words, the location of the file.
Step 3.3.2: Setup actions. The Setup action involves initialization of the job with some parameters.
Step 3.3.3: Trigger the task execution.
Step 3.3.4: TaskRunner sends each task status to TaskTracker by sending its flag status.
Step 3.4: TaskTracker sends a heartbeat every five seconds to JobTracker to intimate the status of all tasks assigned to it.
Step 4: Once JobTracker receive the final status from all TraskTracker jobs, it follows this sequence:
Step 4.1: JobTracker combines all updates received from TaskTracker.
Step 4.2: Internally updates the status of JobID and generates the output.
Step 4.3: Returns job status to the MR client when it pings JobTracker.
Step 5: The Job client receives its status that is completed and the status is printed to the console, to be viewed by the user.
The preceding set of steps is helpful to understand the end-to-end flow of an MR Job. There are lots of other options available to change the priority and execution slots that can be configured in mapred-site.xml and many more.
Today, we'll learn to write a first MR program for solving one problem:
Problem: Find out how many people belong to each state.
For sample purposes, I prepared a users.txt file with five columns. Following is the file structure with sample data populated:
Pre-Requisite to the MR Program
To solve the previous sample problem, there are certain things that should be available and configured properly to get the desired output. I'll show you what tools should be installed and the required configuration that should be in place as a pre-requisite to start writing your first MR program.
- Eclipse IDE (Any latest version)
- JAVA 1.6 or 1.6+
- Hadoop 1.2.1
You can get all required configuration files after extracting the jar present in the Hadoop_1.2.1/conf folder:
Figure 2: Contents of mapred-site.xml
Figure 3: Contents of hdfs-site.xml
Figure 4: Contents of core-site.xml
Figure 5: Contents of hadoop-env.sh
You need to follow the next steps to confirm that Hadoop installed and all configurations are placed properly:
Step 1: Open a command prompt and browse the <installed-path>/hadoop-1.2.1 path.
Step 2: Run the ./bin/start-all.sh command to start Hadoop components (HDFS & MapReduce).
Step 3: You can validate that HDFS & MapReduce are running properly by using the following URLs in a browser:
Step 4: You can stop HDFS and MapReduce by running the followingcommand:
Once we are ready with the pre-requisites, we'll start writing the first MR program to solve the preceding problem. This MR program would contain three files:
- Driver Program
- Mapper Program
- Reduce Program
Let's open Eclipse and follow these steps to create the required programs:
Figure 6: Creating the required programs
Step 1: Create a new workspace with the name of "Hadoop"
Step 2: Create a new Java project "MRProgram" in Eclipse by using the JAVA environment built on Java 1.6 and above.
Step 3: Include the dependent Hadoop libraries into the "MRProgram" project.
Step 4: Create an input folder in the "MRProgram" project and create users.txt with the data specified in the section titled "Sample Problem".
Step 5: Open a .classpath of the "MRProgram" project, and make sure it has the following entries:
Figure 7: Examining the class path
Step 6: Write a Driver Program, its main file, which acts as a Job to submit the Mapper and Reducer programs. For example: UserDriver.java.
Figure 8: The UserDriver.java file
Step 7: Write a Mapper Program. It reads file line by line, parses each line separated by a ',' (comma), and then it writes data in the form of a key and value pair and makes it available to the reducer program. In our case, Key will be the state name and value will be the whole line.
Figure 9 is the UserMapper.java program:
Figure 9: The UserMapper.java program
Figure 10: The UserMapper.java program, after being run
Step 8: Write a Reducer Program. It reads mapper output, and processes each key and value of mapper. In our case, we have to count how many users are from each state.
Figure 11 is the Reducer program:
Figure 11: The Reducer program
Step 9: Run the UserDriver.java program in Eclipse. It will execute both the Mapper and Reducer programs and creates and output folder with following two files:
_SUCCESS: _SUCCESS does not contain anything.
Part-r-0000: Part-r-0000 contains the desired output of the MR Program.
Figure 12: Demonstrating Part-r-0000
The preceding output is the desired result, which gives state wise user count.
The output directory should be deleted before starting UserDriver.Java because the MR job doesn't overwrite the existing output directory and gives an exception.
In this article, we talked about the Big Data tools that are useful in data analysis. We discussed the fundamentals of the Map-Reduce framework and MR job processing. The availability of Map=Reduce has provided an immense opportunity for the Java developer community to enter into the data and analysis world. We wrote a sample MR program to solve the sample problem to understand the end-to-end requirement for the MR Program and its step-by-step execution.
In next article of this series, we'll talk about Hive and Pig.
About the Author
Anoop worked for Microsoft for almost six and half years and has 12+ years of IT experience. Currently, he is working as a DW\BI Architect in one of the top Fortune Companies. He has worked on end-to-end delivery of enterprise-scale DW\BI projects. He carries strong knowledge on database, data warehouse, and business intelligence application design and development and Hadoop/Big Data. Also, he worked extensively on SQL Server, designing ETLs using SSIS, SSAS, SSRS, and SQL Azure.
Disclaimer: I help people and businesses make better use of technology to realize their full potential. The opinions mentioned herein are solely mine and do not reflect those of my current employer or previous employers.