Dealing with tons of data requires some special arrangement. Common computational techniques are insufficient to handle a floodgate of data;, more so, when they are coming from multiple sources. In Big Data, the magnitude we are talking about is massive—measured in zettabytes, exabytes, or millions of petabytes or billions of terabytes. The framework called Hadoop is popularly used to handle some of the issues with Big Data management. This article attempts to give an introductory idea about Hadoop in the light of Big Data.
Nothing happens with a big bang. The origin and evolution of Hadoop is gradual and according to the need of the hour in dealing with Big Data. To state briefly, it owes its origin to Doug Cutting's Apache Nutch project in the year 2003, especially in the beginning the code part of it. The genesis was developed from Google File System (GFS), a paper that was published in October 2003, which influenced another paper called MapReduce: Simplified Data Processing on Large Clusters. The code for HDFS in Hadoop is factored out from the Apache Nutch project in 2006 and is greatly influenced by the GFS and MapReduce algorithms. And, the fact that the name "Hadoop" came from Cutting's son's stuffed elephant toy clearly resonates the idea that there is an elephant in the room which Hadoop clearly wants to address or deal with.
In a Nutshell
Today, Hadoop has grown from its monolithic beginning to be a software library, a framework to develop applications that require distributed processing of massive amounts of data lying across clusters of computers using simple programming models. It can scale up from single server to thousands of machines. The idea is to distribute computation and storage across multiple computers to leverage the processing of large sets of data. The library has the capability to detect failures at the level of the application layer so that the programmer can handle them and deliver service on top of a cluster of computers rather than permeate down the failure to one or more lower levels where it becomes more difficult to manage or overcome.
Hadoop, therefore, is a combination of tools and open source libraries supported by Apache for creating applications for distributed computing which is highly reliable and scalable.
How It Works
There are three ways Hadoop basically deals with Big Data:
- The first issue is storage. The data is stored in multiple computing machines in a distributed environment where they can be processed in parallel to reduce time and resources. The data is persisted in an environment called Hadoop Distributed File System (HDFS), which is used to store data in multiple formats across clusters of machines. For this purpose, it divides data into blocks and stores it across different data nodes. It uses a technique called horizontal scaling to add extra data nodes to existing HDFS clusters according to the requirement. This maximizes the utilization of existing resource instead of adding one whenever the need to scale up arises.
- The second issue is to accommodate the variety of data. HDFS is equipped to store all kinds of data, be it structured, semi-structured, or unstructured. There is no pre-dumping schema validation. Data, once written, can be read multiple times without any problem.
- The third issue is of processing and how to access the stored data. In this regard, The MapReduce algorithm come to the rescue, where processing is distributed across slave nodes to work in parallel and the result is sent back to the master node. The master node merges the results before furnishing the final outcome. This part is handled by the YARN, which is designed for parallel processing of data stored in HDFS.
There are many intricate parts to it, but this is what Hadoop does in a nutshell. The idea of modules would give further insight.
Apache Hadoop Project consist of six modules. The first four are as follows:
- Hadoop Common: This consists of utilities commonly used by Hadoop and supports other Hadoop modules. It is also known as Hadoop Core and is an essential part of the ecosystem, along with HDFS, YARN, and MapReduce. It is in this part that Hadoop presumes that hardware is prone to failure and any necessary means are provided for a programmer to automatically handle failure in software.
- Hadoop Distributed File System (HDFS): A distributed file system that can accommodate a variety of files across a distributed environment. It breaks the files into blocks and stores them across nodes in a distributed architecture. It provides horizontal scaling instead of vertical scaling, to additional clustering up. It is highly fault-tolerant and low cost in terms of hardware deployment capabilities.
- Hadoop YARN: This is the CPU of the Hadoop framework. With two major components, called NodeManager and ResourceManager, YARN performs all the processing activities such as resource allocation, task scheduling, and cluster management.
- Hadoop MapReduce: This is a framework to do all the parallel computation. MapReduce is a parallel programming model to process data in a distributed environment. It is ideally used to write distributed applications that can efficiently process large amounts of data across clusters of commodity hardware. It segments the process into two phases, called Map and Reduce, where the task of the Mapper class is to take the input, tokenize, map, and sort it. The output then becomes the input to the Reducer class, which searches matching pairs and reduce them. There are key-value pairs for input and output in each phase, and the type of the pair is determined by the programmer.
Two new sub-projects are added recently:
- Hadoop Ozone: It is a scalable, redundant, and distributed object store for Hadoop. Apart from scaling to billions of objects of varying sizes, Ozone can function effectively in containerized environments such as Kubernetes and YARN. It is built on a highly available, replicated block storage layer called Hadoop Distributed Data Store (HDDS). [An excerpt. Click to find more.]
- Hadoop Submarine: A machine learning engine for Hadoop. It is a project that allows an infra engineer/data scientist to run deep learning applications (Tensorflow, Pytorch, and so forth) on a resource management platform (like YARN). [An excerpt. Click to find more.]
Hadoop has made a significant impact on the searches, in the logging process, in data warehousing, and Big Data analytics of many major organizations, such as Amazon, Facebook, Yahoo, and so on. It is a one stop solution for storing a massive amount of data of any kind, accompanied by scalable processing power to harness virtually limitless concurrent jobs. In a few words, the popularity of Hadoop owes much to its fault-tolerant, scalable, cost-effective, and fast capability.