In earlier articles, we discussed end-to-end Business Intelligence (BI) systems and their architecture, including all the important components. A BI system helps the user community understand business trends and the health of the organization. Today, data volumes are growing and the data paradigm is changing; this opens a new door for advanced data analytics and for areas that yield business insight.
In this article, we'll discuss how the Hadoop framework differs from conventional BI and why it is becoming a future BI platform for handling huge volumes of data with faster processing times.
Hadoop: an Understanding
We have discussed at length how BI systems handle large volumes of data, but data size is now increasing so drastically that it is hard to handle with conventional BI while still meeting business objectives. As the data paradigm changes, the way data is handled is changing too, to help the user community.
Data is becoming big: huge volume, wide variety, and high velocity, including unstructured data. To handle all of this, Apache introduced a framework called Hadoop. Hadoop is not only a framework for handling "3 Vs" data; it also offers other benefits, which is why it has become one of the most influential buzzwords in the industry.
Hadoop is still evolving, but most organizations have started to think beyond the proof of concept. They adopt Hadoop to handle their data and to benefit from features such as less expensive commodity hardware, distributed parallel processing, and high availability. The Hadoop framework design supports a scale-out approach, in which data storage and computation happen on each commodity server and capacity grows by adding servers.
We'll discuss more detail about Hadoop in the following sections.
We discussed various components of conventional BI systems in previous articles. Today, we'll discuss the overall Hadoop architecture and its various components.
"Hadoop is a framework that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer."
Figure 1 gives a high-level view of the Hadoop ecosystem.
Figure 1: The Hadoop ecosystem
The Hadoop framework has four core capabilities: Hadoop Common, Hadoop Distributed File System, Hadoop YARN, and Hadoop MapReduce.
- Hadoop Common: A set of common utilities that support the other Hadoop capabilities.
- Hadoop Distributed File System (HDFS): HDFS is a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system.
- Hadoop YARN: Yet Another Resource Negotiator (YARN) is the cluster resource manager; it allocates resources such as CPU and memory to applications running on a Hadoop cluster. The first generation of Hadoop could run only MapReduce applications. YARN adds a pluggable resource-management layer that lets other data processing engines work with data stored in HDFS.
- Hadoop MapReduce: A YARN-based batch processing system that supports parallel processing of large data sets, including unstructured data.
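To make the MapReduce idea concrete, here is a minimal, single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. In a real cluster, the map and reduce calls would run in parallel on many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data in parallel"]
    print(word_count(sample))
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across servers, which is what makes the model suitable for commodity clusters.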
Here we discussed the core capabilities of a Hadoop system. In the next section, we'll discuss various components of the Hadoop framework.
Data Storage In Hadoop
Hadoop Distributed File System (HDFS) provides a reliable way to store data in the Hadoop framework and serves as the backbone of a Hadoop implementation. In HDFS, the actual data is stored across a cluster of less expensive commodity servers. An HDFS cluster has one Namenode and one or more Datanodes. The Namenode is the master node that maintains the file system's metadata, and a Datanode is a worker node where the actual data is stored. Whenever a Hadoop application reads or writes a file in HDFS, it contacts the Namenode first to get the file's metadata; the data itself is then read from or written to the Datanodes directly.
Figure 2, taken from the HDFS user guide, will help you understand the HDFS architecture and the data flow within a Hadoop application.
Figure 2: The HDFS architecture and data flow
HDFS is a highly configurable, distributed data storage unit. It provides the key architectural characteristics, such as scalability, flexibility to store any type of data (structured and unstructured), and fault tolerance.
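The division of labor described above (metadata on the Namenode, data on the Datanodes) can be sketched as a toy in-memory model. This is an illustration only, not how HDFS is implemented: the class names, the tiny `BLOCK_SIZE`, the `REPLICATION` value, and the round-robin placement are all assumptions made for the example.

```python
import itertools

BLOCK_SIZE = 8   # bytes per block; illustrative only, real HDFS blocks are far larger
REPLICATION = 2  # copies kept of each block; illustrative value


class DataNode:
    """Stores block contents; knows nothing about file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Holds metadata only: which blocks form a file and where the replicas live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.files = {}  # path -> list of (block_id, [datanode names])
        self._ids = itertools.count()
        self._cursor = 0

    def allocate(self, path, num_blocks):
        """Pick replica targets for each block using simple round-robin placement."""
        layout = []
        n = len(self.datanodes)
        for _ in range(num_blocks):
            block_id = next(self._ids)
            targets = [self.datanodes[(self._cursor + i) % n].name
                       for i in range(REPLICATION)]
            self._cursor += 1
            layout.append((block_id, targets))
        self.files[path] = layout
        return layout

    def locate(self, path):
        return self.files[path]


def write_file(namenode, nodes_by_name, path, data):
    """Ask the NameNode for a layout, then send the bytes straight to the DataNodes."""
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    layout = namenode.allocate(path, len(chunks))
    for (block_id, targets), chunk in zip(layout, chunks):
        for name in targets:
            nodes_by_name[name].blocks[block_id] = chunk


def read_file(namenode, nodes_by_name, path, failed=()):
    """Look up block locations, then read each block from the first live replica."""
    parts = []
    for block_id, targets in namenode.locate(path):
        name = next(t for t in targets if t not in failed)
        parts.append(nodes_by_name[name].blocks[block_id])
    return b"".join(parts)


if __name__ == "__main__":
    nodes = [DataNode(f"dn{i}") for i in range(3)]
    by_name = {node.name: node for node in nodes}
    namenode = NameNode(nodes)
    write_file(namenode, by_name, "/logs/a.txt", b"hello distributed world")
    print(read_file(namenode, by_name, "/logs/a.txt", failed={"dn0"}))
```

The read still succeeds when one Datanode is marked as failed because every block has a second replica elsewhere, which is the essence of the fault tolerance mentioned above.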
Understanding Hadoop Components
The Hadoop framework includes a varied set of components that give you multiple ways to implement a solution, either with the technical expertise already available or by building new skills. These components help manage data storage, high-performance I/O, data analytics, and system monitoring. Here are some of the components of the Hadoop framework, in alphabetical order: Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, Spark, Sqoop, Tez, and Zookeeper.
- Ambari: Ambari is a web-based, user-friendly tool for provisioning, monitoring, and managing Hadoop clusters. It shows system health status on dashboards and sends alerts whenever attention is needed.
- Avro: Avro is a data serialization system that comes with rich data structures and provides a compact, fast binary data format. It also supports remote procedure calls (RPC).
- Cassandra: Cassandra is a highly scalable and available database that is a useful platform for mission-critical data. Cassandra provides very high performance using column indexes, materialized views, built-in caching, and denormalization.
- Chukwa: Chukwa is a data collection system for monitoring and analyzing large distributed systems. It's built on top of Hadoop's distributed file system and MapReduce framework, and inherits Hadoop's scalability and robustness. It also provides a flexible toolkit for displaying, monitoring, and analyzing results to make the best use of collected data.
- HBase: HBase is a non-relational database that can be used when random, real-time read/write access to Big Data is needed. HBase runs on top of HDFS and provides fault-tolerant storage.
- Hive: Hive is data warehouse software in the Hadoop world. It's built on top of the MapReduce framework and gives a way to run SQL-like queries over massive amounts of data in Hadoop. Its query language is called HiveQL.
- Mahout: Mahout is a tool in the Hadoop framework that helps find meaningful patterns in huge data sets. Mahout is "A Scalable machine learning and data mining library."
- Pig: Pig is a platform for processing and analyzing huge amounts of data through an extensible, optimized high-level language, Pig Latin. Pig Latin defines a set of data transformations, such as aggregation and sorting, and can be extended with user-defined functions. Programs written in Pig Latin are compiled into MapReduce jobs for execution.
- Spark: Spark is a data computing engine; it supports fast, iterative algorithms through a robust set of APIs and, by the project's own claim, can process large-scale data up to 100x faster than Hadoop MapReduce when working in memory. Spark supports many types of applications, including machine learning, stream processing, graph computation, and large-scale ETL.
- Sqoop: Sqoop is a tool for transferring data between Hadoop and relational databases (structured data stores). It imports data from relational databases into HDFS or related systems such as Hive and HBase, and can export data back to those databases. It supports parallel data loading for enterprise data sources.
- Tez: Tez provides a powerful and flexible engine to execute an arbitrary directed acyclic graph (DAG) of tasks to process data of both batch and interactive types. Tez generalizes the MapReduce paradigm to execute complex tasks for big data processing and is built on Hadoop YARN.
- Zookeeper: Zookeeper is an operational tool that provides distributed configuration services, group services, and a naming registry for distributed systems. Zookeeper offers a fast and reliable interface.
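The DAG execution model mentioned for Tez can be illustrated with a small sketch: each task runs only after all of its upstream tasks have finished, and it receives their results. This is not the Tez API; `run_dag` and the task names are hypothetical, and the example relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, dependencies):
    """Run each task once, after all of its upstream tasks have finished.

    tasks: maps a task name to a function that takes a dict of upstream results.
    dependencies: maps a task name to the set of task names it depends on.
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    results = {}
    # static_order() yields every task after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        upstream = {dep: results[dep] for dep in graph[name]}
        results[name] = tasks[name](upstream)
    return results


if __name__ == "__main__":
    # Hypothetical pipeline: two independent loads feed two totals and one final join.
    tasks = {
        "load_orders": lambda up: [120, 80, 200],
        "load_refunds": lambda up: [20],
        "total_orders": lambda up: sum(up["load_orders"]),
        "total_refunds": lambda up: sum(up["load_refunds"]),
        "net_revenue": lambda up: up["total_orders"] - up["total_refunds"],
    }
    dependencies = {
        "total_orders": {"load_orders"},
        "total_refunds": {"load_refunds"},
        "net_revenue": {"total_orders", "total_refunds"},
    }
    print(run_dag(tasks, dependencies)["net_revenue"])
```

A plain MapReduce job is the special case of a two-stage graph (map, then reduce); a DAG engine lets a multi-stage pipeline like this one run as a single job instead of a chain of separate jobs.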
With these sets of components, we can understand the Hadoop paradigm and how these components can help develop an end-to-end BI solution to handle massive data.
Analytics Using Hadoop
The Hadoop framework not only provides a way to store and manage high volumes of data but also opens new dimensions in the data analytics world. Data analysts use Hadoop components to explore, structure, and analyze the data, and then turn it into business insight. Hadoop lets analysts store data in any format and apply a schema only when the data is read for analysis (schema on read), rather than transforming the data into a specified schema at load time (schema on write), as a conventional BI solution does.
Hadoop extends conventional business decision making with solutions that increase the use and business value of analytics throughout the organization. It also represents a shift that increases elasticity and provides a faster time to value, because data doesn't have to be modeled or integrated into an enterprise data warehouse (EDW) before it can be analyzed.
In this article, we discussed how Hadoop is changing the way massive data is handled, offering faster response times on lower-cost hardware. With these characteristics, it makes it possible to analyze new dimensions of data and gain business insight that conventional BI left untouched. Business users are considering Hadoop as a future BI platform and investing in enterprise-level solutions built from Hadoop components.
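The idea of storing data as-is and deciding on a schema only at analysis time can be sketched in a few lines of Python. The raw events, field names, and the `read_with_schema` helper below are hypothetical and exist only to illustrate the approach; they do not correspond to any Hadoop component's API.

```python
import json

# Raw events stored exactly as they arrived; fields vary from record to record.
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "country": "IN"}',
    '{"user": "u2", "amount": "5", "device": "mobile"}',
    '{"user": "u3"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields, cast types, use None if missing."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


# Two analyses impose two different schemas on the same stored data.
sales_schema = {"user": str, "amount": float}
device_schema = {"user": str, "device": str}

sales = list(read_with_schema(RAW_EVENTS, sales_schema))
devices = list(read_with_schema(RAW_EVENTS, device_schema))

if __name__ == "__main__":
    print(sales)
    print(devices)
```

In a schema-on-write system, the third record would typically have to be rejected or padded at load time, and a later question about `device` would require changing the load process; here the raw data stays untouched and each analysis brings its own schema.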
About the Author
Anoop worked for Microsoft for almost six and a half years and has 12+ years of IT experience. Currently, he is working as a DW/BI Architect at one of the top Fortune companies. He has worked on the end-to-end delivery of enterprise-scale DW/BI projects. He has strong knowledge of database, data warehouse, and business intelligence application design and development. He has also worked extensively on SQL Server, ETL design using SSIS, SSAS, SSRS, and SQL Azure.
Disclaimer: I help people and businesses make better use of technology to realize their full potential. The opinions mentioned herein are solely mine and do not reflect those of my current employer or previous employers.