The core technique of storing files in storage lies in the file system that the operating environment uses. Unlike common filesystems, Hadoop uses a different filesystem that deals with large datasets across a distributed network. It is called Hadoop Distributed File System (HDFS). This article introduces the idea, with related background information to begin with.
What Is a Filesystem?
A filesystem typically is a method and data structure the operating system uses to manage files on a disk or partition. From the perspective of a magnetic disk, every data is a charge stored in sectors across tracks. Think of tracks as spiral rows and sectors as the tiny cells across the spiral tracks. Now, if we request the disk to locate some data, it, at best, can re-direct its head to some sectors in the spiral sequence. This raw data is not meaningful unless the operating system comes into the picture; it is in charge of delimiting the information from a collection of sectors to be recognized as a file. An operating system organizes the information into a bookkeeping data structure called a filesystem. This structure defines the bookkeeping pattern. But, there is some technical difference about how OSes manage this structure. For example, Windows uses the FAT32, NTFS model, Linux uses EXT2, EXT3, and so forth. But, the basic idea is that they all organize the data according to some defined structure.
Filesystem organization is primarily responsible for managing the creation, modification, and deletion of files (directories are also files), disk partition, file sizes, and so on, and they directly operate on raw sectors of a disk or partition.
Files in a Distributed System
The characteristics of a distributed system are different in the sense that the storage is strewn across multiple machines in a network. A single repository cannot contain such a large amount of data. If a single machine has a limited storage capacity and processing power, but, when the processing job and storage is distributed among machines across the network, the power and efficiency become manifold. This not only opens up the possibility of extensive processing power but also leverages the use of the existing infrastructure. This result is that the cost is minimized, yet efficiency is increased. Every single machine in the network becomes a potential workhorse that houses limited data while collectively being a part of unlimited store and extensive processing power. The tradeoff is complexity. If that can be harnessed with innovative techniques, a distributed system is excellent to deal with the problems of big data. The HDFS filesystem aims to achieve that. If fact, beyond HDFS, there are many other similar distributed file systems, such as IBM's GPFS (General Parallel File System), Ceph, (Wikipedia link: list of distributed file systems), and the like. They all try to address this issue from various directions with varied success rates.
The normal filesystem was designed to work on a single machine or single operating environment. The datasets in Hadoop require storage capacity beyond what a single physical machine can provide. Therefore, it becomes imperative to partition data across a number of machines. This requires a special process to manage the files across the distributed network. HDFS is the file system that specifically addresses this issue. This filesystem is more complex than regular a filesystem because it has to deal with network programming, fragmentation, fault tolerant, compatibility with local file system, and so forth. It empowers Hadoop to run Big Data applications across multiple servers. It is characterized by being highly fault tolerant with high data throughput across low-cost hardware. The objective of HDFS file system is as follows:
- To deal with very large files
- The streaming data access to the file system must leverage a write once and read many times pattern.
- Run on inexpensive commodity hardware
- It must leverage low latency data access.
- Support a massive number of files
- Support multiple file writers with arbitrary file modification
A smallest amount of data that is read and written onto a disk has something called block size. Typically, the size of this block is 512 bytes and file system blocks are a few kilobytes. HDFS works on the same principle, but the size of the block is much larger. The larger block size leverages the search by minimizing seeks and therefore cost. These blocks are distributed throughout something called clusters, which are nothing but blocks and copies of blocks on different servers in the network. Individual files are replicated across servers in the cluster.
There are two types of nodes operating in the cluster in a master-slave pattern. The master node is called namenodes and the worker node is called datanodes. It is through these nodes HDFS maintains the file (and directory) system tree and metadata. In fact, a file is split into blocks and stored in a subset of datanodes to spread across the cluster. The datanode is responsible for read, write, block creation, deletion, and replication requests in the file system.
The namenodes, on the other hand, are servers that monitor access to the file system and maintain data files in the HDFS. They map blocks to the datanode and handles file/directory open, close, and rename requests.
Datanodes are the core part of the filesystem and do the job of storage and retrieval of block requests from the client. Namenode is the maintainer to whom datanodes report. This means that if namenodes are obliterated, the information about the files would be lost. Therefore, Hadoop makes sure that the name node is resilient enough to withstand any kind of failure. One technique to ensure that is to back it up in a secondary namenode by periodically merging the namespace image with the edit log. The secondary namenode usually resides on a separate machine to take over as the primary namenode in case of a major failure.
There are many ways to interact with the HDFS filesystem, but the command line interface is perhaps the simplest and most common. Hadoop can be installed onto one machine and run to get a firsthand taste of it. we'll cover that in subsequent articles, so stay tuned.
The HDFS filesystem operations are quite similar to the normal filesystem operations. Here are some listings just to give an idea.
Copies files from the local filesystem to HDFS:
% hadoop fs -copyFromLocal docs/sales.txt hdfs://localhost/ user/mano/sales.txt
Creates a directory in HDFS:
% hadoop fs -mkdir students
Lists files and directories in the current working directory in HDFS:
% hadoop fs -ls .
HDFS is an implementation of what a filesystem represented by Hadoop's abstraction does. Hadoop is written in Java; hence, all filesystem interactions are interceded through the Java API. The command line interface is a shell provided for common interactions. The study of HDFS opens a different horizon to the sector of distributed architecture and its intricate working procedures. A lot of work is going on to perfect this model of computing, of which the impetus undoubtedly being Big Data in recent years.