Unlike any I/O subsystem, Hadoop also comes with a set of primitives. These primitive considerations, although generic in nature, go with the Hadoop IO system as well with some special connotation to it, of course. Hadoop deals with multi-terabytes of datasets; a special consideration on these primitives will give an idea how Hadoop handles data input and output. This article quickly skims over these primitives to give a perspective on the Hadoop input output system.
Data integrity means that data should remain accurate and consistent all across its storing, processing, and retrieval operations. To ensure that no data is lost or corrupted during persistence and processing, Hadoop maintains stringent data integrity constraints. Every read/write operation occurs in disks, more so through the network is prone to errors. And, the volume of data that Hadoop handles only aggravates the situation. The usual way to detect corrupt data is through checksums. A checksum is computed when data first enters into the system and is sent across the channel during the retrieval process. The retrieving end computes the checksum again and matches with the received ones. If it matches exactly then the data deemed to be error free else it contains error. But the problem is - what if the checksum sent itself is corrupt? This is highly unlikely because it is a small data, but not an undeniable possibility. Using the right kind of hardware such as ECC memory can be used to alleviate the situation.
This is mere detection. Therefore, to correct the error, another technique, called CRC (Cyclic Redundancy Check), is used.
Hadoop takes it further and creates a distinct checksum for every 512 (default) bytes of data. Because CRC-32 is 4 bytes only, the storage overhead is not an issue. All data that enters into the system is verified by the datanodes before being forwarded for storage or further processing. Data sent to the datanode pipeline is verified through checksums and any corruption found is immediately notified to the client with ChecksumException. The client read from the datanode also goes through the same drill. The datanodes maintain a log of checksum verification to keep track of the verified block. The log is updated by the datanode upon receiving a block verification success signal from the client. This type of statistics helps in keeping the bad disks at bay.
Apart from this, a periodic verification on the block store is made with the help of DataBlockScanner running along with the datanode thread in the background. This protects data from corruption in the physical storage media.
Hadoop maintains a copy or replicas of data. This is specifically used to recover data from massive corruption. Once the client detects an error while reading a block, it immediately reports to the datanode about the bad block from the namenode before throwing ChecksumException. The namenode then marks it as a bad block and schedules any further reference to the block to its replicas. In this way, the replica is maintained with other replicas and the marked bad block is removed from the system.
For every file created in the Hadoop LocalFileSystem, a hidden file with the same name in the same directory with the extension .<filename>.crc is created. This file maintains the checksum of each chunk of data (512 bytes) in the file. The maintenance of metadata helps in detecting read error before throwing ChecksumException by the LocalFileSystem.
Keeping in mind the volume of data Hadoop deals with, compression is not a luxury but a requirement. There are many obvious benefits of file compression rightly used by Hadoop. It economizes storage requirements and is a must-have capability to speed up data transmission over the network and disks. There are many tools, techniques, and algorithms commonly used by Hadoop. Many of them are quite popular and have been used in file compression over the ages. For example, gzip, bzip2, LZO, zip, and so forth are often used.
The process that turns structured objects to stream of bytes is called serialization. This is specifically required for data transmission over the network or persisting raw data in disks. Deserialization is just the reverse process, where a stream of bytes is transformed into a structured object. This is particularly required for object implementation of the raw bytes. Therefore, it is not surprising that distributed computing uses this in a couple of distinct areas: inter-process communication and data persistence.
Hadoop uses RPC (Remote Procedure Call) to enact inter-process communication between nodes. Therefore, the RPC protocol uses the process of serialization and deserialization to render a message to the stream of bytes and vice versa and sends it across the network. However, the process must be compact enough to best use the network bandwidth, as well as fast, interoperable, and flexible to accommodate protocol updates over time.
Hadoop has its own compact and fast serialization format, Writables, that MapReduce programs use to generate keys and value types.
Data Structure of Files
There are a couple of high-level containers that elaborate the specialized data structure in Hadoop to hold special types of data. For example, to maintain a binary log, the SequenceFile container provides the data structure to persist binary key-value pairs. We then can use the key, such as a timestamp represented by LongWritable and value by Writable, which refers to logged quantity.
There is another container, a sorted derivation of SequenceFile, called MapFile. It provides an index for convenient lookups by key.
These two containers are interoperable and can be converted to and from each other.
This is just a quick overview of the input/output system of Hadoop. We will delve into many intricate details in subsequent articles. It is not very difficult to understand the Hadoop input/output system if one has a basic understanding of I/O systems in general. Hadoop simply put some extra juice to it to keep up with its distributed nature that works in massive scale of data. That's all.
White, Tom. Hadoop, The Definitive Guide, 2009. O'Reilly Publications.