Agile software development emphasizes continuous delivery of valuable software, simplicity, working software, customer collaboration, responding to change, and welcoming changing requirements. In this article, we shall discuss how Apache Hadoop (one of the top open-source projects)-based software fulfills Agile software development principles. The two main components of Hadoop are MapReduce for data computation and Hadoop Distributed File System (HDFS) for data storage and data processing.
Individuals and Interactions, and Customer Collaboration
Individuals and customers are involved in every aspect of the design of a Hadoop cluster, such as:
- Designing cluster size and capacity, including volume of data, and number of nodes
- Choosing commodity hardware for node machines, racks, network switches
- Designing network bandwidth
- Choosing the type of workload
- Choosing data compression methodology
- Choosing a data retention policy
Early and Continuous Delivery of Valuable Software
Apache Hadoop is a highly available, fault-tolerant, distributed framework designed for the continuous delivery of software with negligible downtime. HDFS is designed for fast, concurrent access to multiple clients. HDFS provides parallel streaming access to tens of thousands of clients. Hadoop is a large-scale distributed processing system designed for a cluster consisting of hundreds or thousands of nodes, each with multi-processor CPU cores. Hadoop is designed to distribute and process large quantities of data across the nodes in the cluster. If a machine fails, the failed computation is automatically started on another machine. Hadoop is designed to process web-scale data in the order of hundreds of GB to 100s of TB, to several PB. Hadoop makes use of a distributed file system that breaks up the input data and distributes the data across several machines in a cluster and processes the distributed data in parallel. Processing data in parallel is much faster than processing data sequentially.
HDFS stores data reliably. In a large-scale distributed cluster, failure of some of the nodes is expected and failure of a few nodes does not cause the data to become unavailable. Block replication provides data durability. Each block of a file is replicated at least three times by default. If a single block were to be lost due to node failure, a block replica is used and the block is quickly re-replicated to restore the replication level.
With the HDFS Namenode High Availability feature, the addition of the Active NameNode-Standby NameNode configuration can be used to avoid a single point of failure (SPOF).
Apache Hadoop is designed specifically for large-scale computations of large datasets, such as:
- Market trends data
- Sales data
- Consumer preferences data
- Social media data
- Search engine data
If your use case meets the requirements of distributed, large-scale computations, Hadoop is simple to use. Computation is distributed across a set of machines instead of using a single machine. If your requirements are data analysis that involves small datasets, and a large number of small files, with real-time or low latency tasks, Hadoop is not the framework to use. Hadoop is not a database, nor does Hadoop replace traditional database systems. Hadoop is designed to cover the scenario of storing and processing large-scale data, which most traditional systems do not support or do not support efficiently.
One of the simple design features of the HDFS is data locality. The computation code is moved to where the data is rather than moving the data to the computation. Data locality has a definite advantage as HDFS is designed for data-intensive computation, and it is much more efficient to move the computation to the data; the computation code being relatively small. Having to move fewer data utilizes less network bandwidth and as a result, increases the overall efficiency and throughput of the system.
Hadoop is designed for data coherency with an HDFS that is based on a write-once-read-many model. What that implies is that HDFS does not support changes to the stored data other than appending new data. A MapReduce application or a web search application could benefit from the write-once-read-many model for high throughput data access.
HDFS is designed for:
- High throughput of data rather than low latency tasks
- Batch processing rather than interactive user access
- Large streaming, sequential reads of data rather than random seeks to arbitrary positions in files
Another simplicity-related feature is that HDFS is designed to support unstructured data. Data stored in HDFS does not need to conform to a predefined schema.
HDFS is highly accessible with support for multiple interfaces including Java and C/C++ with MapReduce for C (MR4C)—an open-source framework, HTTP browser, and command-line FS Shell.
You may assume that such a large-scale, distributed, and highly available framework must make use of custom, specially designed hardware, but Hadoop relies on off-the-shelf reliable, commodity hardware with the priority being working software. Large-scale computation requires that the input data be distributed across multiple machines and processed in parallel on different machines. HDFS overcomes data loss with data replication. How does Hadoop manage to store large files? A file is broken into blocks and stored across a cluster with a single file having blocks on multiple nodes (machines). The blocks are replicated. As blocks are distributed and replicated it provides fault tolerance; the failure of a single node does not cause data loss. A single file in the HDFS can be larger than the size of a single disk on the underlying filesystem because a file is broken into blocks. A file’s blocks are stored over multiple disks in the cluster. Replicating and distributing a file over multiple nodes provides high availability of data for the data locality, which was discussed earlier.
Hadoop adapts well to the reliability demands of distributed systems that support large-scale computational models because Hadoop:
- Supports partial failure rather than complete failure.
- Supports data recoverability. If some components fail, their workload is taken up by the functioning machines.
- Supports individual recoverability. A full cluster restart is not needed. If a node fails, it is able to restart and rejoin the cluster, or a replacement node is used.
- Provides consistency with concurrent operations.
As HDFS is designed to be used on commodity hardware, it is portable across heterogeneous hardware platforms. HDFS’s software components are built using Java, a portable language with multi-platform support. HDFS network communication protocols are built over TCP/IP, the most common protocol.
Responding to Changing Requirements
Hadoop is highly scalable and responds to changing requirements quickly. Hadoop is designed for both unstructured and structured data, which makes it suitable for the web as the type of data generated on the web is also both structured and unstructured. The distributed system is able to take an increased load without performance degradation. An increase in the number of machines provides a proportionate increase in capacity. Hadoop scales linearly. A Hadoop cluster could scale from a 12-machine cluster to a cluster with hundreds and thousands of machines without any degradation in performance.
With such an Agile framework as Apache Hadoop, it is no wonder that Hadoop has more committers than any other Apache project.