Apache Hadoop: A Brief Overview
Oct 10, 2023, by Dexlock
Apache Hadoop is an open-source framework and was among the first developed for distributed storage and distributed computation. It is used to store and analyze massive datasets, with sizes ranging from gigabytes to petabytes. Hadoop clusters multiple computers to examine big datasets in parallel, more quickly than a single powerful computer could store and process the data. It is designed to scale from a single server to thousands of machines, each providing local computation and storage. Rather than relying on hardware to deliver high availability, the library is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
Hadoop comprises five main modules:
- Hadoop Common: Provides the shared Java libraries and utilities used by the other modules.
- Hadoop Distributed File System (HDFS): A distributed file system that runs on standard or inexpensive (commodity) hardware. Compared to conventional file systems, HDFS offers higher throughput, superior fault tolerance, and native support for very large datasets.
- Yet Another Resource Negotiator (YARN): Manages cluster nodes, tracks resource utilization, and schedules jobs and tasks.
- Hadoop MapReduce: A framework that lets programs process data in parallel. The map task transforms input data into intermediate key-value pairs; reduce tasks then aggregate the map output to produce the desired result.
- Hadoop Ozone: A scalable object store for Hadoop, first released in 2020.
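The map/shuffle/reduce flow described in the MapReduce module above can be sketched in plain Python. This is a toy word-count illustration of the programming model, not the Hadoop API; the function names and the sample input are made up for the example:

```python
from collections import defaultdict

def map_task(line):
    """Map: turn a line of input into (word, 1) key-value pairs."""
    return [(word, 1) for word in line.split()]

def shuffle(mapped_pairs):
    """Shuffle/sort: group all values by key across the map outputs."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce: combine the grouped values into one result per key."""
    return key, sum(values)

lines = ["big data big cluster", "big data"]
mapped = [pair for line in lines for pair in map_task(line)]
counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In real Hadoop, each map task runs on the node holding its input split, and the shuffle moves intermediate pairs across the network to the reducers.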
Working of Hadoop
The Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for processing make up Apache Hadoop’s fundamental components. Hadoop divides files into large blocks and distributes them among cluster nodes. It then ships packaged code to the nodes so they can process the data concurrently. This approach exploits data locality: each node processes the data it already stores. As a result, the dataset can be handled more quickly and efficiently than in a traditional supercomputer architecture that relies on a parallel file system and moves computation and data over a high-speed network.
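The splitting and distribution described above can be illustrated with a minimal sketch. The block size and node names here are made-up toy values (real HDFS defaults to 128 MB blocks), and the round-robin placement is a stand-in for HDFS's actual placement policy:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file's bytes into fixed-size blocks, HDFS-style."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks, nodes):
    """Scatter blocks across cluster nodes round-robin (toy placement policy)."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

file_bytes = b"x" * 1000          # a pretend 1000-byte file
blocks = split_into_blocks(file_bytes, block_size=256)
print([len(b) for b in blocks])   # [256, 256, 256, 232]
placement = assign_blocks(blocks, nodes=["node1", "node2", "node3"])
print({n: len(bs) for n, bs in placement.items()})
```

Because each node holds some of the blocks locally, map tasks can be scheduled where the data already lives instead of pulling it over the network.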
Hadoop’s building blocks allow different services and applications to be constructed that gather data in various formats and add it to the Hadoop cluster by connecting to the NameNode through an API call. The NameNode tracks the file directory structure and the placement of each file’s blocks (“chunks”), which are replicated across DataNodes. To query the data, a client submits a MapReduce job made up of numerous map and reduce tasks that execute against the HDFS data distributed across the DataNodes. Each node runs a map task against its assigned input files, and reducers then run to aggregate and arrange the output.
The Hadoop ecosystem includes a variety of tools and applications for collecting, storing, processing, analyzing, and managing big data, such as Apache Spark, Presto, Apache Hive, and Apache HBase.
Benefits of using Hadoop
- Data processing and computing power: Hadoop can store and process large amounts of data quickly. Its distributed computing model processes big data in parallel, and adding compute nodes increases processing power.
- Fault Tolerance: Applications and data processing are protected against hardware failure. If a node goes down, its jobs are promptly redirected to other nodes so that distributed computing does not fail, and data is automatically stored in multiple copies.
- Flexibility of data: Unlike conventional relational databases, no preprocessing is necessary before storing the data. Virtually unlimited amounts of data can be saved, including unstructured data such as text, images, and videos.
- Low cost and high scalability: Being an open-source framework, Hadoop is free and runs on inexpensive hardware. It also requires only minimal management, and you can quickly expand the system to process more data by adding nodes.
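The fault-tolerance benefit above rests on block replication. A toy sketch of the idea follows; the placement rule and node names are invented for illustration (real HDFS defaults to a replication factor of 3 and uses a rack-aware placement policy):

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy policy;
    real HDFS chooses replica locations with rack awareness)."""
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def after_node_failure(placement, failed_node):
    """A block stays readable as long as a replica survives on a healthy node."""
    return {blk: [n for n in replicas if n != failed_node]
            for blk, replicas in placement.items()}

nodes = ["n1", "n2", "n3", "n4"]
placement = place_replicas(["blk_1", "blk_2"], nodes)
print(placement)  # {'blk_1': ['n1', 'n2', 'n3'], 'blk_2': ['n2', 'n3', 'n4']}

# Kill n2: every block still has live replicas, so no data is lost.
surviving = after_node_failure(placement, failed_node="n2")
print(surviving)  # {'blk_1': ['n1', 'n3'], 'blk_2': ['n3', 'n4']}
```

In a real cluster, the NameNode would also re-replicate the under-replicated blocks onto healthy DataNodes to restore the target replication factor.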
Challenges of Hadoop
Not every problem is a good fit for MapReduce programming. It works well for straightforward information requests and problems that can be broken into smaller, independent units, but it is inefficient for interactive and iterative analytical workloads. Because nodes can only communicate with one another through the shuffle-and-sort stage, iterative algorithms require numerous map-shuffle/sort-reduce phases to finish. This produces many intermediate files between MapReduce phases and is wasteful for advanced analytical computing.
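The cost of iteration can be sketched as follows: each loop of the algorithm is a full map-shuffle-reduce pass whose output must be fully materialized before the next pass can begin. The halving job below is a made-up stand-in for a real iterative algorithm such as PageRank looping to convergence:

```python
from collections import defaultdict

def run_round(records, map_fn, reduce_fn):
    """One full map-shuffle-reduce pass; in Hadoop its output would be
    written out to HDFS before the next round starts."""
    groups = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

# Toy iterative job: repeatedly halve values until all fall below 10.
records = [("a", 80), ("b", 24)]
rounds = 0
while any(v >= 10 for _, v in records):
    records = run_round(
        records,
        map_fn=lambda rec: [(rec[0], rec[1] / 2)],
        reduce_fn=lambda k, vs: (k, sum(vs)),
    )
    rounds += 1

print(rounds, records)  # 4 [('a', 5.0), ('b', 1.5)]
```

Four separate passes (and, in real Hadoop, four rounds of intermediate files on HDFS) are needed for one small computation, which is why in-memory engines such as Apache Spark handle iterative workloads far more efficiently.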
Hadoop administration is part art and part science, requiring a basic understanding of hardware, operating systems, and Hadoop kernel settings. Although new tools and technologies are emerging, fragmented data security remains a challenge; the Kerberos authentication protocol is a significant step toward securing Hadoop environments. Hadoop also lacks user-friendly, feature-rich tools for metadata, governance, and data management.
The Hadoop modules are built on the fundamental assumption that hardware failures occur frequently and should be handled automatically by the framework. Hadoop was originally envisioned for clusters built from inexpensive hardware, and that remains its most common deployment. Have a project in mind that includes the Hadoop framework? Get in touch with us here.
Disclaimer: The opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Dexlock.