When we talk about Hadoop, we talk about Big Data, which is commonly described by the 4 V's:
- Volume, Velocity, Variety and Veracity.
Types of data that Hadoop handles:
- Structured, semi-structured and unstructured.
Hadoop is defined by the following features:
- An open-source framework for processing large data sets.
- Built for commodity hardware, removing the need for specialized hardware.
- High fault tolerance.
- Highly scalable.
- A default replication factor of 3.
Components of Hadoop:
- Hadoop Common
- HDFS
- YARN
- MapReduce
Hadoop Common: This module consists of the common utilities and libraries that support the other Hadoop modules.
HDFS (Hadoop Distributed File System): HDFS is a distributed file system with a master/slave architecture. It has two types of nodes: a NameNode and one or more DataNodes. The NameNode stores the metadata and edit log for all the data living on the DataNodes, whereas the DataNodes store the actual data in blocks.
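The master/slave split can be sketched in a few lines of Python. This is an illustrative toy only, not Hadoop's API: the class and variable names are hypothetical, the block size is shrunk for demonstration (real HDFS defaults to 128 MB), and block placement is naive.

```python
# Toy sketch of HDFS's master/slave layout (illustrative only, not Hadoop's API).
# The NameNode keeps only metadata (which DataNodes hold each block);
# the DataNodes hold the actual block contents.

BLOCK_SIZE = 4          # bytes per block (real HDFS defaults to 128 MB)
REPLICATION = 3         # HDFS's default replication factor

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}        # filename -> list of (block_id, [holder names])

    def write(self, filename, data):
        """Split data into blocks and replicate each block across REPLICATION nodes."""
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        self.metadata[filename] = []
        for i, block in enumerate(blocks):
            block_id = f"{filename}_blk_{i}"
            targets = self.datanodes[:REPLICATION]     # naive placement policy
            for dn in targets:
                dn.blocks[block_id] = block
            self.metadata[filename].append((block_id, [dn.name for dn in targets]))

    def read(self, filename):
        """Reassemble a file from any one replica of each block."""
        out = b""
        for block_id, holders in self.metadata[filename]:
            dn = next(d for d in self.datanodes if d.name == holders[0])
            out += dn.blocks[block_id]
        return out

nodes = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(nodes)
nn.write("demo.txt", b"hello hdfs!")
print(nn.read("demo.txt"))        # b'hello hdfs!'
```

Note how the NameNode never touches file contents: losing a single DataNode is survivable because every block lives on three nodes, which is exactly why the default replication factor of 3 gives Hadoop its fault tolerance.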
YARN (Yet Another Resource Negotiator): YARN is the component that takes over the roles of resource manager and job scheduler. This framework comprises the ResourceManager as well as the NodeManagers.
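A rough intuition for that negotiation can be sketched as follows. This toy model is an assumption for illustration, not YARN's real protocol (which also involves ApplicationMasters, queues and schedulers); all names here are hypothetical.

```python
# Toy sketch of YARN-style resource negotiation (illustrative only).
# The ResourceManager tracks free memory on each NodeManager and places a
# container request on the first node with enough capacity.

class NodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb      # memory still available on this node

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, app_id, memory_mb):
        """Return the name of the node that hosts the container, or None if full."""
        for nm in self.node_managers:
            if nm.free_mb >= memory_mb:
                nm.free_mb -= memory_mb
                return nm.name
        return None

rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 2048)])
print(rm.allocate("app_1", 1024))   # nm1
print(rm.allocate("app_2", 2048))   # nm2 (nm1 has only 1024 MB left)
```

The point of splitting the ResourceManager from the NodeManagers is the same as in HDFS: one master holds the cluster-wide view, while the per-node daemons do the actual work.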
MapReduce: MapReduce is the software framework that handles the parallel processing of large data sets on large clusters over HDFS. MapReduce uses key-value pairs to carry out the processing. The steps involved are Map, Shuffle and Reduce.
(You can read about MapReduce in detail, with examples, on Apache's MapReduce page.)
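The Map, Shuffle and Reduce steps can be mimicked in-process with word count, the canonical MapReduce example. This is only a sketch of the data flow; a real Hadoop job runs these phases in parallel across the cluster, typically written in Java.

```python
# Minimal in-process sketch of the Map -> Shuffle -> Reduce data flow,
# shown with word count (the canonical MapReduce example).
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) key-value pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)   # {'big': 3, 'data': 2, 'cluster': 1}
```

Every stage consumes and produces key-value pairs, which is what lets the framework distribute each phase independently across machines.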
If you think you can add to the topic, feel free to comment below.
Reference: Apache Hadoop