Hello guys today we going to learn about data processing with Hadoop, in this following article we we try to understand about Hadoop in simple language just to get a abstract about this huge topic.
What is Hadoop ?
Hadoop is an open-source software framework for storing data and running applications on collectivity of entity hardware. It provides massive storage for any kind of data, gigantic processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Developed by: Apache Software Foundation
Written in: Java language
Initial release date: 1 April 2006
Why Hadoop is used?
A wide variety of companies and organizations use it for both research and production. Users are encouraged to add themselves to the Hadoop
As it can store a huge amount of data with massive storage. For e.g. Facebook is a very big platform of social media, there are total With almost 2.5 billion monthly active users, What happens to storage if everyone uploads photos? In all this, it comes into action to save all this data with fast and safe.
○The 1st important feature offered by Hadoop is, it is a cost-effective system.
○it does not requires any expensive or specialized hardware, in order to be implemented.
○The next important feature on the list is, Hadoop supports a large cluster of Nodes.
○Therefore a Hadoop Cluster can be made up of 100’s and 1000’s of Nodes.
○One of the main advantage of having a large cluster is, offering More Computing Power and a Huge Storage system to the clients.
○The next important feature on the list is, it also supports Parallel Processing of Data, therefore the data can be processed simultaneously across all the nodes within the cluster, and thus saving a lot of time.
○The next important feature offered by Hadoop is Distributed Data. The Hadoop Framework takes care of splitting and distributing the data across all the nodes within a cluster. It also replicates the data, over the entire cluster.
○The next important feature on the list is Automatic Failover Management. In case if any of the nodes, within the cluster fails. The Hadoop Framework would replace that particular machine, with another machine, and it replicates all the configuration settings and the data, from the failed machine onto this newly replicated machine.
○Admins may need not have to worry about this, once the Automatic Failover Management has been properly configured on a cluster.
○The next important feature on the list is Data Locality Optimization. It is one of the most important features offered by the Hadoop Framework.
○The next important feature on the list is Heterogeneous Cluster. Even this can be classified as one of the most important features offered by Hadoop Framework.
○We know that a Hadoop Cluster is made up of several nodes.
○Basically Node is a technical term used to refer to a machine within the cluster.
○Let us try to understand, what do I mean by Heterogeneous Cluster.
○A Heterogeneous Cluster basically refers to a cluster, within which each node can be from a different vendor, and each node can be running a different version and flavour of the operating system.
Let us say our cluster is made up of 4 nodes ~~
○From Instance, the 1st node is an IBM machine running on Red Hat Enterprise Linux, the 2nd node is an Intel machine running on Ubuntu, the 3rd node is an AMD machine running on Fedora, and the last node is an HP machine running on CentOS.
○Even we the individual hardware components such as RAM and Hard Drive can be added or removed from a cluster on a fly.
Getting data into Hadoop
○Use Sqoop to import structured data from a relational database to HDFS, Hive and HBase.
○It can also extract data from it and export it to relational databases and data.
○Use linn to continuously load data from logs into Hadoop.
○Load files to the system using simple Java commands.
Author :- Sachin Vishwakarma
spread the world