If you are here, I assume you are already familiar with the four V's of Big Data. You know the ABCs of Big Data, and now your hands are itching to get working on the "Hadoop ecosystem" — that is the term you will hear for all the small and big projects around Hadoop, like Hive, Spark, and Storm. We will go into detail about this topic later.
So, jumping to the very first question that popped up when I started learning Hadoop, and which still comes up for discussion whenever I start a new project or proof of concept: "WHICH one do I learn or use?"
The Hadoop distributions that are most popular in the industry currently are Cloudera, MapR, and Hortonworks.
You can always install the Apache version of Hadoop and add each component you require or want to learn individually. Personally, I like that approach, as it gives me the freedom to manipulate the configuration properties of not only Hadoop but each component, and to learn them in depth. It does, however, have its limitations:
1) The installation needs an admin who is comfortable with a command-line OS (Ubuntu or Mac), as there is no GUI to control the ecosystem, unlike the commercial distributions.
2) It is more suitable for learning purposes.
3) The versions of all the projects and their compatibility need to be kept in check. The same goes for any upgrades: automatic upgrades can easily upset the existing configurations.
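To give a flavor of what "manipulating the configuration properties" means on a plain Apache install: the core settings live in XML files you edit by hand. A minimal sketch of `etc/hadoop/core-site.xml` follows — the property name `fs.defaultFS` is a real Hadoop setting, but the host and port are placeholder values you would replace with your own:

```xml
<!-- etc/hadoop/core-site.xml: tells clients where the NameNode lives.
     "localhost:9000" is an example value for a single-node setup;
     substitute your own host and port. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

Every Hadoop component (HDFS, YARN, Hive, and so on) has its own set of such files, which is exactly why hand-managing versions and upgrades across them takes discipline.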
Each of these distributions has its pros and cons, so you choose one based on purpose and ease of use.
1) Cloudera is the most popular distribution, so you will easily find it everywhere.
2) Cloudera Manager allows the user to control and monitor everything from one place.
3) Cloudera on the cloud has opened new avenues with cloud platforms like Amazon EC2.
4) The courses and certifications from Cloudera are readily available and widely accepted, although they are a bit pricey and you have to keep renewing your certifications. That said, the recent change to a practical exam is good: it tests your overall knowledge of the Hadoop ecosystem rather than just theory questions.
5) The installation is easy, as most of the configuration is done by the installer.
1) Similar to Cloudera, MapR is also an established distribution.
2) MapR ships with all default configurations set up; most of them do not even show up in the configuration XML files.
3) MapR has its own file system and DB, and also allows Hadoop to be accessed over NFS, which is a desirable feature for existing application systems.
4) The certification material is readily available, and the certification is not too costly.
5) However, whenever there is a new development in any interacting tool, the MapR plugin is generally the last of the three to arrive.
6) It allows developers and admins to control and maintain the nodes using both the graphical interface (MCS) and the command line.
1) Hortonworks occupies a bigger portion of the market than MapR, but smaller than Cloudera.
2) Hortonworks has a really nice GUI.
3) The enterprise-ready HDP makes the Hortonworks package quite attractive to developers and Hadoop administrators alike.
4) The certifications from Hortonworks are all hands-on, which gives them quite a bit of credibility.
Apart from these considerations, when the discussion is about a project, the selection often comes down to which Apache projects the company will need, and that becomes one of the deciding factors. Another important factor in deciding which distribution you will use is the set of existing products that are going to be plugged into the Hadoop ecosystem you are planning to develop. Not every company's products have plugins ready for all three of the above platforms.
At the end of the day, it all depends on you and your projects, and the choice changes according to your answers.
If you think you can add to the topic, feel free to comment below.