Hadoop – Stand out from the crowd!

We all know the history and evolution of Hadoop. Now I will try to explain some key features that make Hadoop stand out from the crowd.


Let me talk about Hadoop's scalability first. Hadoop is linearly scalable. Let me give an example to explain what I mean by that.

Let's say I have two cars, one Black and one Red. Both give a mileage of 15 kilometers per liter of octane, but I want 30 kilometers from the same fuel. To achieve that, I exactly doubled the configuration of both cars. The Red one reached 30 km, but the Black one did not; it did not even reach 20 km 😦
In this example, the Black car can be compared with an RDBMS and the Red one with Hadoop: doubling Hadoop's resources roughly doubles what it can do.
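
To put a number on the car example, here is a small sketch in Python. It assumes the Red car reaches the full 30 km after doubling, and it uses 18 km as a made-up figure for the Black car (the example only says it stayed below 20 km). The function name scaling_efficiency is my own and not part of any Hadoop API.

```python
def scaling_efficiency(baseline, scaled, resource_factor):
    """Return the achieved speed-up divided by the resource increase.

    A value of 1.0 means perfectly linear scaling; anything lower means
    part of the added resources was wasted.
    """
    speedup = scaled / baseline
    return speedup / resource_factor


# Red car (Hadoop in the analogy): 15 km -> 30 km after doubling the configuration.
red = scaling_efficiency(baseline=15, scaled=30, resource_factor=2)

# Black car (RDBMS in the analogy): 15 km -> 18 km, a hypothetical value below 20 km.
black = scaling_efficiency(baseline=15, scaled=18, resource_factor=2)

print(f"Red car efficiency:   {red:.2f}")
print(f"Black car efficiency: {black:.2f}")
```

An efficiency of 1.0 is what "linearly scalable" means here: twice the resources, twice the result.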


Hadoop uses a distributed file system, which spreads the data and the work across many machines. So rather than relying on a single resource, I am adding multiple resources and distributing the work among them. A question might pop up here: is that not a problem for the budget? More resources mean higher costs, which could be a burden to my client.

The nodes in a Hadoop cluster are made up of commodity hardware.

What does Commodity Hardware mean?

Commodity hardware is a term for affordable devices that are generally compatible with other such devices. In a process called commodity computing or commodity cluster computing, these devices are often networked to provide more processing power when those who own them cannot afford to purchase more elaborate supercomputers, or want to maximize savings in IT design.


Compared with enterprise hardware, commodity hardware can be around 90% cheaper 🙂






Can I trust Hadoop to store lots of confidential and critical data on cheaper hardware?

The answer is YES!!! You can rely on Hadoop because its architecture takes care of automatic failover of your nodes. Let's say I have split my work among 10 people and one of them falls sick. What should I do then? I will ask someone else to do that task.

Should I assign it to just anyone?

No, I need to identify who has the lightest workload to handle the additional task.

The Hadoop architecture does the same thing: when a node fails, its work is handed to other healthy nodes in the cluster.
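
The "give it to whoever has the least work" idea can be sketched in a few lines of Python. This only illustrates the analogy above; it is not Hadoop's actual scheduler. The worker names, the task names and the function reassign_failed_tasks are all made up for the example.

```python
def reassign_failed_tasks(workloads, failed):
    """Hand each task of the failed worker to the least-loaded healthy worker.

    workloads maps a worker name to its list of tasks. A new mapping
    without the failed worker is returned; the input is left untouched.
    """
    remaining = {
        name: list(tasks) for name, tasks in workloads.items() if name != failed
    }
    if not remaining:
        raise ValueError("no healthy workers left to take over the tasks")

    for task in workloads.get(failed, []):
        # Pick the worker with the fewest tasks at this moment.
        least_loaded = min(remaining, key=lambda name: len(remaining[name]))
        remaining[least_loaded].append(task)
    return remaining


workloads = {
    "worker-1": ["t1", "t2", "t3"],
    "worker-2": ["t4"],
    "worker-3": ["t5", "t6"],
}

# worker-3 "falls sick": its two tasks go to worker-2, which has the least to do.
after = reassign_failed_tasks(workloads, failed="worker-3")
print(after)
```

The load is re-checked after every single task, so one idle worker does not end up with all the extra work when there are several candidates.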


There are many reasons why Hadoop is flexible. Let me give two examples:
Firstly, as I said in the definition, Hadoop is a framework written in Java. As we all know, Java is powerful and portable across operating systems, so Hadoop is also portable across operating systems.

Secondly, although Hadoop is written in Java, your programs do not have to be written in Java. You can write them in Python, C, C++ or whatever programming language you like. So Hadoop gives you a lot of flexibility in how you write your programs.
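
One common way to use a non-Java language is Hadoop Streaming, where the mapper and the reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. Below is a minimal word-count sketch in that style. The function names map_lines and reduce_lines are my own, and the call to sorted() only stands in for the sort that the framework performs between the map and reduce steps.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share the same word.

    The input must already be sorted by key, as it would be when the
    reducer runs after the framework's sort step.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation of map -> sort -> reduce on two sample lines.
sample = ["Hadoop is fast", "hadoop is distributed"]
counts = list(reduce_lines(sorted(map_lines(sample))))
for line in counts:
    print(line)
```

In a real streaming job the two functions would live in separate scripts that read from sys.stdin; here they are kept together so the whole flow can be tried on a laptop.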


Distributed And Fast:

And finally, let's talk about the distributed behavior of Hadoop. Distributing work among different systems is the main feature of Hadoop and the reason it became prominent. If you're dealing with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just minutes, and petabytes in hours.

So now we know the significant features of Hadoop.

Happy learning 🙂


History and Evolution of Hadoop

Let's talk about the evolution of Hadoop. Doug Cutting is the creator of Hadoop (at Yahoo!) and Chief Architect of Cloudera.
During 2002-2004, Doug Cutting was working on the Apache projects Lucene and Nutch, which together formed a distributed search engine that was supposed to index 1 billion pages. Lucene is a search indexer and Nutch is a spider or crawler.

What does that mean? What are the basic parts of a general search engine?

A search engine basically consists of three things:

  • A spider or crawler : downloads pages from the web by following links, so that they can be searched later.
  • Indexer : builds an index of the downloaded pages, so that the pages containing a given word can be found quickly.
  • Mapper : maps a search query to the matching indexed content and shows it on the screen.
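
To make these parts a bit more concrete, here is a toy sketch in Python. The crawler is replaced by a small dictionary of already "downloaded" pages with made-up example.org URLs, build_index plays the role of the indexer, and search maps a query to the matching pages. This is only an illustration and has nothing to do with how Lucene or Nutch are really implemented.

```python
from collections import defaultdict


def build_index(pages):
    """Indexer: map every word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs of the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Stand-in for the crawler's output: URL -> page text (hypothetical data).
pages = {
    "http://example.org/a": "Hadoop is a distributed framework",
    "http://example.org/b": "Lucene is a search indexer",
    "http://example.org/c": "Nutch is a distributed crawler",
}

index = build_index(pages)
print(search(index, "distributed"))
print(search(index, "distributed crawler"))
```

With a billion pages, neither the downloaded pages nor the index fit on one machine, which is exactly the scaling problem the rest of this story is about.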

In December 2004, Google Labs published a paper on the MapReduce (also called MR) algorithm. Doug Cutting had found that the project he was working on was not scaling as expected, so he decided to use the concepts of MR, together with the Nutch distributed file system, to make Nutch scale.

In 2006, Doug Cutting joined Yahoo!, and Yahoo! provided a dedicated team to work on a project called Hadoop!

Check out the story behind the name :-).

Image: Doug Cutting and Hadoop the elephant. Source: Doug Cutting

During 2006 – 2008, Hadoop was born out of Nutch as a large-scale distributed computing platform that could scale out to a large number of machines.

By the end of 2008, Yahoo! declared that it had a 910-node cluster, which was able to sort one terabyte of data within 3.5 minutes. Previously, that work took at least a day.
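
A quick back-of-the-envelope calculation shows where that speed comes from. The sketch below assumes a decimal terabyte (10^12 bytes) and an even split of the work over all 910 nodes; both are my simplifications, not figures reported by Yahoo!.

```python
TERABYTE = 10**12  # decimal terabyte in bytes (assumption)

data_bytes = 1 * TERABYTE
elapsed_seconds = 3.5 * 60  # 3.5 minutes
nodes = 910

# Throughput of the whole cluster and the share handled by a single node.
aggregate_bytes_per_second = data_bytes / elapsed_seconds
per_node_bytes_per_second = aggregate_bytes_per_second / nodes

print(f"Cluster:  {aggregate_bytes_per_second / 1e9:.2f} GB/s")
print(f"Per node: {per_node_bytes_per_second / 1e6:.2f} MB/s")
```

The per-node rate is modest, well within reach of commodity hardware. The overall speed comes from hundreds of such nodes working in parallel.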

So we can say that Hadoop became prominent by the year 2008!!!

Data Scientist and Data Engineer in the ideal world!

Though it is too early to draw a firm line between the two roles and their responsibilities, it is still nice to have a little understanding of them. Most importantly, both of these roles are important in a well-functioning data science world!

Somehow this is something many people get confused about. In the ideal world, Data Scientists are generally people who understand various statistical models and can work out how a problem can be solved using the available data. Data Engineers, on the other hand, are the people who implement the ideas of the Data Scientist by creating the technical architecture that realizes those solutions.

So it should now be clear that the skills required for a Data Scientist are strong mathematical knowledge and a very good understanding of statistical modeling, along with problem-solving capabilities. Some programming skill is also required to be eligible for this position.

In contrast, the skills expected from a Data Engineer are strong technical knowledge, programming skills and the ability to formulate technical solutions. A little statistical knowledge comes in handy. In the real world there is a lot of overlap between the two roles. But what must be understood is that you do not grow from a Data Engineer into a Data Scientist, nor are Data Scientists more important. Data Scientists and Data Engineers have different roles, responsibilities and skill sets. So learning Hadoop or any similar tools and technologies doesn't mean that you will become a Data Scientist; good exposure to mathematical skills and knowledge is the bigger strength for that.

101.datascience.community presented the difference using an excellent Venn diagram.


So when choosing a career or hiring someone for these roles, please choose wisely and understand that they are different roles with different responsibilities.