Hadoop Vs Traditional Data Processing Solutions

Why Hadoop, and why can't we stick with traditional data processing solutions and RDBMS?


An RDBMS is not designed to handle huge volumes of data of different kinds; it is designed to handle structured data. The cost and complexity involved in scaling Hadoop are far lower than for an RDBMS. Moreover, the scale and parallelism Hadoop can achieve are nearly impossible for an RDBMS. There are many limitations in an RDBMS that restrict its use for Big Data.


A detailed explanation of this question is given in this post. The article will focus on:
  • What limitations restrict a traditional RDBMS from being used to process Big Data.
  • How the new Big Data solutions overcome the problems of traditional databases and what features they add.

In this article you will see how exactly traditional solutions differ from Big Data solutions and why they cannot be chosen over Hadoop for Big Data processing.
Without further ado, let us start with a comparative analysis of traditional databases and Hadoop.

1. Better throughput for huge data volumes

Hadoop processes data faster when it comes to huge data volumes. There are some striking differences in the way an RDBMS and Hadoop treat data.

Suppose there is a table 1 terabyte in size and you run a "select *" on it. How differently will it be processed in Hadoop and in an RDBMS?

In an RDBMS, all the data blocks of the table required for processing are moved to the application server and the logic is then applied to the data. In Hadoop, it is not the data but the code that is sent to the nodes where the data resides, saving the time spent on data movement.
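To make the code-to-data idea concrete, here is a minimal PySpark sketch (PySpark is just one engine from the Hadoop ecosystem; the HDFS path and column names below are made up for illustration). The filter and aggregation logic is shipped to the executors running on the nodes that hold the data blocks, and only the small aggregated result travels back:

    # A minimal sketch, assuming a Hadoop cluster with PySpark available.
    # The HDFS path and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("code-to-data-example").getOrCreate()

    # The file lives in HDFS, already split into blocks across the DataNodes.
    emp = spark.read.option("header", True).csv("hdfs:///data/emp.csv")

    # The filter and aggregation below are executed on the nodes that store
    # the blocks; only the small aggregated result is brought back.
    result = emp.filter(emp["salary"].cast("int") > 50000).groupBy("dept").count()
    result.show()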

If you have a 10 GB table (with no index), then "select * from emp" will cause 10 GB of data to move to the database server. In Hadoop, the code, which may be 100 KB or less, moves to all the nodes, and the data is already distributed among them. This is the most interesting and most important reason for Hadoop's better throughput: moving terabytes or petabytes of data across the network can itself take hours to complete. A typical enterprise RDBMS solution is shown below:


Basic diagram of a 3-tier database architecture

An RDBMS deployment has application servers and storage servers, with the client connected through high-speed network lines, and the data moves from storage to the application server.

Another difference in how they treat data is that an RDBMS follows the ACID properties while Hadoop follows the BASE model, settling for eventual consistency in contrast to the two-phase commit of an RDBMS.

2. Scalability
One of the most important features of the Hadoop environment is the ability to dynamically and easily expand the number of servers used for data storage, and the computing power those extra servers bring is remarkable. You just need to update the cluster configuration (such as core-site.xml) so that the NameNode knows there is a new member in the cluster. Traditional databases are also scalable, but the problem lies in the way they scale: vertical scaling makes an RDBMS costly for processing large batches of data.
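As a rough illustration of that configuration step, the snippet below shows what the core-site.xml on a new worker node might contain so that it can find the existing NameNode; the hostname and port are placeholders, not taken from any real cluster:

    <!-- core-site.xml on the new worker node: point it at the existing NameNode.
         The hostname and port below are illustrative placeholders. -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>

Once the new node's daemons are started and it registers with the NameNode, new blocks can be placed on it, and existing data can be spread onto it with the HDFS balancer, all without touching the machines already in the cluster.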

Traditional databases work perfectly fine when you have a small table with around 1-10 million rows, but when you grow to hundreds of millions of rows or petabytes of data, processing becomes difficult.

They do not scale well to very large data volumes; grid solutions or sharding can help with this problem, but the growth in data volume brings a number of limitations to an RDBMS that compromise performance.

Some of the limitations that huge data volumes bring to an RDBMS are:
Scaling does not increase performance linearly. Processors, application servers and storage can all be added to an RDBMS, but performance does not improve linearly with them, as shown in the diagram below:


Performance vs. scalability graph for Hadoop and RDBMS

The graph shows performance versus scalability for Hadoop and an RDBMS. For Hadoop the curve is almost linear, but for an RDBMS the performance improvement flattens out as more servers are added. There are several reasons for this nonlinear improvement with scaling in an RDBMS. An RDBMS follows ACID while Hadoop follows the BASE model, focusing on eventual consistency rather than two-phase commit. The predefined schema also hampers linear scalability: without a flexible schema, any schema change becomes increasingly costly as the data grows.


3. Cost

Scaling up is very costly. If an RDBMS is used to handle and store big data, it will eventually turn out to be very expensive. With increasing demand, relational databases tend to scale vertically, which means adding extra horsepower to the same system to enable faster operations on the same dataset.
In contrast, Hadoop and NoSQL databases like HBase, Couchbase and MongoDB scale horizontally with the addition of extra nodes (commodity servers) to the resource pool, so that the load can be distributed easily.
The cost of storing large amounts of data in a relational database rises steeply once the data crosses a certain size, while the cost of storing data in a Hadoop solution grows roughly linearly with the volume of data and there is no hard upper limit.
4. Hadoop provides the power of a database with flexibility in storage

You can read any type of file and apply any kind of processing: either treat it as a database table and query it through SQL, or treat it as a file and use any processing engine such as Spark or MapReduce, and run any kind of analysis on it, be it predictive, sentiment, regression or real-time processing using a lambda architecture, as sketched below.
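Here is a rough sketch of that flexibility (the HDFS path, column names and view name are assumptions for illustration, not part of the original example): the same file is queried through SQL as if it were a table, and then processed directly with engine code.

    # A minimal sketch of the "same data, two processing styles" idea.
    # The HDFS path, column names and view name are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("flexible-storage-example").getOrCreate()

    # Treat the raw JSON file in HDFS as a table and query it with SQL ...
    reviews = spark.read.json("hdfs:///data/reviews.json")
    reviews.createOrReplaceTempView("reviews")
    spark.sql("SELECT product, AVG(rating) AS avg_rating "
              "FROM reviews GROUP BY product").show()

    # ... or treat it as plain records and process it with engine code instead.
    review_lengths = reviews.filter(reviews["text"].isNotNull()) \
                            .rdd.map(lambda row: (row["product"], len(row["text"])))
    print(review_lengths.take(5))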

"Once you put your data in an RDBMS it becomes like a black box: you can access it only through SQL queries. Hadoop gives a sense of openness; it provides tremendous flexibility in storing and processing data the way we want."


Traditional relational database systems also have limits on field lengths and column counts. Netezza, a well-known solution for handling and processing huge amounts of data, has a limit of 1600 columns per table; in Hadoop there is no such limitation.

It is also difficult to implement certain kinds of use cases, such as finding the shortest path between two points, sentiment analysis or predictive analysis, using SQL and free open-source tools on top of a relational database.
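To see why something like shortest path is awkward in plain SQL but straightforward in general-purpose code, here is a tiny breadth-first-search sketch in Python over a made-up edge list; on a real cluster the edges would be read from files in HDFS and processed by an engine such as Spark, but the algorithmic idea is the same:

    # A minimal sketch: breadth-first search for the shortest path between two
    # nodes of an unweighted graph. The edge list is a toy example; on Hadoop
    # the edges would typically come from HDFS and run on a processing engine.
    from collections import deque

    edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")]

    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, []).append(src)   # treat edges as undirected

    def shortest_path(start, goal):
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for neighbour in graph.get(path[-1], []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(path + [neighbour])
        return None

    print(shortest_path("a", "e"))   # -> ['a', 'b', 'c', 'e']

Expressing this kind of iterative traversal in pure SQL requires recursive queries that quickly become unwieldy, whereas a processing engine lets you write the algorithm directly against the raw data.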


--by Sanjeev Krishna
