An RDBMS is not designed to handle huge volumes of data of different kinds; it is designed to handle structured data. The cost and
complexity involved in scaling Hadoop are far lower than for an RDBMS. Moreover, the scaling and parallelism that Hadoop can
achieve are nearly impossible for an RDBMS. There are many limitations in an
RDBMS that restrict its use for Big Data.
Why Hadoop, and Why Can’t We Stick to Traditional Data Processing Solutions and RDBMS?
A detailed answer to this question is given in this post. This
article will focus on:
- What limitations restrict a traditional RDBMS from being used to process Big Data.
- How the new Big Data solutions overcome the problems of traditional DBs and what features they add.
In this article you will get to know how exactly
the traditional solutions differ from Big Data solutions and why they cannot
be chosen over Hadoop for Big Data processing.
Without further ado I will start with a comparative analysis
between traditional DBs and Hadoop.
1. Better throughput in case of huge data volumes
Hadoop processes data faster when it comes to huge data volumes. Based on the way an RDBMS and Hadoop treat data, there are some striking differences between them.
Suppose there is a table of 1 terabyte in size and you run a “select *” on it. How differently will it be processed in Hadoop and in an RDBMS?
In an RDBMS, all the data blocks of the table required for the processing are moved to the application server and then the logic is applied to the data. In Hadoop, it is not the data but the code that is sent to the nodes where the processing needs to happen, thereby saving the time spent on data movement.
If you have a table of 10 GB (no index), then a “select * from emp” will cause 10 GB of data to move from storage to the application server. In Hadoop, the code, which may be 100 KB or less, moves to all the nodes, because the data is already distributed among them. This is the most interesting and most important reason for Hadoop’s better throughput: moving terabytes or petabytes of data across the network can itself take hours. A typical RDBMS solution used in an enterprise is shown below:
Basic diagram of a 3-tier DB architecture
An RDBMS deployment has application servers and storage servers, with the client connected through high-speed network lines, and the data moves
from storage to the application server.
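To make the contrast concrete, here is a minimal sketch of the kind of code Hadoop ships to the nodes for a query like the “select * from emp” above, with a simple filter added (the file layout, column positions and class name are hypothetical). The compiled job jar is tiny compared to the table, and each node applies the logic to the blocks it already stores locally, so the bulk of the data never crosses the network.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper equivalent of "select * from emp where dept = 'SALES'".
// Hadoop ships this small compiled class to every node that holds a block of
// the emp file; the filtering runs where the data already lives.
public class SelectSalesMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        // Assume a comma-separated row whose third column is the department.
        String[] cols = row.toString().split(",");
        if (cols.length > 2 && "SALES".equals(cols[2].trim())) {
            context.write(offset, row); // emit only the matching rows
        }
    }
}
```

A map-only job (zero reducers) is enough for this kind of scan, so each node simply writes out the rows that matched from its local blocks.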
2. Scalability
One of the most
important features of a Hadoop environment is the ability to dynamically and easily expand the number of servers used for data storage, and the computing power that each added server brings is commendable. Adding a worker is largely a matter of configuration: the new node’s core-site.xml points it at the NameNode (see the sketch below), it is optionally listed in the cluster’s workers file, and its DataNode daemon is started. Traditional DBs are also scalable, but the problem lies in the way they scale: vertical scaling makes an RDBMS costly for processing large batches of data.
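As a rough sketch (the host name and port are placeholders, and the exact steps vary with the Hadoop version), the relevant part of the new worker’s core-site.xml is just the address of the existing HDFS NameNode:

```xml
<!-- core-site.xml on the newly added worker node (values are placeholders) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```

Once the DataNode daemon on the new machine starts, it registers with the NameNode and begins serving blocks; no data re-loading or cluster downtime is required.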
Traditional DBs work perfectly fine when you have a small table with around 1–10 million rows. But
when you grow to 500 million rows or petabytes of data, it becomes difficult to
process.
They don’t
scale well to very large volumes. Grid solutions or sharding can help with this
problem, but the growth in the amount of data in an RDBMS brings a lot of limitations
that compromise performance.
Some of the
limitations that huge data volumes bring to an RDBMS are:
Scaling does not increase performance linearly. An RDBMS can be scaled: processors, application
servers and storage can be added, but the performance does not improve
linearly with that scaling, as shown in the diagram below:
Performance vs. Scalability graph for Hadoop and RDBMS
The graph shows performance vs. scalability for
Hadoop and an RDBMS. For Hadoop the curve is almost linear, but for an RDBMS the
performance improvement is nonlinear as more servers are added. There are several
reasons for this nonlinear improvement. An RDBMS
follows ACID, while Hadoop follows the BASE model, so it focuses on eventual
consistency rather than two-phase commits. The predefined schema also hampers linear
performance scalability in an RDBMS: without a flexible
schema, any change to the schema becomes costly.
Cost
Scaling up is very costly. If an RDBMS is used to handle and store “big data,” it
will eventually turn out to be very expensive. With an increase in demand, relational databases tend to scale up
vertically, which means adding extra horsepower to the system to enable
faster operations on the same dataset.
On the contrary, Hadoop and
NoSQL databases like HBase, Couchbase and MongoDB scale horizontally with
the addition of extra nodes (commodity servers) to the resource pool,
so that the load can be distributed easily.
The cost of
storing large amounts of data in a relational database becomes very high once the
data crosses a certain range, while the cost of storing data in a Hadoop solution
grows linearly with the volume of data and there is no ultimate limit.
Hadoop provides the power of a database with
flexibility in storage
You can read any type of file and apply any kind of
processing mechanism: either treat it as a DB table and process it through SQL, or
treat it as a file and use any processing engine like Spark or MapReduce, and do
any kind of analysis on it, be it predictive, sentiment, regression or
real-time processing using the lambda architecture.
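As an illustrative sketch (the file paths and column names are made up), the same raw file sitting in HDFS can be queried as a table with Spark SQL or processed as a plain dataset with code, without first loading it into a database:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FlexibleProcessing {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("FlexibleProcessing")
                .getOrCreate();

        // Read a raw CSV file straight from HDFS; no schema has to be declared up front.
        Dataset<Row> emp = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs:///data/emp.csv"); // hypothetical path

        // Treat the data as a table and analyse it with SQL ...
        emp.createOrReplaceTempView("emp");
        spark.sql("SELECT dept, COUNT(*) AS cnt FROM emp GROUP BY dept").show();

        // ... or treat it as a plain dataset and process it with code.
        emp.filter("salary > 50000")
           .write()
           .parquet("hdfs:///output/high_earners"); // hypothetical path

        spark.stop();
    }
}
```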
“Once you
put your data in an RDBMS it becomes like a black box; you can access it only
through SQL queries. Hadoop gives a sense of openness: it provides
tremendous flexibility in saving and processing data the way we want.”
Traditional relational database systems also have limits on the number and length of fields. Netezza, a well-known solution
for handling and processing huge amounts of data, has a limit
of 1600 columns per table, whereas Hadoop has no such limitation.
It is also difficult to implement
certain kinds of use cases, such as finding the shortest path between two points, sentiment
analysis or predictive analysis, using SQL and free open-source tools on top
of a relational database.
--by Sanjeev Krishna