Free Platform to Discuss and Learn BigData


Introduction to Apache Sqoop

Sqoop is a tool that can transfer bulk data from a relational database to Hadoop and vice versa. For better performance and optimal system utilization it performs parallel data transfer and load balancing among the nodes. It can read/write data from/to Oracle, Teradata, Netezza, MySQL, Postgres, and HSQLDB. While importing data into HDFS it can store the data in different formats, e.g. ORC, Avro, or Parquet.
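As a concrete illustration, a typical Sqoop import might pull a table from MySQL into HDFS as Parquet. The host, database, credentials, table, and paths below are purely hypothetical; the command is sketched as a Python argument list so each flag is easy to see:

```python
# Hypothetical Sqoop import: pull the "emp" table from a MySQL
# database into HDFS as Parquet files, using 4 parallel mappers.
# Host, database, user, and paths are illustrative only.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",  # JDBC source database
    "--username", "etl_user",
    "--table", "emp",                               # source table
    "--target-dir", "/data/warehouse/emp",          # HDFS target directory
    "--as-parquetfile",                             # storage format on HDFS
    "--num-mappers", "4",                           # parallel transfer tasks
]

print(" ".join(sqoop_import))
```

The `--num-mappers` flag is what drives the parallel transfer mentioned above: Sqoop splits the table into ranges and imports each range in a separate map task.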

Hadoop Vs Traditional Data Processing Solutions

An RDBMS is not designed to handle huge volumes of data of different kinds; it is designed to handle structured data. The cost and complexity involved in scaling Hadoop are far lower than for an RDBMS. Moreover, the scaling and parallelism that Hadoop can achieve are nearly impossible for an RDBMS. There are many limitations in an RDBMS that restrict its use for Big Data.

Why Hadoop, and Why Can’t We Stick with Traditional Data Processing Solutions and RDBMS?

A detailed answer to this question is given in this post. This article will focus on: -
  • What limitations restrict the traditional RDBMS from being used to process Big Data.
  • How the new Big Data solution overcomes the problems of traditional DBs and what features it adds.

In this article you will learn how exactly traditional solutions differ from Big Data solutions and why they can’t be chosen over Hadoop for Big Data processing.
Without further ado, let us start with a comparative analysis of traditional DBs and Hadoop.

1. Better throughput in case of huge data volumes: -

Hadoop processes data faster when it comes to huge data volumes. Based on the way an RDBMS and Hadoop treat data, there are some striking differences between them.

Suppose there is a table of 1 terabyte in size and you run a “select *” on the table. How differently will it be processed in Hadoop and in an RDBMS?

In an RDBMS, all the data blocks of the table required for processing are moved to the application server, and the logic is then applied to the data. In Hadoop, it is not the data but the code that is sent to the node where the processing needs to happen, thereby saving the time spent on data movement.

If you have a table of 10 GB (no index), then a “select * from emp” will cause 10 GB of data to move to the DB server. In Hadoop, the code, which may be 100 KB or less, moves to all the nodes; the data is already distributed among them. This is the most important reason for Hadoop's better throughput: moving terabytes or petabytes of data across the network can itself take hours. A typical RDBMS solution used in an enterprise is shown below: -
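A rough back-of-the-envelope calculation shows why shipping code instead of data matters. The 1 Gbit/s link speed below is an assumption for illustration; the 10 GB table and ~100 KB of job code come from the example above:

```python
LINK_GBPS = 1  # assumed network bandwidth, gigabits per second

def transfer_seconds(size_bytes, gbps=LINK_GBPS):
    """Time to push size_bytes over a link of gbps gigabits/second."""
    return size_bytes * 8 / (gbps * 1e9)

data_10gb = 10 * 1024**3   # the 10 GB table
code_100kb = 100 * 1024    # the ~100 KB job code

print(f"Moving the data: {transfer_seconds(data_10gb):10.2f} s")
print(f"Moving the code: {transfer_seconds(code_100kb):10.5f} s")
```

On these assumptions, moving the table takes on the order of a minute and a half, while moving the code takes under a millisecond; a roughly 100,000-fold difference, before the RDBMS has processed a single row.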

Basic Diagram of 3 tier DB architecture

An RDBMS setup has application servers and storage servers, with clients connected through high-speed network lines, and the data moves from storage to the application server.

Another difference in how they treat data: an RDBMS follows the ACID rules while Hadoop follows the BASE model, so it settles for eventual consistency in contrast to the two-phase commit of an RDBMS.

2. Scalability
One of the most important features of a Hadoop environment is the ability to dynamically and easily expand the number of servers used for data storage, and the computing power this brings is commendable. You just need to update the Hadoop configuration (e.g. core-site.xml on the new node) so that the new member joins the cluster. Traditional DBs are also scalable, but the problem lies in the way they scale: vertical scaling makes an RDBMS costly for processing large batches of data.

Traditional DBs work perfectly fine when you have small tables with around 1-10 million rows. But when you grow to 500 million rows or petabytes of data, it becomes difficult to process.

They don’t scale well to very large data volumes. Grid solutions or sharding can help with this problem, but the increase in data volume in an RDBMS brings many limitations that compromise performance.

Some of the limitations that huge data volumes bring to an RDBMS are:
In an RDBMS, scaling doesn’t improve performance linearly. Processors, application servers, and storage can all be added, but performance does not scale linearly with them, as shown in the diagram below: -

Performance Vs Scalability Graph for Hadoop and RDBMS

The graph shows performance vs. scalability for Hadoop and an RDBMS. For Hadoop the curve is almost linear, but for an RDBMS only a nonlinear improvement in performance is seen as more servers are added. There are several reasons for this nonlinear improvement. An RDBMS follows ACID while Hadoop follows the BASE model, focusing on eventual consistency rather than two-phase commit. A predefined schema also hampers linear scalability in an RDBMS: without a flexible schema, any later schema change accelerates cost.


Scaling up is very costly. If an RDBMS is used to handle and store “big data,” it will eventually turn out to be very expensive. With increasing demand, relational databases tend to scale up vertically, which means adding extra horsepower to the system to enable faster operations on the same dataset.
On the contrary, Hadoop and NoSQL databases like HBase, Couchbase, and MongoDB scale horizontally with the addition of extra nodes (commodity servers) to the resource pool, so that the load can be distributed easily.
The cost of storing large amounts of data in a relational database becomes very high once the data crosses a certain range, while the cost of storing data in a Hadoop solution grows linearly with the volume of data and there is no ultimate limit.
Hadoop Provides power of a Database with flexibility in storage

You can read any type of file and apply any kind of processing mechanism: either treat it as a DB table and process it through SQL, or treat it as a file and use a processing engine like Spark or MapReduce, and do any kind of analysis on it, be it predictive, sentiment, regression, or real-time processing using the lambda architecture.

“Once you put your data in an RDBMS it becomes like a black box: you can access it only through SQL queries. Hadoop gives a sense of openness; it provides tremendous flexibility in storing and processing data the way we want.”

Traditional relational database systems also have limits on field lengths. Netezza, a well-known solution for handling and processing huge amounts of data, has a limit of 1,600 columns; in Hadoop there is no such limitation.

It is also difficult to implement certain kinds of use cases, such as finding the shortest path between two points, sentiment analysis, or predictive analysis, using SQL and free open-source tools on top of a relational database.

--by Sanjeev Krishna

Why Big Data Analytics Has Become a Buzzword Today

Big Data Analytics has become a buzzword today. Be it insurance, banking, e-commerce, or anything else, everyone is inclined toward learning or implementing Big Data.

Analytics, basically, is the quantification of insights into the small phenomena of real life, developed through the proper massaging of huge amounts of data with complex and cumbersome algorithms.

But why is Data Analytics such a hot topic in the market?: -

    1. Data spin-off is wealthier than data itself
Data is a magic wand: a tool that can conjure the perfect strategy to venture on by identifying the patterns and relationships present inside it.

    2. Quantification of insights for business growth
Business is all about taking the right decision, and more importantly at the right time. The quantification of insights derived from data by identifying patterns and relationships can prove reliable in solving problems and taking decisions.

    3. Selling the insights
Companies nowadays sell insights to vendors who need the information to drive their marketing strategy or to reach a larger targeted audience.

If you are new to this technical jargon then several questions may arise in your mind: -
  • Why Big Data when we already have the well tested and reliable traditional RDBMS solution?
  • Why didn’t we have to deal with Big Data 5-10 years back? Where did all the data suddenly come from?
  • Didn’t we have e-commerce websites selling products online and generating huge amounts of data 10 years back? For what good are they migrating to Big Data?

I will try to answer the above questions one by one: -

Why didn’t we have to deal with Big Data 5-10 years back? 
Answer to this consists of the following two important points:-
  • The lack of an easy, cost-effective, and fast processing engine to analyze terabytes of data of different varieties. We had solutions like Netezza and Teradata, but they are very costly and difficult to scale up.
  • Since the advent of globalization the amount of data being generated has increased tremendously, and it will keep growing at tremendous speed.

The world really is producing data at an accelerating rate every year. The reason behind this surge is the growth of globalization. Data is doubling roughly every two years; a stat from EMC projects the expected growth of data from 4.4 zettabytes in 2013 to 44 zettabytes in 2020.
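A quick sanity check shows that the "doubling every two years" claim and the EMC figures are mutually consistent (taking the EMC numbers as given):

```python
import math

data_2013_zb = 4.4   # zettabytes in 2013 (EMC figure)
data_2020_zb = 44.0  # zettabytes projected for 2020 (EMC figure)
years = 2020 - 2013

# 10x growth is log2(10) ~ 3.32 doublings over 7 years
doublings = math.log2(data_2020_zb / data_2013_zb)
doubling_time = years / doublings  # years per doubling

print(f"{doublings:.2f} doublings over {years} years")
print(f"implied doubling time: {doubling_time:.2f} years")
```

The implied doubling time works out to roughly 2.1 years, so the projection and the rule of thumb agree.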

The insights and predictions made through analytics are never 100% accurate. The precision of a prediction is directly proportional to the:
  • Type and Amount of data.
  • Rationality of the algorithm applied.

So the more data you process, the more precise your insights will be. Hadoop came as a cost-effective, fault-tolerant processing solution in which industry can invest to achieve a throughput where the investment is far less than the economic value of the insights gained by processing and analyzing the data.

To Continue this topic click here Hadoop Vs Traditional Data Processing Solutions

--By Sanjeev Krishna

Basic Programming guide to begin with Apache Spark

When you plan to learn Apache Spark, the first thing that comes to mind is:

"How Much Programming one should know to begin with learning Apache Spark?"

It’s commonly seen that database developers are inclined to learn big data, and in general they are more comfortable writing SQL or PL/SQL code than Python, Scala, or Java. Sometimes people start learning a programming language thinking it is the most critical prerequisite for learning Spark/Big Data, and they end up spending lots of time and enthusiasm on the strenuous intricacies of coding.

It’s obvious that the more programming you learn, the better developer you will become. However, this article covers how much programming one should learn to get started with Apache Spark. I will mainly cover Python and Scala, and will discuss the bare minimum programming concepts of these languages which you should know to start with Apache Spark.

These are the topics you should understand first before starting hands-on work in Spark: -
1.   Variables
2.   Conditional statements and loops
3.   Functions/procedures
4.   Exceptions
5.   Data structures
6.   Lambda functions
7.   Creating/importing modules, and jars (in Scala)
8.   Classes and objects
9.   And finally some built-in functions and common library helpers (like range(), len(), eval(), exec(), random.random(), datetime.datetime.now())

Apart from these, a little understanding of data frames is required, which can be picked up while working with Spark. To begin with Apache Spark you just need a basic understanding of the topics mentioned above, no matter which language you prefer. Once you are done, you are good to start.
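To give a feel for how little is needed, most of the Python on that list (items 1-6) fits in a few lines. This is a minimal sketch with invented example values:

```python
# 1. Variables
threshold = 100

# 2. Conditional and loop; `values` is a list (5. data structures)
values = [40, 150, 90, 210]
big = []
for v in values:
    if v > threshold:
        big.append(v)

# 3. Function
def describe(xs):
    """Return a small summary dict (another data structure) for a list."""
    return {"count": len(xs), "max": max(xs)}

# 4. Exception handling: max() on an empty list raises ValueError
try:
    summary = describe(big)
except ValueError:
    summary = {"count": 0, "max": None}

# 6. Lambda: the style used throughout Spark's RDD API,
#    e.g. rdd.filter(lambda v: v > threshold)
big_again = list(filter(lambda v: v > threshold, values))

print(summary)     # {'count': 2, 'max': 210}
print(big_again)   # [150, 210]
```

If you can read and write code at this level, you already have enough to follow most introductory Spark examples.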

If you are new to programming then I would suggest going with Python or Scala; Python will be very easy as it has a relatively faster learning curve. I will prepare separate tutorials for Python and Scala to cover all the topics mentioned above.

So Guys, All the Best and get ready to explore the super-fast data processing power of Spark.  

To learn more on Spark click here. Let us know your views or feedback on Facebook or Twitter @BigdataDiscuss.

Deployment modes and job submission in Apache Spark

There are various ways of submitting an application in Spark. In addition to the client and cluster modes of execution there is also a local mode of submitting a Spark job. We must understand these modes of execution before we start running our jobs. Before we jump in, we need to recall a few important things learnt in the previous lesson; click Introduction to Apache Spark to know more.

Spark is a scheduling, monitoring, and distribution engine, i.e. Spark is not only a processing engine; it can also act as a resource manager for the jobs submitted to it. Spark can run by itself (standalone) using its own cluster manager, and can also run on top of other cluster/resource managers.

How Spark supports different Cluster Managers and Why? 

This is made possible by the SparkContext object, which lives in the main driver program of Spark. The SparkContext object can connect to several types of cluster managers, enabling Spark to run on top of other cluster manager frameworks like Yarn or Mesos. It is this object that coordinates the independently executing parallel processes of the cluster.
Diagram: Spark cluster components (Driver and Workers) and deployment modes
A Spark installation can be launched in three different ways: -
        1.   Local (pseudo-cluster mode)
        2.   Standalone (cluster with Spark's default cluster manager)
        3.   On top of another cluster manager (cluster with Yarn, Mesos, or Kubernetes as cluster manager)


Local mode is a pseudo-cluster mode generally used for testing and demonstration. In local mode all the execution components run on a single node.

Standalone: - 

In standalone mode, the default cluster manager provided in the official distribution of Apache Spark is used for resource and cluster management of Spark jobs. It has a standalone Master for resource management and standalone Workers for the tasks.

Please do not get confused here:
standalone mode doesn't mean a single-node Spark deployment. It is also a cluster deployment of Spark; the only thing to understand is that in standalone mode the cluster is managed by Spark itself.

On top of other Cluster Manager: -

Apache Spark can also run on other cluster managers like Yarn, Mesos, or Kubernetes. However, the most used cluster manager for Spark in industry is Yarn, because of its good compatibility with HDFS and other benefits it brings, like data locality.

The command used to submit a spark job in Standalone and other cluster mode is same.
For a Scala/Java application: -
spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other Spark properties options
  <application-jar>

For a Python application: -
spark-submit \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other Spark properties options
  --py-files <python-modules-jars> \
  <application-py-file>

Table 1: Spark-submit command in Scala and Python

For Python applications, in place of a JAR we simply pass our .py file as <application-jar>, and add Python dependencies such as modules, .zip, .egg, or .py files via --py-files.

How to submit a Spark job on Standalone Cluster vs Cluster managed by other cluster managers? 

The answer to the above question is very simple. You need to use the "--master" option shown in the spark-submit command above and pass the master URL of the cluster, e.g.:
Value of “--master”:
  • For standalone deployment: --master spark://HOST:PORT
  • For Mesos: --master mesos://HOST:PORT
  • For Yarn: --master yarn
  • For local mode: --master local[*]  (* = number of threads)
Table 2: Spark-submit "--master" for different Spark deployment modes

When you submit a job in Spark, the application JAR (the code you have written for the job) is distributed to all worker nodes, along with any additional JAR files mentioned via --jars.

We have talked enough about cluster deployment modes; now we need to understand the application "--deploy-mode". The deployment modes discussed so far concern the cluster itself and are different from the "--deploy-mode" option in the spark-submit command (Table 1). --deploy-mode is the application (or driver) deploy mode, which tells Spark how to run the job on the cluster (which, as already mentioned, can be standalone, Yarn, or Mesos). For an application (Spark job) running on a cluster there are two deploy modes: one is client and the other is cluster mode.

Spark Application deploy modes: -

Cluster: - When the driver runs inside the cluster, it is cluster deploy mode. In this case the Resource Manager or Master decides on which node the driver will run.

Client: - In Client mode the driver runs in the machine where the job is submitted.

Now the question arises -

"How to submit a job in Cluster or Client  mode and which is better?"

How to submit:-

In the spark-submit command above, just pass "--deploy-mode client" for client mode or "--deploy-mode cluster" for cluster mode.
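To make this concrete, here is a small helper (an illustrative sketch, not part of Spark or its API; the job name and settings are invented) that assembles the spark-submit argument list for a given master and deploy mode:

```python
def build_submit_cmd(app, master, deploy_mode="client", main_class=None, conf=None):
    """Assemble a spark-submit command line (illustrative helper only)."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if main_class:                       # Scala/Java apps need an entry class
        cmd += ["--class", main_class]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)                      # the JAR or .py file goes last
    return cmd

# A Yarn cluster-mode submission of a hypothetical PySpark job:
cmd = build_submit_cmd("etl_job.py", master="yarn", deploy_mode="cluster",
                       conf={"spark.executor.memory": "10g"})
print(" ".join(cmd))
```

Switching the same job to client mode is just a matter of passing deploy_mode="client"; nothing else in the command changes.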


Which one is better, Client or Cluster mode:

Unlike cluster mode, in client mode the job will fail if the client machine is disconnected. Client mode is good if you want to work with Spark interactively; it is also a good choice if you don't want the driver daemon to eat up any resources from your cluster. When dealing with huge data sets and calling actions on RDDs or DataFrames, you need to make sure sufficient resources are available on the client. So it's not that one of cluster or client mode is better than the other; you can choose either deploy mode for your application, depending on what suits your requirement.

Client mode:
  • The driver runs on the machine where the job is submitted.
  • The job fails if the driver machine is disconnected.
  • Can be used to work with Spark interactively; performing an action on an RDD or DataFrame (like count) and capturing it in logs is easy.
  • JARs can be accessed from the client machine.
  • On Yarn, the Spark driver does not run inside the YARN cluster; only the executors do.
  • The local directory used by the driver is spark.local.dir; the executors use the YARN config yarn.nodemanager.local-dirs.

Cluster mode:
  • The driver runs inside the cluster; the Resource Manager or Master decides on which node it runs.
  • After submitting the job, the client can disconnect.
  • Cannot be used to work with Spark interactively.
  • Since the driver runs on a different machine than the client, JARs present on the local machine won't work; they must be made available to all nodes, either by placing them on each node or via --jars (or --py-files) during spark-submit.
  • On Yarn, both the Spark driver and the executors run inside the YARN cluster.
  • The local directories used by both the driver and the executors are those configured for YARN (yarn.nodemanager.local-dirs).
Table 3: Spark Client Vs Cluster Mode

Here are some examples of submitting a Spark job in different modes: -

Local mode: -

./bin/spark-submit \
  --class main_class \
  --master local[8] \
  <application-jar>

./bin/spark-submit \
  --master local[8] \
  <application-py-file>

Spark Standalone: -

./bin/spark-submit \
  --class main_class \
  --master spark://<ip-address>:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 10G \
  --total-executor-cores 100 \
  <application-jar>

./bin/spark-submit \
  --master spark://<ip-address>:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 10G \
  --total-executor-cores 100 \
  <application-py-file>

Yarn cluster mode: -

./bin/spark-submit \
  --class main_class \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 10G \
  --num-executors 50 \
  <application-jar>

./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 10G \
  --num-executors 50 \
  <application-py-file>

Table 4: Spark-submit examples for different modes

To learn more on Spark click here. Let us know your views or feedback on Facebook or Twitter @BigDataDiscuss.


