Big Data – Various Learning Resources – How to Start with Big Data? – Day 20 of 21

In yesterday’s blog post we learned how to become a Data Scientist for Big Data. In this article we will go over various learning resources related to Big Data.

In this series we have covered many of the most essential details about Big Data. At the beginning of this series, I encouraged readers to send me questions. One of the most popular questions is -

“I want to learn more about Big Data. Where can I learn it?”

This is indeed a great question, as there are plenty of resources out there to learn about Big Data, and it is genuinely difficult to select one resource to learn from. Hence I decided to list here a few of the most important resources related to Big Data.

Learn from Pluralsight

Pluralsight is a global leader in high-quality online training for hardcore developers. It has fantastic Big Data courses, and I started to learn about Big Data with the help of Pluralsight. Here are a few of the courses which are directly related to Big Data.

I encourage all of you to start with these video courses, as they cover the fundamentals of Big Data wonderfully.

Learn from Apache

The resources at Apache are the single most authentic learning resources. If you want to learn the fundamentals and go deep into every aspect of Big Data, I believe you must understand the various concepts in Apache’s library. I am pretty impressed with the documentation, and I personally reference it every single day when I work with Big Data. I strongly encourage all of you to bookmark all of the following links for authentic Big Data learning.

  • Hadoop – The Apache Hadoop® project develops open-source software for reliable, scalable, distributed computing.
  • Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heat maps) and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro: A data serialization system.
  • Cassandra: A scalable multi-master database with no single points of failure.
  • Chukwa: A data collection system for managing large distributed systems.
  • HBase: A scalable, distributed database that supports structured data storage for large tables.
  • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout: A scalable machine learning and data mining library.
  • Pig: A high-level data-flow language and execution framework for parallel computation.
  • ZooKeeper: A high-performance coordination service for distributed applications.

Learn from Vendors

One of the biggest hurdles in learning Big Data is setting up the environment. Every Big Data vendor has different environment requirements, and setting up a Big Data framework involves lots of pieces. Many users never start with Big Data because they are worried about the resources required to set up the framework, as well as the time commitment. Here Hortonworks has created a fantastic learning environment. They have created a Sandbox with everything one needs to learn Big Data, and they provide excellent tutorials along with it. The Sandbox comes with a dozen hands-on tutorials that will guide you through the basics of Hadoop, and it contains the Hortonworks Data Platform.

I think Hortonworks did a fantastic job building this Sandbox and its tutorials. Though there are plenty of different Big Data vendors, I have decided to list only Hortonworks due to their unique setup. Please leave a comment if there are any other such platforms to learn Big Data, and I will include them here as well.

Learn from Books

There are indeed a few good books out there which one can refer to in order to learn Big Data. Here are a few good books which I have read. I will update the list as I learn more.

If you search on Amazon, there are millions of books, but I think the above three books are a great set, and they will give you a solid grounding in Big Data. Once you go through them, you will have a clear idea about the next step you should follow in this series, and you will be capable of making the right decision for yourself.


In tomorrow’s blog post we will wrap up this series on Big Data.


Big Data – How to become a Data Scientist and Learn Data Science? – Day 19 of 21

In yesterday’s blog post we learned the importance of analytics in the Big Data story. In this article we will understand how to become a Data Scientist for the Big Data story.

Data Scientist is the new buzzword; everyone seems to want to become a Data Scientist. Let us go over a few key topics related to Data Scientists in this blog post. First of all, we will understand what a Data Scientist is.

In the new world of Big Data, I see pretty much everyone wanting to become a Data Scientist, and I have already met lots of people who claim that they are Data Scientists. When I ask what their role is, I get a wide variety of answers.

What is Data Scientist?

Data Scientists are the experts who understand various aspects of the business and know how to use data strategically to achieve the business goals. They should have a solid foundation in various data algorithms, modeling, and statistical methodology.

What do Data Scientists do?

Data Scientists understand the data very well. They go beyond the regular data algorithms and build interesting trends from the available data. They innovate and draw entirely new meaning from the existing data. They are artists in the disguise of computer analysts. They look at the data traditionally as well as explore various new ways to look at it.

Data Scientists do not wait to build their solutions from existing data. They think creatively; they think before the data has even entered the system. Data Scientists are visionary experts who understand the business needs and plan ahead of time, which tremendously helps to build solutions at rapid speed.

Besides being data experts, the major quality of Data Scientists is “curiosity”. They always wonder what more they can get from their existing data and how to get the maximum out of future incoming data.

Data Scientists do wonders with data, going beyond the job descriptions of Data Analysts or Business Analysts.

Skills Required for Data Scientists

Here are a few of the skills a Data Scientist must have.

  • Expert-level skills with statistical tools like SAS, Excel, R, etc.
  • Understanding of mathematical models
  • Hands-on experience with visualization tools like Tableau, PowerPivot, D3.js, etc.
  • Analytical skills to understand business needs
  • Communication skills

On the technology front, any Data Scientist should know the underlying technologies (e.g., Hadoop and distributions like Cloudera) as well as their entire ecosystem (programming languages, analysis and visualization tools, etc.).

Remember that becoming a successful Data Scientist requires truly excellent skills; just having a degree in a relevant field of education will not suffice.

Final Note

Data Scientist is indeed a very exciting job profile. As per research, there are not enough Data Scientists in the world to handle the current data explosion. In the near future data is going to expand exponentially, and the need for Data Scientists will increase along with it. It is indeed the job to focus on if you like data and the science of statistics.



In tomorrow’s blog post we will discuss various Big Data learning resources.


Big Data – Basics of Big Data Analytics – Day 18 of 21

In yesterday’s blog post we learned the importance of the various components in the Big Data story. In this article we will understand the various analytics tasks we try to achieve with Big Data and list the important tools in the Big Data story.

When you have plenty of data around you, what is the first thing which comes to your mind?

“What does all this data mean?”

Exactly – the same thought comes to my mind as well. I always want to know what all the data means and what meaningful information I can get out of it. Most Big Data projects are built to retrieve the various intelligence all this data contains within it. Let us take the example of Facebook. When I look at my friends list on Facebook, I always want to ask many questions, such as -

  • On which date do the most of my friends have a birthday?
  • What is the favorite film of most of my friends, so I can talk about it and engage them?
  • Which travel destination do most of my friends like?
  • Which cuisine is most disliked by my friends in India and the USA, so that when they travel, I do not take them there?

There are many more questions I can think of. This illustrates how important it is to analyze Big Data.

Here are a few of the kinds of analysis you can use with Big Data.

Slicing and Dicing: This means breaking your data down into smaller sets and understanding them one set at a time. It also helps to present the information in a variety of user-digestible ways. For example, if you have data related to movies, you can slice and dice the data along various dimensions like actor, movie length, etc.
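To make slicing and dicing concrete, here is a minimal sketch in the SQL-like HiveQL we met earlier in this series, assuming a hypothetical movies table with actor and length columns:

-- slice the movie data along one dimension (actor) at a time
SELECT actor,
       COUNT(1) AS total_movies,
       AVG(length) AS avg_length
FROM movies
GROUP BY actor;

Swapping the GROUP BY column (genre, release year, etc.) dices the same data along a different dimension.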

Real Time Monitoring: This is very crucial in social media when there is an event happening and you want to measure its impact while it is happening. For example, if you are on Twitter during a football match, you can watch what fans are saying about the match on Twitter while the event is happening.

Anomaly Prediction and Modeling: If the business is running normally, all is well, but if there are signs of trouble, everyone wants to know about them early on. Big Data analysis of various patterns can be very helpful in predicting the future. It may not always be accurate, but certain hints and signals can be very helpful. For example, lots of data can help conclude that heavy rain increases the sale of umbrellas.

Text and Unstructured Data Analysis: Unstructured data is now becoming the norm in the new world, and it is a big part of the Big Data revolution. It is very important that we Extract, Transform, and Load unstructured data and make meaningful data out of it. For example, by analyzing lots of images, one can predict that people like to wear certain colors in certain months.

Big Data Analytics Solutions

There are many different Big Data analytics solutions out in the market. It is impossible to list all of them, so I will list a few of them over here.

  • Tableau – This has to be one of the most popular visualization tools out in the big data market.
  • SAS – A high performance analytics and infrastructure company
  • IBM and Oracle – They have a range of tools for Big Data Analysis


In tomorrow’s blog post we will discuss a very important component of the Big Data ecosystem – the Data Scientist.


Big Data – Interacting with Hadoop – What is Sqoop? – What is Zookeeper? – Day 17 of 21

In yesterday’s blog post we learned the importance of Pig and Pig Latin in the Big Data story. In this article we will understand what Sqoop and ZooKeeper are in the Big Data story.

There are two important components one should learn about when interacting with Hadoop – Sqoop and ZooKeeper.

What is Sqoop?

Most businesses store their data in RDBMS as well as other data warehouse solutions. They need a way to move data to the Hadoop system for various processing and return it back to the RDBMS from the Hadoop system. The data movement can happen in real time or in bulk at various intervals. We need a tool which can help us move this data from SQL to Hadoop and from Hadoop to SQL. Sqoop (SQL-to-Hadoop) is such a tool: it extracts data from non-Hadoop data sources, transforms it into a format which Hadoop can use, and later loads it into HDFS. Essentially it is an ETL tool: it Extracts, Transforms, and Loads from SQL to Hadoop. The best part is that it can also extract data from Hadoop and load it back into SQL (i.e., RDBMS) data stores. Essentially, Sqoop is a command-line tool which does SQL to Hadoop and Hadoop to SQL. It is a command-line interpreter, and it creates MapReduce jobs behind the scenes to import data from an external database into HDFS. It is a very effective and easy-to-learn tool for non-programmers.
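As a rough sketch of how this looks in practice (the server, credentials, and table names below are all hypothetical), a typical round trip with Sqoop from the command line might be:

# import a hypothetical orders table from MySQL into HDFS
sqoop import \
  --connect jdbc:mysql://dbserver.example.com/sales \
  --username reportuser -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4

# export processed results from HDFS back into the RDBMS
sqoop export \
  --connect jdbc:mysql://dbserver.example.com/sales \
  --username reportuser -P \
  --table order_summary \
  --export-dir /user/hadoop/order_summary

Behind the scenes each command becomes a MapReduce job, with --num-mappers controlling how many parallel tasks read from the database.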

What is Zookeeper?

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In other words, ZooKeeper is a replicated synchronization service with eventual consistency. In simpler words: in a Hadoop cluster there are many different nodes, and one node is the master. Let us assume that the master node fails for some reason. In this case, the role of the master node has to be transferred to a different node. The main role of the master node is managing the writers, as that task requires persistence in the order of writing. In this kind of scenario ZooKeeper will assign a new master node and make sure that the Hadoop cluster performs without any glitch. ZooKeeper is Hadoop’s method of coordinating all the elements of these distributed systems. Here are a few of the tasks for which ZooKeeper is responsible.

  • ZooKeeper manages the entire workflow of starting and stopping various nodes in the Hadoop cluster.
  • When any process in the Hadoop cluster needs a certain configuration to complete its task, ZooKeeper makes sure that the node gets the necessary configuration consistently.
  • In case the master node fails, ZooKeeper can assign a new master node and make sure the cluster works as expected.

There are many other tasks ZooKeeper performs when it comes to Hadoop cluster coordination and communication. Basically, without the help of ZooKeeper it is very hard to design a new fault-tolerant distributed application.
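The configuration task above is easy to see from ZooKeeper’s bundled command-line shell. As a minimal sketch (the server address, znode path, and value are hypothetical), a configuration value can be published once and read back consistently from any client:

# connect to a ZooKeeper server with the bundled CLI
bin/zkCli.sh -server zk1.example.com:2181

# publish a configuration value as a znode
create /app/config "max_workers=8"

# any client in the cluster reads the same value back
get /app/config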


In tomorrow’s blog post we will discuss a very important component of the Big Data ecosystem – Big Data analytics.


Big Data – Interacting with Hadoop – What is PIG? – What is PIG Latin? – Day 16 of 21

In yesterday’s blog post we learned the importance of Hive in the Big Data story. In this article we will understand what Pig and Pig Latin are in the Big Data story.

Yahoo started working on Pig for deploying their applications on Hadoop. Yahoo’s goal was to manage their unstructured data.

What is Pig and What is Pig Latin?

Pig is a high-level platform for creating MapReduce programs used with Hadoop, and the language we use on this platform is called Pig Latin. Pig was designed to make Hadoop more user-friendly and approachable for power users and non-developers. Pig is an interactive execution environment supporting the Pig Latin language. The Pig Latin language supports loading and processing input data through a series of transformations to produce the desired results. Pig has two different execution environments: 1) Local Mode – all the scripts run on a single machine; 2) Hadoop Mode – all the scripts run on a Hadoop cluster.

Pig Latin vs SQL

Pig essentially creates a set of map and reduce jobs under the hood. Because of this, users do not have to write, compile, and build their own solutions for Big Data. Pig is very similar to SQL in many ways. The Pig Latin language provides an abstraction layer over the data. It focuses on the data and not on the structure underneath. Pig Latin is a very powerful language and it can do various operations like loading and storing data, streaming data, and filtering data, as well as various operations on strings. The major difference between SQL and Pig Latin is that Pig is procedural and SQL is declarative. In simpler words, a Pig Latin script is very similar to a SQL execution plan, and that makes it much easier for programmers to build various processes. Whereas SQL handles trees naturally, Pig Latin follows a directed acyclic graph (DAG). DAGs are used to model several different kinds of structures in mathematics and computer science.
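To show the procedural style, here is a minimal Pig Latin sketch (the input file and its schema are hypothetical); notice how each statement names one explicit step of the data flow, rather than one declarative statement as in SQL:

-- load raw movie data from a hypothetical CSV file in HDFS
movies = LOAD '/data/movies.csv' USING PigStorage(',')
         AS (title:chararray, actor:chararray, length:int);
-- keep only the long movies
long_movies = FILTER movies BY length > 120;
-- group them by actor and count each group
by_actor = GROUP long_movies BY actor;
counts = FOREACH by_actor GENERATE group AS actor, COUNT(long_movies) AS total;
-- store the result back into HDFS
STORE counts INTO '/output/long_movies_by_actor';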



In tomorrow’s blog post we will discuss very important components of the Big Data ecosystem – Sqoop and ZooKeeper.


Big Data – Data Mining with Hive – What is Hive? – What is HiveQL (HQL)? – Day 15 of 21

In yesterday’s blog post we learned the importance of the operational database in the Big Data story. In this article we will understand what Hive and HQL are in the Big Data story.

Yahoo started working on Pig (we will understand that in the next blog post) for deploying their applications on Hadoop, with the goal of managing their unstructured data. Similarly, Facebook started deploying their warehouse solutions on Hadoop, which resulted in Hive. The reason for going with Hive is that traditional warehousing solutions were getting very expensive.

What is HIVE?

Hive is a data warehousing infrastructure for Hadoop. Its primary responsibility is to provide data summarization, query, and analysis. It supports analysis of large datasets stored in Hadoop’s HDFS as well as on the Amazon S3 filesystem. The best part of Hive is that it supports SQL-like access to structured data, via a language known as HiveQL (or HQL), as well as big data analysis with the help of MapReduce. Hive is not built to give quick responses to queries; it is built for data mining applications. Data mining applications can take from several minutes to several hours to analyze the data, and that is where Hive is primarily used.

HIVE Organization

Data is organized in three different formats in Hive.

Tables: They are very similar to RDBMS tables and contain rows and columns. Hive is just layered over the Hadoop File System (HDFS), hence tables are directly mapped to directories of the filesystem. Hive also supports tables stored in other native file systems.

Partitions: A Hive table can have more than one partition. Partitions are mapped to subdirectories of the table’s directory in the file system.

Buckets: In Hive, data may be further divided into buckets. Buckets are stored as files within the partition directories in the underlying file system.
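Putting the three layouts together, here is a minimal HQL sketch (the schema is hypothetical) declaring a partitioned, bucketed table:

-- each sale_date value becomes its own HDFS subdirectory (a partition)
-- rows within a partition are hashed on name into 32 files (buckets)
CREATE TABLE sales (
  name       STRING,
  salesprice DOUBLE
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (name) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';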

Hive also has a metastore, which stores all the metadata. It is a relational database containing various information related to Hive schemas (column types, owners, key-value data, statistics, etc.). We can use a MySQL database here.

What is HiveQL (HQL)?

The Hive query language provides basic SQL-like operations. Here are a few of the tasks which HQL can do easily.

  • Create and manage tables and partitions
  • Support various Relational, Arithmetic and Logical Operators
  • Evaluate functions
  • Download the contents of a table to a local directory, or the results of queries to an HDFS directory

Here are two examples of HQL queries:

SELECT upper(name), salesprice
FROM sales;
SELECT category, count(1) 
FROM products 
GROUP BY category;

When you look at the above queries, you can see that they are very similar to SQL queries.


In tomorrow’s blog post we will discuss a very important component of the Big Data ecosystem – Pig.


Big Data – Operational Databases Supporting Big Data – Columnar, Graph and Spatial Database – Day 14 of 21

In yesterday’s blog post we learned the importance of Key-Value Pair Databases and Document Databases in the Big Data story. In this article we will understand the role of Columnar, Graph, and Spatial Databases in supporting the Big Data story.

Now we will see a few of the examples of the operational databases.

  • Relational Databases (The day before yesterday’s post)
  • NoSQL Databases (The day before yesterday’s post)
  • Key-Value Pair Databases (Yesterday’s post)
  • Document Databases (Yesterday’s post)
  • Columnar Databases (Today’s post)
  • Graph Databases (Today’s post)
  • Spatial Databases (Today’s post)

Columnar Databases 

A relational database is a row-store database, or a row-oriented database. Columnar databases are column-oriented, or column-store, databases. As we discussed earlier, in Big Data we have different kinds of data and we need to store those different kinds of data in the database. With a columnar database this is very easy to do, as we can just add a new column to the columnar database. HBase is one of the most popular columnar databases. It uses the Hadoop file system and MapReduce for its core data storage. However, remember this is not a good solution for every application. It is particularly good for databases where a high volume of incremental data is gathered and processed.
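The “just add a new column” flexibility is easy to see from the HBase shell. As a minimal sketch (the table, row key, and column names are hypothetical), a brand-new column can be written at any time without any schema change:

# create a table with one column family
create 'pageviews', 'metrics'

# store a value for one row
put 'pageviews', 'example.com/home', 'metrics:count', '42'

# write to a brand-new column in the same family – no schema change needed
put 'pageviews', 'example.com/home', 'metrics:referrer', 'news.example.org'

# read the row back
get 'pageviews', 'example.com/home'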

Graph Databases

For highly interconnected data it is suitable to use a Graph Database. This database has a node-relationship structure. Nodes and relationships contain key-value pairs where data is stored. The major advantage of this database is that it supports faster navigation among the various relationships. For example, Facebook uses a graph database to list and demonstrate various relationships between users. Neo4J is one of the most popular open-source graph databases. One major disadvantage of the Graph Database is that it is not possible to self-reference (self-joins, in RDBMS terms), and there might be real-world scenarios where this is required and the graph database does not support it.
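As a minimal sketch of the node-relationship structure (the users and their names are hypothetical), here is how the same idea looks in Cypher, Neo4J’s query language:

// create two user nodes connected by a FRIEND relationship
CREATE (a:User {name: 'Alice'})-[:FRIEND]->(b:User {name: 'Bob'});

// navigate the relationship directly instead of joining tables
MATCH (me:User {name: 'Alice'})-[:FRIEND]->(friend)
RETURN friend.name;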

Spatial Databases 

We all use Foursquare, Google+, as well as Facebook check-ins for location-aware check-ins. All the location-aware applications figure out the position of the phone with the help of the Global Positioning System (GPS). Think about it: so many different users at different locations in the world, all checking in together. Additionally, the applications are now feature-rich, and users are demanding more and more information from them, for example about movies, coffee shops, or places to see. They all run with the help of Spatial Databases. Spatial data is standardized by the Open Geospatial Consortium, known as the OGC. Spatial data helps answer many interesting questions like “What is the distance between two locations?” or “What is the area of an interesting place?” When we think about it, it is very clear that handling spatial data and returning meaningful results is one big task when there are millions of users moving dynamically from one place to another and requesting various spatial information. The PostGIS/OpenGIS suite is a very popular spatial database. It runs as a layer implemented on the RDBMS PostgreSQL. This makes it totally unique, as it offers the best of both worlds.
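As a minimal sketch of such a spatial question (the coffee_shops table, its geom column, and the coordinates are hypothetical), here is a PostGIS query that finds every coffee shop within 500 meters of a user’s position:

-- find nearby coffee shops, with their distance in meters
SELECT name,
       ST_Distance(geom::geography,
                   ST_SetSRID(ST_MakePoint(-122.41, 37.77), 4326)::geography) AS meters
FROM coffee_shops
WHERE ST_DWithin(geom::geography,
                 ST_SetSRID(ST_MakePoint(-122.41, 37.77), 4326)::geography,
                 500)
ORDER BY meters;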



In tomorrow’s blog post we will discuss a very important component of the Big Data ecosystem – Hive.
