Big Data – Buzz Words: What is HDFS – Day 8 of 21

In yesterday’s blog post we learned what is MapReduce. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – HDFS.

What is HDFS ?

HDFS stands for Hadoop Distributed File System and it is a primary storage system used by Hadoop. It provides high performance access to data across Hadoop clusters. It is usually deployed on low-cost commodity hardware. In commodity hardware deployment server failures are very common. Due to the same reason HDFS is built to have high fault tolerance. The data transfer rate between compute nodes in HDFS is very high, which leads to reduced risk of failure.

HDFS creates smaller pieces of the big data and distributes it on different nodes. It also copies each smaller piece to multiple times on different nodes. Hence when any node with the data crashes the system is automatically able to use the data from a different node and continue the process. This is the key feature of the HDFS system.

Architecture of HDFS

The architecture of the HDFS is master/slave architecture. An HDFS cluster always consists of single NameNode. This single NameNode is a master server and it manages the file system as well regulates access to various files. In additional to NameNode there are multiple DataNodes. There is always one DataNode for each data server. In HDFS a big file is split into one or more blocks and those blocks are stored in a set of DataNodes.

The primary task of the NameNode is to open, close or rename files and directory and regulate access to the file system, whereas the primary task of the DataNode is read and write to the file systems. DataNode is also responsible for the creation, deletion or replication of the data based on the instruction from NameNode.

In reality, NameNode and DataNode are software designed to run on commodity machine build in Java language.

Visual Representation of HDFS Architecture

hdfs 1 Big Data   Buzz Words: What is HDFS   Day 8 of 21

Let us understand how HDFS works with the help of the diagram. Client APP or HDFS Client connects to NameSpace as well as DataNode. Client App access to the DataNode is regulated by NameSpace Node. NameSpace Node allows Client App to connect to the DataNode based by allowing the connection to the DataNode directly. A big data file is divided into multiple data blocks (let us assume that those data chunks are A,B,C and D. Client App will later on write data blocks directly to the DataNode. Client App does not have to directly write to all the node. It just has to write to any one of the node and NameNode will decide on which other DataNode it will have to replicate the data. In our example Client App directly writes to DataNode 1 and detained 3. However, data chunks are automatically replicated to other nodes. All the information like in which DataNode which data block is placed is written back to NameNode.

High Availability During Disaster

Now as multiple DataNode have same data blocks in the case of any DataNode which faces the disaster, the entire process will continue as other DataNode will assume the role to serve the specific data block which was on the failed node. This system provides very high tolerance to disaster and provides high availability.

If you notice there is only single NameNode in our architecture. If that node fails our entire Hadoop Application will stop performing as it is a single node where we store all the metadata. As this node is very critical, it is usually replicated on another clustered as well as on another data rack. Though, that replicated node is not operational in architecture, it has all the necessary data to perform the task of the NameNode in the case of the NameNode fails.

The entire Hadoop architecture is built to function smoothly even there are node failures or hardware malfunction. It is built on the simple concept that data is so big it is impossible to have come up with a single piece of the hardware which can manage it properly. We need lots of commodity (cheap) hardware to manage our big data and hardware failure is part of the commodity servers. To reduce the impact of hardware failure Hadoop architecture is built to overcome the limitation of the non-functioning hardware.


In tomorrow’s blog post we will discuss the importance of the relational database in Big Data.

Reference: Pinal Dave (

Big Data – Buzz Words: What is MapReduce – Day 7 of 21

In yesterday’s blog post we learned what is Hadoop. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – MapReduce.

What is MapReduce?

MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Though, MapReduce was originally Google proprietary technology, it has been quite a generalized term in the recent time.

MapReduce comprises a Map() and Reduce() procedures. Procedure Map() performance filtering and sorting operation on data where as procedure Reduce() performs a summary operation of the data. This model is based on modified concepts of the map and reduce functions commonly available in functional programing. The library where procedure Map() and Reduce() belongs is written in many different languages. The most popular free implementation of MapReduce is Apache Hadoop which we will explore tomorrow.

mapreduce Big Data   Buzz Words: What is MapReduce   Day 7 of 21

Advantages of MapReduce Procedures

The MapReduce Framework usually contains distributed servers and it runs various tasks in parallel to each other. There are various components which manages the communications between various nodes of the data and provides the high availability and fault tolerance. Programs written in MapReduce functional styles are automatically parallelized and executed on commodity machines. The MapReduce Framework takes care of the details of partitioning the data and executing the processes on distributed server on run time. During this process if there is any disaster the framework provides high availability and other available modes take care of the responsibility of the failed node.

As you can clearly see more this entire MapReduce Frameworks provides much more than just Map() and Reduce() procedures; it provides scalability and fault tolerance as well. A typical implementation of the MapReduce Framework processes many petabytes of data and thousands of the processing machines.

How do MapReduce Framework Works?

A typical MapReduce Framework contains petabytes of the data and thousands of the nodes. Here is the basic explanation of the MapReduce Procedures which uses this massive commodity of the servers.

Map() Procedure

There is always a master node in this infrastructure which takes an input. Right after taking input master node divides it into smaller sub-inputs or sub-problems. These sub-problems are distributed to worker nodes. A worker node later processes them and does necessary analysis. Once the worker node completes the process with this sub-problem it returns it back to master node.

Reduce() Procedure

All the worker nodes return the answer to the sub-problem assigned to them to master node. The master node collects the answer and once again aggregate that in the form of the answer to the original big problem which was assigned master node.

The MapReduce Framework does the above Map () and Reduce () procedure in the parallel and independent to each other. All the Map() procedures can run parallel to each other and once each worker node had completed their task they can send it back to master code to compile it with a single answer. This particular procedure can be very effective when it is implemented on a very large amount of data (Big Data).

The MapReduce Framework has five different steps:

  • Preparing Map() Input
  • Executing User Provided Map() Code
  • Shuffle Map Output to Reduce Processor
  • Executing User Provided Reduce Code
  • Producing the Final Output

Here is the Dataflow of MapReduce Framework:

  • Input Reader
  • Map Function
  • Partition Function
  • Compare Function
  • Reduce Function
  • Output Writer

In a future blog post of this 31 day series we will explore various components of MapReduce in Detail.

MapReduce in a Single Statement

MapReduce is equivalent to SELECT and GROUP BY of a relational database for a very large database.


In tomorrow’s blog post we will discuss Buzz Word – HDFS.

Reference: Pinal Dave (

Big Data – Buzz Words: What is Hadoop – Day 6 of 21

In yesterday’s blog post we learned what is NoSQL. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – Hadoop.

What is Hadoop?

Apache Hadoop is an open-source, free and Java based software framework offers a powerful distributed platform to store and manage Big Data. It is licensed under an Apache V2 license. It runs applications on large clusters of commodity hardware and it processes thousands of terabytes of data on thousands of the nodes. Hadoop is inspired from Google’s MapReduce and Google File System (GFS) papers. The major advantage of Hadoop framework is that it provides reliability and high availability.

hadoopbanner Big Data   Buzz Words: What is Hadoop   Day 6 of 21

What are the core components of Hadoop?

There are two major components of the Hadoop framework and both fo them does two of the important task for it.

  • Hadoop MapReduce is the method to split a larger data problem into smaller chunk and distribute it to many different commodity servers. Each server have their own set of resources and they have processed them locally. Once the commodity server has processed the data they send it back collectively to main server. This is effectively a process where we process large data effectively and efficiently. (We will understand this in tomorrow’s blog post).
  • Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between any other file system and Hadoop. When we move a file on HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually 3) for the fault tolerance or high availability. (We will understand this in the day after tomorrow’s blog post).

Besides above two core components Hadoop project also contains following modules as well.

  • Hadoop Common: Common utilities for the other Hadoop modules
  • Hadoop Yarn: A framework for job scheduling and cluster resource management

There are a few other projects (like Pig, Hive) related to above Hadoop as well which we will gradually explore in later blog posts.

A Multi-node Hadoop Cluster Architecture

Now let us quickly see the architecture of the a multi-node Hadoop cluster.

hadooparchitecture Big Data   Buzz Words: What is Hadoop   Day 6 of 21

A small Hadoop cluster includes a single master node and multiple worker or slave node. As discussed earlier, the entire cluster contains two layers. One of the layer of MapReduce Layer and another is of HDFS Layer. Each of these layer have its own relevant component. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node consists of a DataNode and TaskTracker. It is also possible that slave node or worker node is only data or compute node. The matter of the fact that is the key feature of the Hadoop.

In this introductory blog post we will stop here while describing the architecture of Hadoop. In a future blog post of this 31 day series we will explore various components of Hadoop Architecture in Detail.

Why Use Hadoop?

There are many advantages of using Hadoop. Let me quickly list them over here:

  • Robust and Scalable – We can add new nodes as needed as well modify them.
  • Affordable and Cost Effective – We do not need any special hardware for running Hadoop. We can just use commodity server.
  • Adaptive and Flexible – Hadoop is built keeping in mind that it will handle structured and unstructured data.
  • Highly Available and Fault Tolerant – When a node fails, the Hadoop framework automatically fails over to another node.

hadoop elephant Big Data   Buzz Words: What is Hadoop   Day 6 of 21Why Hadoop is named as Hadoop?

In year 2005 Hadoop was created by Doug Cutting and Mike Cafarella while working at Yahoo. Doug Cutting named Hadoop after his son’s toy elephant.


In tomorrow’s blog post we will discuss Buzz Word – MapReduce.

Reference: Pinal Dave (

Big Data – Buzz Words: What is NoSQL – Day 5 of 21

In yesterday’s blog post we explored the basic architecture of Big Data . In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – NoSQL.

What is NoSQL?

NoSQL stands for Not Relational SQL or Not Only SQL. Lots of people think that NoSQL means there is No SQL, which is not true – they both sound same but the meaning is totally different. NoSQL does use SQL but it uses more than SQL to achieve its goal. As per Wikipedia’s NoSQL Database Definition – “A NoSQL database provides a mechanism for storage and retrieval of data that uses looser consistency models than traditional relational databases.

nosql Big Data   Buzz Words: What is NoSQL   Day 5 of 21

Why use NoSQL?

A traditional relation database usually deals with predictable structured data. Whereas as the world has moved forward with unstructured data we often see the limitations of the traditional relational database in dealing with them. For example, nowadays we have data in format of SMS, wave files, photos and video format. It is a bit difficult to manage them by using a traditional relational database. I often see people using BLOB filed to store such a data. BLOB can store the data but when we have to retrieve them or even process them the same BLOB is extremely slow in processing the unstructured data. A NoSQL database is the type of database that can handle unstructured, unorganized and unpredictable data that our business needs it.

Along with the support to unstructured data, the other advantage of NoSQL Database is high performance and high availability.

Eventual Consistency

Additionally to note that NoSQL Database may not provided 100% ACID (Atomicity, Consistency, Isolation, Durability) compliance.  Though, NoSQL Database does not support ACID they provide eventual consistency. That means over the long period of time all updates can be expected to propagate eventually through the system and data will be consistent.


Taxonomy is the practice of classification of things or concepts and the principles. The NoSQL taxonomy supports column store, document store, key-value stores, and graph databases. We will discuss the taxonomy in detail in later blog posts. Here are few of the examples of the each of the No SQL Category.

  • Column: Hbase, Cassandra, Accumulo
  • Document: MongoDB, Couchbase, Raven
  • Key-value : Dynamo, Riak, Azure, Redis, Cache, GT.m
  • Graph: Neo4J, Allegro, Virtuoso, Bigdata

As of now there are over 150 NoSQL Database and you can read everything about them in this single link.


In tomorrow’s blog post we will discuss Buzz Word – Hadoop.

Reference: Pinal Dave (

Big Data – Basics of Big Data Architecture – Day 4 of 21

In yesterday’s blog post we understood how Big Data evolution happened. Today we will understand basics of the Big Data Architecture.

Big Data Cycle

Just like every other database related applications, bit data project have its development cycle. Though three Vs (link) for sure plays an important role in deciding the architecture of the Big Data projects. Just like every other project Big Data project also goes to similar phases of the data capturing, transforming, integrating, analyzing and building actionable reporting on the top of  the data.

While the process looks almost same but due to the nature of the data the architecture is often totally different. Here are few of the question which everyone should ask before going ahead with Big Data architecture.

Questions to Ask

How big is your total database?

What is your requirement of the reporting in terms of time – real time, semi real time or at frequent interval?

How important is the data availability and what is the plan for disaster recovery?

What are the plans for network and physical security of the data?

What platform will be the driving force behind data and what are different service level agreements for the infrastructure?

This are just basic questions but based on your application and business need you should come up with the custom list of the question to ask. As I mentioned earlier this question may look quite simple but the answer will not be simple. When we are talking about Big Data implementation there are many other important aspects which we have to consider when we decide to go for the architecture.

Building Blocks of Big Data Architecture

It is absolutely impossible to discuss and nail down the most optimal architecture for any Big Data Solution in a single blog post, however, we can discuss the basic building blocks of big data architecture. Here is the image which I have built to explain how the building blocks of the Big Data architecture works.

bigdataarchitecture Big Data   Basics of Big Data Architecture   Day 4 of 21

Above image gives good overview of how in Big Data Architecture various components are associated with each other. In Big Data various different data sources are part of the architecture hence extract, transform and integration are one of the most essential layers of the architecture. Most of the data is stored in relational as well as non relational data marts and data warehousing solutions. As per the business need various data are processed as well converted to proper reports and visualizations for end users. Just like software the hardware is almost the most important part of the Big Data Architecture. In the big data architecture hardware infrastructure is extremely important and failure over instances as well as redundant physical infrastructure is usually implemented.

NoSQL in Data Management

NoSQL is a very famous buzz word and it really means Not Relational SQL or Not Only SQL. This is because in Big Data Architecture the data is in any format. It can be unstructured, relational or in any other format or from any other data source. To bring all the data together relational technology is not enough, hence new tools, architecture and other algorithms are invented which takes care of all the kind of data. This is collectively called NoSQL.


Next four days we will answer the Buzz Words – Hadoop.

Reference: Pinal Dave (

Big Data – Evolution of Big Data – Day 3 of 21

In yesterday’s blog post we answered what is the Big Data. Today we will understand why and how the evolution of Big Data has happened. Though the answer is very simple, I would like to tell it in the form of a history lesson.

Data in Flat File

textfile Big Data   Evolution of Big Data   Day 3 of 21In earlier days data was stored in the flat file and there was no structure in the flat file.  If any data has to be retrieved from the flat file it was a project by itself. There was no possibility of retrieving the data efficiently and data integrity has been just a term discussed without any modeling or structure around. Database residing in the flat file had more issues than we would like to discuss in today’s world. It was more like a nightmare when there was any data processing involved in the application. Though, applications developed at that time were also not that advanced the need of the data was always there and there was always need of proper data management.

Edgar F Codd and 12 Rules

efcodd Big Data   Evolution of Big Data   Day 3 of 21Edgar Frank Codd was a British computer scientist who, while working for IBM, invented the relational model for database management, the theoretical basis for relational databases. He presented 12 rules for the Relational Database and suddenly the chaotic world of the database seems to see discipline in the rules. Relational Database was a promising land for all the unstructured database users. Relational Database brought into the relationship between data as well improved the performance of the data retrieval. Database world had immediately seen a major transformation and every single vendors and database users suddenly started to adopt the relational database models.

Relational Database Management Systems

Since Edgar F Codd proposed 12 rules for the RBDMS there were many different vendors who started them to build applications and tools to support the relationship between database. This was indeed a learning curve for many of the developer who had never worked before with the modeling of the database. However, as time passed by pretty much everybody accepted the relationship of the database and started to evolve product which performs its best with the boundaries of the RDBMS concepts. This was the best era for the databases and it gave the world extreme experts as well as some of the best products. The Entity Relationship model was also evolved at the same time. In software engineering, an Entity–relationship model (ER model) is a data model for describing a database in an abstract way.

Enormous Data Growth

rdbms Big Data   Evolution of Big Data   Day 3 of 21Well, everything was going fine with the RDBMS in the database world. As there were no major challenges the adoption of the RDBMS applications and tools was pretty much universal. There was a race at times to make the developer’s life much easier with the RDBMS management tools. Due to the extreme popularity and easy to use system pretty much every data was stored in the RDBMS system. New age applications were built and social media took the world by the storm. Every organizations was feeling pressure to provide the best experience for their users based the data they had with them. While this was all going on at the same time data was growing pretty much every organization and application.

Data Warehousing

The enormous data growth now presented a big challenge for the organizations who wanted to build intelligent systems based on the data and provide near real time superior user experience to their customers. Various organizations immediately start building data warehousing solutions where the data was stored and processed. The trend of the business intelligence becomes the need of everyday. Data was received from the transaction system and overnight was processed to build intelligent reports from it. Though this is a great solution it has its own set of challenges. The relational database model and data warehousing concepts are all built with keeping traditional relational database modeling in the mind and it still has many challenges when unstructured data was present.

Interesting Challenge

challenges Big Data   Evolution of Big Data   Day 3 of 21Every organization had expertise to manage structured data but the world had already changed to unstructured data. There was intelligence in the videos, photos, SMS, text, social media messages and various other data sources. All of these needed to now bring to a single platform and build a uniform system which does what businesses need. The way we do business has also been changed. There was a time when user only got the features what technology supported, however, now users ask for the feature and technology is built to support the same. The need of the real time intelligence from the fast paced data flow is now becoming a necessity.

Large amount (Volume) of difference (Variety) of high speed data (Velocity) is the properties of the data. The traditional database system has limits to resolve the challenges this new kind of the data presents. Hence the need of the Big Data Science. We need innovation in how we handle and manage data. We need creative ways to capture data and present to users.

Big Data is Reality!


In tomorrow’s blog post we will try to answer discuss Basics of Big Data Architecture.

Reference: Pinal Dave (

Big Data – What is Big Data – 3 Vs of Big Data – Volume, Velocity and Variety – Day 2 of 21

3vs Big Data   What is Big Data   3 Vs of Big Data   Volume, Velocity and Variety   Day 2 of 21Data is forever. Think about it – it is indeed true. Are you using any application as it is which was built 10 years ago? Are you using any piece of hardware which was built 10 years ago? The answer is most certainly No. However, if I ask you – are you using any data which were captured 50 years ago, the answer is most certainly Yes. For example, look at the history of our nation. I am from India and we have documented history which goes back as over 1000s of year. Well, just look at our birthday data – atleast we are using it till today. Data never gets old and it is going to stay there forever.  Application which interprets and analysis data got changed but the data remained in its purest format in most cases.

As organizations have grown the data associated with them also grew exponentially and today there are lots of complexity to their data. Most of the big organizations have data in multiple applications and in different formats. The data is also spread out so much that it is hard to categorize with a single algorithm or logic. The mobile revolution which we are experimenting right now has completely changed how we capture the data and build intelligent systems.  Big organizations are indeed facing challenges to keep all the data on a platform which give them a  single consistent view of their data. This unique challenge to make sense of all the data coming in from different sources and deriving the useful actionable information out of is the revolution Big Data world is facing.

Defining Big Data

The 3Vs that define Big Data are Variety, Velocity and Volume.

3vs Big Data   What is Big Data   3 Vs of Big Data   Volume, Velocity and Variety   Day 2 of 21


We currently see the exponential growth in the data storage as the data is now more than text data. We can find data in the format of videos, musics and large images on our social media channels. It is very common to have Terabytes and Petabytes of the storage system for enterprises. As the database grows the applications and architecture built to support the data needs to be reevaluated quite often. Sometimes the same data is re-evaluated with multiple angles and even though the original data is the same the new found intelligence creates explosion of the data. The big volume indeed represents Big Data.


The data growth and social media explosion have changed how we look at the data. There was a time when we used to believe that data of yesterday is recent. The matter of the fact newspapers is still following that logic. However, news channels and radios have changed how fast we receive the news. Today, people reply on social media to update them with the latest happening. On social media sometimes a few seconds old messages (a tweet, status updates etc.) is not something interests users. They often discard old messages and pay attention to recent updates. The data movement is now almost real time and the update window has reduced to fractions of the seconds. This high velocity data represent Big Data.


Data can be stored in multiple format. For example database, excel, csv, access or for the matter of the fact, it can be stored in a simple text file. Sometimes the data is not even in the traditional format as we assume, it may be in the form of video, SMS, pdf or something we might have not thought about it. It is the need of the organization to arrange it and make it meaningful. It will be easy to do so if we have data in the same format, however it is not the case most of the time. The real world have data in many different formats and that is the challenge we need to overcome with the Big Data. This variety of the data represent  represent Big Data.

Big Data in Simple Words

Big Data is not just about lots of data, it is actually a concept providing an opportunity to find new insight into your existing data as well guidelines to capture and analysis your future data. It makes any business more agile and robust so it can adapt and overcome business challenges.


In tomorrow’s blog post we will try to answer discuss Evolution of Big Data.

Reference: Pinal Dave (

Big Data – Beginning Big Data – Day 1 of 21

What is Big Data?

I want to learn Big Data. I have no clue where and how to start learning about it.

Does Big Data really means data is big?

What are the tools and software I need to know to learn Big Data?

I often receive questions which I mentioned above. They are good questions and honestly when we search online, it is hard to find authoritative and authentic answers. I have been working with Big Data and NoSQL for a while and I have decided that I will attempt to discuss this subject over here in the blog.

In the next 21 days we will understand what is so big about Big Data.

Big Data – Big Thing!

Big Data is becoming one of the most talked about technology trends nowadays. The real challenge with the big organization is to get maximum out of the data already available and predict what kind of data to collect in the future. How to take the existing data and make it meaningful that it provides us accurate insight in the past data is one of the key discussion points in many of the executive meetings in organizations. With the explosion of the data the challenge has gone to the next level and now a Big Data is becoming the reality in many organizations.

Big Data – A Rubik’s Cube

rubiks cube Big Data   Beginning Big Data   Day 1 of 21I like to compare big data with the Rubik’s cube. I believe they have many similarities. Just like a Rubik’s cube it has many different solutions. Let us visualize a Rubik’s cube solving challenge where there are many experts participating. If you take five Rubik’s cube and mix up the same way and give it to five different expert to solve it. It is quite possible that all the five people will solve the Rubik’s cube in fractions of the seconds but if you pay attention to the same closely, you will notice that even though the final outcome is the same, the route taken to solve the Rubik’s cube is not the same. Every expert will start at a different place and will try to resolve it with different methods. Some will solve one color first and others will solve another color first. Even though they follow the same kind of algorithm to solve the puzzle they will start and end at a different place and their moves will be different at many occasions. It is  nearly impossible to have a exact same route taken by two experts.

Big Market and Multiple Solutions

Big Data is exactly like a Rubik’s cube – even though the goal of every organization and expert is same to get maximum out of the data, the route and the starting point are different for each organization and expert. As organizations are evaluating and architecting big data solutions they are also learning the ways and opportunities which are related to Big Data. There is not a single solution to big data as well there is not a single vendor which can claim to know all about Big Data. Honestly, Big Data is too big a concept and there are many players – different architectures, different vendors and different technology.

What is Next?

In this 31 days series we will be exploring many essential topics related to big data. I do not claim that you will be master of the subject after 31 days but I claim that I will be covering following topics in easy to understand language.

  • Architecture of Big Data
  • Big Data a Management and Implementation
  • Different Technologies – Hadoop, Mapreduce
  • Real World Conversations
  • Best Practices


In tomorrow’s blog post we will try to answer one of the very essential questions – What is Big Data?

Reference: Pinal Dave (

Big Data – Beginning Big Data Series Next Month in 21 Parts

Big Data is the next big thing. There was a time when we used to talk in terms of MB and GB of the data. However, the industry is changing and we are now moving to a conversation where we discuss about data in Petabyte, Exabyte and Zettabyte. It seems that the world is now talking about increased Volume of the data. In simple world we all think that Big Data is nothing but plenty of volume. In reality Big Data is much more than just a huge volume of the data. When talking about the data we need to understand about variety and volume along with volume. Though Big data look like a simple concept, it is extremely complex subject when we attempt to start learning the same.

My Journey

I have recently presented on Big Data in quite a few organizations and I have received quite a few questions during this roadshow event. I have collected all the questions which I have received and decided to post about them on the blog. In the month of October 2013, on every weekday we will be learning something new about Big Data. Every day I will share a concept/question and in the same blog post we will learn the answer of the same.

bigdata foto Big Data   Beginning Big Data Series Next Month in 21 Parts

Big Data – Plenty of Questions

I received quite a few questions during my road trip. Here are few of the questions.

  • I want to learn Big Data – where should I start?
  • Do I need to know SQL to learn Big Data?
  • What is Hadoop?
  • There are so many organizations talking about Big Data, and every one has a different approach. How to start with big Data?
  • Do I need to know Java to learn about Big Data?
  • What is different between various NoSQL languages.

I will attempt to answer most of the questions during the month long series in the next month.

Big Data – Big Subject

Big Data is a very big subject and I no way claim that I will be covering every single big data concept in this series. However, I promise that I will be indeed sharing lots of basic concepts which are revolving around Big Data. We will discuss from fundamentals about Big Data and continue further learning about it. I will attempt to cover the concept so simple that many of you might have wondered about it but afraid to ask.

Your Role!

During this series next month, I need your one help. Please keep on posting questions you might have related to big data as blog post comments and on Facebook Page. I will monitor them closely and will try to answer them as well during this series.

Now make sure that you do not miss any single blog post in this series as every blog post will be linked to each other. You can subscribe to my feed or like my Facebook page or subscribe via email (by entering email in the blog post).

Reference: Pinal Dave (

SQL – Download FREE Book – Data Access for HighlyScalable Solutions: Using SQL, NoSQL, and Polyglot Persistence

Recently I was preparing for Big Data and I ended up on very interesting read for everybody. This is created by Microsoft and it is indeed a fantastic read as per my opinion. It took me some time to read this entire book but it was worth reading this as it tried to answer two of the very interesting questions related to muscle.

Here is the abstract from the book:

Organizations seeking to use a NoSQL database are therefore faced with a twofold challenge:

• Which NoSQL database(s) best meet(s) the needs of the organization?
• How does an organization integrate a NoSQL database into its solutions?

As I keep on reading the book, I find it very interesting and informative. I suggest if you have time this weekend, download the book and read it. This guide focuses on the most common types of NoSQL database currently available, describes the situations for which they are most suited, and shows examples of how you might incorporate them into a business application. The guide summarizes the experiences of a fictitious organization named Adventure Works, who implemented a solution that comprised an assortment of different databases.

Download Data Access for HighlyScalable Solutions:  Using SQL, NoSQL,  and Polyglot Persistence

While we are talking about Big Data and NoSQL do not forget to check out my tomorrow’s blog as I am going to talk about the same subject and it will be very interesting.

Reference: Pinal Dave (