Big Data – Operational Databases Supporting Big Data – Key-Value Pair Databases and Document Databases – Day 13 of 21

In yesterday’s blog post we learned the importance of the Relational Database and NoSQL database in the Big Data Story. In this article we will understand the role of Key-Value Pair Databases and Document Databases Supporting Big Data Story.

Now we will see a few of the examples of the operational databases.

  • Relational Databases (Yesterday’s post)
  • NoSQL Databases (Yesterday’s post)
  • Key-Value Pair Databases (This post)
  • Document Databases (This post)
  • Columnar Databases (Tomorrow’s post)
  • Graph Databases (Tomorrow’s post)
  • Spatial Databases (Tomorrow’s post)

Key Value Pair Databases

Key Value Pair Databases are also known as KVP databases. A key is a field name and attribute, an identifier. The content of that field is its value, the data that is being identified and stored.

They have a very simple implementation of NoSQL database concepts. They do not have schema hence they are very flexible as well as scalable. The disadvantages of Key Value Pair (KVP) database are that they do not follow ACID (Atomicity, Consistency, Isolation, Durability) properties. Additionally, it will require data architects to plan for data placement, replication as well as high availability. In KVP databases the data is stored as strings.

Here is a simple example of how Key Value Database will look like:

Key Value
Name Pinal Dave
Color Blue
Twitter @pinaldave
Name Nupur Dave
Movie The Hero

As the number of users grow in Key Value Pair databases it starts getting difficult to manage the entire database. As there is no specific schema or rules associated with the database, there are chances that database grows exponentially as well. It is very crucial to select the right Key Value Pair Database which offers an additional set of tools to manage the data and provides finer control over various business aspects of the same.


Riack is one of the most popular Key Value Database. It is known for its scalability and performance in high volume and velocity database. Additionally, it implements a mechanism for collection key and values which further helps to build manageable system. We will further discuss Riak in future blog posts.

Key Value Databases are a good choice for social media, communities, caching layers for connecting other databases. In simpler words, whenever we required flexibility of the data storage keeping scalability in mind – KVP databases are good options to consider.

Document Database

There are two different kinds of document databases. 1) Full document Content (web pages, word docs etc) and 2) Storing Document Components for storage. The second types of the document database we are talking about over here. They use Javascript Object Notation (JSON) and Binary JSON for the structure of the documents. JSON is very easy to understand language and it is very easy to write for applications. There are two major structures of JSON used for Document Database – 1) Name Value Pairs and 2) Ordered List.

MongoDB and CouchDB are two of the most popular Open Source NonRelational Document Database.


MongoDB databases are called collections. Each collection is build of documents and each document is composed of fields. MongoDB collections can be indexed for optimal performance. MongoDB ecosystem is highly available, supports query services as well as MapReduce. It is often used in high volume content management system.


CouchDB databases are composed of documents which consists fields and attachments (known as description). It supports ACID properties. The main attraction points of CouchDB are that it will continue to operate even though network connectivity is sketchy. Due to this nature CouchDB prefers local data storage.

Document Database is a good choice of the database when users have to generate dynamic reports from elements which are changing very frequently. A good example of document usages is in real time analytics in social networking or content management system.


In tomorrow’s blog post we will discuss about various other Operational Databases supporting Big Data.

Reference: Pinal Dave (

Big Data – Operational Databases Supporting Big Data – RDBMS and NoSQL – Day 12 of 21

In yesterday’s blog post we learned the importance of the Cloud in the Big Data Story. In this article we will understand the role of Operational Databases Supporting Big Data Story.

Even though we keep on talking about Big Data architecture, it is extremely crucial to understand that Big Data system can’t just exist in the isolation of itself. There are many needs of the business can only be fully filled with the help of the operational databases. Just having a system which can analysis big data may not solve every single data problem.

Real World Example

Think about this way, you are using Facebook and you have just updated your information about the current relationship status. In the next few seconds the same information is also reflected in the timeline of your partner as well as a few of the immediate friends. After a while you will notice that the same information is now also available to your remote friends. Later on when someone searches for all the relationship changes with their friends your change of the relationship will also show up in the same list. Now here is the question – do you think Big Data architecture is doing every single of these changes? Do you think that the immediate reflection of your relationship changes with your family member is also because of the technology used in Big Data. Actually the answer is Facebook uses MySQL to do various updates in the timeline as well as various events we do on their homepage. It is really difficult to part from the operational databases in any real world business.

Now we will see a few of the examples of the operational databases.

  • Relational Databases (This blog post)
  • NoSQL Databases (This blog post)
  • Key-Value Pair Databases (Tomorrow’s post)
  • Document Databases (Tomorrow’s post)
  • Columnar Databases (The Day After’s post)
  • Graph Databases (The Day After’s post)
  • Spatial Databases (The Day After’s post)

Relational Databases

We have earlier discussed about the RDBMS role in the Big Data’s story in detail so we will not cover it extensively over here. Relational Database is pretty much everywhere in most of the businesses which are here for many years. The importance and existence of the relational database are always going to be there as long as there are meaningful structured data around. There are many different kinds of relational databases for example Oracle, SQL Server, MySQL and many others. If you are looking for Open Source and widely accepted database, I suggest to try MySQL as that has been very popular in the last few years. I also suggest you to try out PostgreSQL as well. Besides many other essential qualities PostgreeSQL have very interesting licensing policies. PostgreSQL licenses allow modifications and distribution of the application in open or closed (source) form. One can make any modifications and can keep it private as well as well contribute to the community. I believe this one quality makes it much more interesting to use as well it will play very important role in future.

Nonrelational Databases (NOSQL)

We have also covered Nonrelational Dabases in earlier blog posts. NoSQL actually stands for Not Only SQL Databases. There are plenty of NoSQL databases out in the market and selecting the right one is always very challenging. Here are few of the properties which are very essential to consider when selecting the right NoSQL database for operational purpose.

  • Data and Query Model
  • Persistence of Data and Design
  • Eventual Consistency
  • Scalability

Though above all of the properties are interesting to have in any NoSQL database but the one which most attracts to me is Eventual Consistency.

Eventual Consistency

RDBMS uses ACID (Atomicity, Consistency, Isolation, Durability) as a key mechanism for ensuring the data consistency, whereas NonRelational DBMS uses BASE for the same purpose. Base stands for Basically Available, Soft state and Eventual consistency. Eventual consistency is widely deployed in distributed systems. It is a consistency model used in distributed computing which expects unexpected often. In large distributed system, there are always various nodes joining and various nodes being removed as they are often using commodity servers. This happens either intentionally or accidentally. Even though one or more nodes are down, it is expected that entire system still functions normally. Applications should be able to do various updates as well as retrieval of the data successfully without any issue. Additionally, this also means that system is expected to return the same updated data anytime from all the functioning nodes. Irrespective of when any node is joining the system, if it is marked to hold some data it should contain the same updated data eventually.

As per Wikipedia – Eventual consistency is a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.

In other words –  Informally, if no additional updates are made to a given data item, all reads to that item will eventually return the same value.


In tomorrow’s blog post we will discuss about various other Operational Databases supporting Big Data.

Reference: Pinal Dave (

Big Data – Role of Cloud Computing in Big Data – Day 11 of 21

In yesterday’s blog post we learned the importance of the NewSQL. In this article we will understand the role of Cloud in Big Data Story

What is Cloud?

Cloud is the biggest buzzword around from last few years. Everyone knows about the Cloud and it is extremely well defined online. In this article we will discuss cloud in the context of the Big Data. Cloud computing is a method of providing a shared computing resources to the application which requires dynamic resources. These resources include applications, computing, storage, networking, development and various deployment platforms. The fundamentals of the cloud computing are that it shares pretty much share all the resources and deliver to end users as a service.

 Examples of the Cloud Computing and Big Data are Google and Both have fantastic Big Data offering with the help of the cloud. We will discuss this later in this blog post.

There are two different Cloud Deployment Models: 1) The Public Cloud and 2) The Private Cloud

Public Cloud

Public Cloud is the cloud infrastructure build by commercial providers (Amazon, Rackspace etc.) creates a highly scalable data center that hides the complex infrastructure from the consumer and provides various services.

Private Cloud

Private Cloud is the cloud infrastructure build by a single organization where they are managing highly scalable data center internally.

Here is the quick comparison between Public Cloud and Private Cloud from Wikipedia:

  Public Cloud Private Cloud
Initial cost Typically zero Typically high
Running cost Unpredictable Unpredictable
Customization Impossible Possible
Privacy No (Host has access to the data Yes
Single sign-on Impossible Possible
Scaling up Easy while within defined limits Laborious but no limits

Hybrid Cloud

Hybrid Cloud is the cloud infrastructure build with the composition of two or more clouds like public and private cloud. Hybrid cloud gives best of the both the world as it combines multiple cloud deployment models together.

Cloud and Big Data – Common Characteristics

There are many characteristics of the Cloud Architecture and Cloud Computing which are also essentially important for Big Data as well. They highly overlap and at many places it just makes sense to use the power of both the architecture and build a highly scalable framework.

Here is the list of all the characteristics of cloud computing important in Big Data

  • Scalability
  • Elasticity
  • Ad-hoc Resource Pooling
  • Low Cost to Setup Infastructure
  • Pay on Use or Pay as you Go
  • Highly Available

Leading Big Data Cloud Providers

There are many players in Big Data Cloud but we will list a few of the known players in this list.


Amazon is arguably the most popular Infrastructure as a Service (IaaS) provider. The history of how Amazon started in this business is very interesting. They started out with a massive infrastructure to support their own business. Gradually they figured out that their own resources are underutilized most of the time. They decided to get the maximum out of the resources they have and hence  they launched their Amazon Elastic Compute Cloud (Amazon EC2) service in 2006. Their products have evolved a lot recently and now it is one of their primary business besides their retail selling.

Amazon also offers Big Data services understand Amazon Web Services. Here is the list of the included services:

  • Amazon Elastic MapReduce – It processes very high volumes of data
  • Amazon DynammoDB – It is fully managed NoSQL (Not Only SQL) database service
  • Amazon Simple Storage Services (S3) – A web-scale service designed to store and accommodate any amount of data
  • Amazon High Performance Computing – It provides low-tenancy tuned high performance computing cluster
  • Amazon RedShift – It is petabyte scale data warehousing service


Though Google is known for Search Engine, we all know that it is much more than that.

  • Google Compute Engine – It offers secure, flexible computing from energy efficient data centers
  • Google Big Query – It allows SQL-like queries to run against large datasets
  • Google Prediction API – It is a cloud based machine learning tool

Other Players

Besides Amazon and Google we also have other players in the Big Data market as well. Microsoft is also attempting Big Data with the Cloud with Microsoft Azure. Additionally Rackspace and NASA together have initiated OpenStack. The goal of Openstack is to provide a massively scaled, multitenant cloud that can run on any hardware.

Thing to Watch

The cloud based solutions provides a great integration with the Big Data’s story as well it is very economical to implement as well. However, there are few things one should be very careful when deploying Big Data on cloud solutions. Here is a list of a few things to watch:

  • Data Integrity
  • Initial Cost
  • Recurring Cost
  • Performance
  • Data Access Security
  • Location
  • Compliance

Every company have different approaches to Big Data and have different rules and regulations. Based on various factors, one can implement their own custom Big Data solution on a cloud.


In tomorrow’s blog post we will discuss about various Operational Databases supporting Big Data.

Reference: Pinal Dave (

Big Data – Buzz Words: What is NewSQL – Day 10 of 21

In yesterday’s blog post we learned the importance of the relational database. In this article we will take a quick look at the what is NewSQL.

What is NewSQL?

NewSQL stands for new scalable and high performance SQL Database vendors. The products sold by NewSQL vendors are horizontally scalable. NewSQL is not kind of databases but it is about vendors who supports emerging data products with relational database properties (like ACID, Transaction etc.) along with high performance. Products from NewSQL vendors usually follow in memory data for speedy access as well are available immediate scalability.

NewSQL term was coined by 451 groups analyst Matthew Aslett in this particular blog post.

On the definition of NewSQL, Aslett writes:

NewSQL” is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as ‘ScalableSQL‘ to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term ‘NewSQL’ in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.

In other words – NewSQL incorporates the concepts and principles of Structured Query Language (SQL) and NoSQL languages. It combines reliability of SQL with the speed and performance of NoSQL.

Categories of NewSQL

There are three major categories of the NewSQL

New Architecture – In this framework each node owns a subset of the data and queries are split into smaller query to sent to nodes to process the data. E.g. NuoDB, Clustrix, VoltDB

MySQL Engines – Highly Optimized storage engine for SQL with the interface of MySQ Lare the example of such category. E.g. InnoDB, Akiban

Transparent Sharding – This system automatically split database across multiple nodes. E.g. Scalearc 


In simple words – NewSQL is kind of database following relational database principals and provides scalability like NoSQL.


In tomorrow’s blog post we will discuss about the Role of Cloud Computing in Big Data.

Reference: Pinal Dave (

Big Data – Buzz Words: Importance of Relational Database in Big Data World – Day 9 of 21

In yesterday’s blog post we learned what is HDFS. In this article we will take a quick look at the importance of the Relational Database in Big Data world.

A Big Question?

Here are a few questions I often received since the beginning of the Big Data Series -

  • Does the relational database have no space in the story of the Big Data?
  • Does relational database is no longer relevant as Big Data is evolving?
  • Is relational database not capable to handle Big Data?
  • Is it true that one no longer has to learn about relational data if Big Data is the final destination?

Well, every single time when I hear that one person wants to learn about Big Data and is no longer interested in learning about relational database, I find it as a bit far stretched.

I am not here to give ambiguous answers of It Depends. I am personally very clear that one who is aspiring to become Big Data Scientist or Big Data Expert they should learn about relational database.

NoSQL Movement

The reason for the NoSQL Movement in recent time was because of the two important advantages of the NoSQL databases.

  1. Performance
  2. Flexible Schema

In personal experience I have found that when I use NoSQL I have found both of the above listed advantages when I use NoSQL database. There are instances when I found relational database too much restrictive when my data is unstructured as well as they have in the datatype which my Relational Database does not support. It is the same case when I have found that NoSQL solution performing much better than relational databases. I must say that I am a big fan of NoSQL solutions in the recent times but I have also seen occasions and situations where relational database is still perfect fit even though the database is growing increasingly as well have all the symptoms of the big data.

Situations in Relational Database Outperforms

Adhoc reporting is the one of the most common scenarios where NoSQL is does not have optimal solution. For example reporting queries often needs to aggregate based on the columns which are not indexed as well are built while the report is running, in this kind of scenario NoSQL databases (document database stores, distributed key value stores) database often does not perform well. In the case of the ad-hoc reporting I have often found it is much easier to work with relational databases.

SQL is the most popular computer language of all the time. I have been using it for almost over 10 years and I feel that I will be using it for a long time in future. There are plenty of the tools, connectors and awareness of the SQL language in the industry. Pretty much every programming language has a written drivers for the SQL language and most of the developers have learned this language during their school/college time. In many cases, writing query based on SQL is much easier than writing queries in NoSQL supported languages. I believe this is the current situation but in the future this situation can reverse when No SQL query languages are equally popular.

ACID (Atomicity Consistency Isolation Durability) – Not all the NoSQL solutions offers ACID compliant language. There are always situations (for example banking transactions, eCommerce shopping carts etc.) where if there is no ACID the operations can be invalid as well database integrity can be at risk. Even though the data volume indeed qualify as a Big Data there are always operations in the application which absolutely needs ACID compliance matured language.

The Mixed Bag

I have often heard argument that all the big social media sites now a days have moved away from Relational Database. Actually this is not entirely true. While researching about Big Data and Relational Database, I have found that many of the popular social media sites uses Big Data solutions along with Relational Database. Many are using relational databases to deliver the results to end user on the run time and many still uses a relational database as their major backbone.

Here are a few examples:

There are many for prominent organizations which are running large scale applications uses relational database along with various Big Data frameworks to satisfy their various business needs.


I believe that RDBMS is like a vanilla ice cream. Everybody loves it and everybody has it. NoSQL and other solutions are like chocolate ice cream or custom ice cream – there is a huge base which loves them and wants them but not every ice cream maker can make it just right  for everyone’s taste. No matter how fancy an ice cream store is there is always plain vanilla ice cream available there. Just like the same, there are always cases and situations in the Big Data’s story where traditional relational database is the part of the whole story. In the real world scenarios there will be always the case when there will be need of the relational database concepts and its ideology. It is extremely important to accept relational database as one of the key components of the Big Data instead of treating it as a substandard technology.

Ray of Hope – NewSQL

In this module we discussed that there are places where we need ACID compliance from our Big Data application and NoSQL will not support that out of box. There is a new termed coined for the application/tool which supports most of the properties of the traditional RDBMS and supports Big Data infrastructure – NewSQL.


In tomorrow’s blog post we will discuss about NewSQL.

Reference: Pinal Dave (

Big Data – Buzz Words: What is HDFS – Day 8 of 21

In yesterday’s blog post we learned what is MapReduce. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – HDFS.

What is HDFS ?

HDFS stands for Hadoop Distributed File System and it is a primary storage system used by Hadoop. It provides high performance access to data across Hadoop clusters. It is usually deployed on low-cost commodity hardware. In commodity hardware deployment server failures are very common. Due to the same reason HDFS is built to have high fault tolerance. The data transfer rate between compute nodes in HDFS is very high, which leads to reduced risk of failure.

HDFS creates smaller pieces of the big data and distributes it on different nodes. It also copies each smaller piece to multiple times on different nodes. Hence when any node with the data crashes the system is automatically able to use the data from a different node and continue the process. This is the key feature of the HDFS system.

Architecture of HDFS

The architecture of the HDFS is master/slave architecture. An HDFS cluster always consists of single NameNode. This single NameNode is a master server and it manages the file system as well regulates access to various files. In additional to NameNode there are multiple DataNodes. There is always one DataNode for each data server. In HDFS a big file is split into one or more blocks and those blocks are stored in a set of DataNodes.

The primary task of the NameNode is to open, close or rename files and directory and regulate access to the file system, whereas the primary task of the DataNode is read and write to the file systems. DataNode is also responsible for the creation, deletion or replication of the data based on the instruction from NameNode.

In reality, NameNode and DataNode are software designed to run on commodity machine build in Java language.

Visual Representation of HDFS Architecture

Let us understand how HDFS works with the help of the diagram. Client APP or HDFS Client connects to NameSpace as well as DataNode. Client App access to the DataNode is regulated by NameSpace Node. NameSpace Node allows Client App to connect to the DataNode based by allowing the connection to the DataNode directly. A big data file is divided into multiple data blocks (let us assume that those data chunks are A,B,C and D. Client App will later on write data blocks directly to the DataNode. Client App does not have to directly write to all the node. It just has to write to any one of the node and NameNode will decide on which other DataNode it will have to replicate the data. In our example Client App directly writes to DataNode 1 and detained 3. However, data chunks are automatically replicated to other nodes. All the information like in which DataNode which data block is placed is written back to NameNode.

High Availability During Disaster

Now as multiple DataNode have same data blocks in the case of any DataNode which faces the disaster, the entire process will continue as other DataNode will assume the role to serve the specific data block which was on the failed node. This system provides very high tolerance to disaster and provides high availability.

If you notice there is only single NameNode in our architecture. If that node fails our entire Hadoop Application will stop performing as it is a single node where we store all the metadata. As this node is very critical, it is usually replicated on another clustered as well as on another data rack. Though, that replicated node is not operational in architecture, it has all the necessary data to perform the task of the NameNode in the case of the NameNode fails.

The entire Hadoop architecture is built to function smoothly even there are node failures or hardware malfunction. It is built on the simple concept that data is so big it is impossible to have come up with a single piece of the hardware which can manage it properly. We need lots of commodity (cheap) hardware to manage our big data and hardware failure is part of the commodity servers. To reduce the impact of hardware failure Hadoop architecture is built to overcome the limitation of the non-functioning hardware.


In tomorrow’s blog post we will discuss the importance of the relational database in Big Data.

Reference: Pinal Dave (

Big Data – Buzz Words: What is MapReduce – Day 7 of 21

In yesterday’s blog post we learned what is Hadoop. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – MapReduce.

What is MapReduce?

MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Though, MapReduce was originally Google proprietary technology, it has been quite a generalized term in the recent time.

MapReduce comprises a Map() and Reduce() procedures. Procedure Map() performance filtering and sorting operation on data where as procedure Reduce() performs a summary operation of the data. This model is based on modified concepts of the map and reduce functions commonly available in functional programing. The library where procedure Map() and Reduce() belongs is written in many different languages. The most popular free implementation of MapReduce is Apache Hadoop which we will explore tomorrow.

Advantages of MapReduce Procedures

The MapReduce Framework usually contains distributed servers and it runs various tasks in parallel to each other. There are various components which manages the communications between various nodes of the data and provides the high availability and fault tolerance. Programs written in MapReduce functional styles are automatically parallelized and executed on commodity machines. The MapReduce Framework takes care of the details of partitioning the data and executing the processes on distributed server on run time. During this process if there is any disaster the framework provides high availability and other available modes take care of the responsibility of the failed node.

As you can clearly see more this entire MapReduce Frameworks provides much more than just Map() and Reduce() procedures; it provides scalability and fault tolerance as well. A typical implementation of the MapReduce Framework processes many petabytes of data and thousands of the processing machines.

How do MapReduce Framework Works?

A typical MapReduce Framework contains petabytes of the data and thousands of the nodes. Here is the basic explanation of the MapReduce Procedures which uses this massive commodity of the servers.

Map() Procedure

There is always a master node in this infrastructure which takes an input. Right after taking input master node divides it into smaller sub-inputs or sub-problems. These sub-problems are distributed to worker nodes. A worker node later processes them and does necessary analysis. Once the worker node completes the process with this sub-problem it returns it back to master node.

Reduce() Procedure

All the worker nodes return the answer to the sub-problem assigned to them to master node. The master node collects the answer and once again aggregate that in the form of the answer to the original big problem which was assigned master node.

The MapReduce Framework does the above Map () and Reduce () procedure in the parallel and independent to each other. All the Map() procedures can run parallel to each other and once each worker node had completed their task they can send it back to master code to compile it with a single answer. This particular procedure can be very effective when it is implemented on a very large amount of data (Big Data).

The MapReduce Framework has five different steps:

  • Preparing Map() Input
  • Executing User Provided Map() Code
  • Shuffle Map Output to Reduce Processor
  • Executing User Provided Reduce Code
  • Producing the Final Output

Here is the Dataflow of MapReduce Framework:

  • Input Reader
  • Map Function
  • Partition Function
  • Compare Function
  • Reduce Function
  • Output Writer

In a future blog post of this 31 day series we will explore various components of MapReduce in Detail.

MapReduce in a Single Statement

MapReduce is equivalent to SELECT and GROUP BY of a relational database for a very large database.


In tomorrow’s blog post we will discuss Buzz Word – HDFS.

Reference: Pinal Dave (