Big Data – Basics of Big Data Analytics – Day 18 of 21

In yesterday’s blog post we learned the importance of the various components in the Big Data story. In this article we will understand the various analytics tasks we try to achieve with Big Data, along with a list of the important tools in the Big Data story.

When you have plenty of data around you, what is the first thing that comes to your mind?

“What does all this data mean?”

Exactly – the same thought comes to my mind as well. I have always wanted to know what all the data means and what meaningful information I can extract from it. Most Big Data projects are built to retrieve the various intelligence this data contains within it. Let us take the example of Facebook. When I look at my friends list on Facebook, I always want to ask many questions, such as –

  • On which date do the maximum number of my friends have a birthday?
  • Which film do most of my friends like best, so that I can talk about it and engage them?
  • Which travel destination do my friends like the most?
  • Which cuisine do my friends in India and the USA dislike the most, so that when they travel, I do not take them there?

There are many more questions I can think of. This illustrates how important it is to analyze Big Data.

Here are a few of the kinds of analysis you can perform on Big Data.

Slicing and Dicing: This means breaking your data down into smaller sets and understanding them one set at a time. It also helps present the information in a variety of user-digestible ways. For example, if you have data related to movies, you can slice and dice it along different dimensions such as actor, movie length, etc.
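
For illustration, here is a minimal SQL sketch of slicing and dicing, assuming a hypothetical movies table with release_year, genre and length_minutes columns:

SELECT release_year, genre, COUNT(*) AS movie_count, AVG(length_minutes) AS avg_length
FROM movies
WHERE release_year >= 2010        -- slice: restrict to one subset of the data
GROUP BY release_year, genre      -- dice: break the slice down by further dimensions
ORDER BY release_year, movie_count DESC;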

Real Time Monitoring: This is very crucial in social media when an event is happening and you want to measure its impact while it unfolds. For example, during a football match you can watch on Twitter what fans are saying about the match as the event happens.

Anomaly Prediction and Modeling: If the business is running normally, everything is fine, but if there are signs of trouble, everyone wants to know about them early. Big Data analysis of various patterns can be very helpful in predicting the future. It may not always be accurate, but certain hints and signals can be very useful. For example, lots of data can help conclude that heavy rain increases the sale of umbrellas.

Text and Unstructured Data Analysis: Unstructured data is now becoming the norm in the new world, and it is a big part of the Big Data revolution. It is very important that we Extract, Transform and Load the unstructured data to derive meaningful information out of it. For example, by analyzing lots of images, one can predict that people like to wear certain colors in certain months.

Big Data Analytics Solutions

There are many different Big Data analytics solutions out in the market. It is impossible to list all of them, so I will list a few of them over here.

  • Tableau – This has to be one of the most popular visualization tools out in the big data market.
  • SAS – A high performance analytics and infrastructure company
  • IBM and Oracle – They have a range of tools for Big Data Analysis

Tomorrow

In tomorrow’s blog post we will discuss a very important component of the Big Data ecosystem – the Data Scientist.

Reference: Pinal Dave (http://blog.sqlauthority.com)

Big Data – Interacting with Hadoop – What is Sqoop? – What is Zookeeper? – Day 17 of 21

In yesterday’s blog post we learned the importance of Pig and Pig Latin in the Big Data story. In this article we will understand what Sqoop and Zookeeper are in the Big Data story.

There are two important components one should learn about when learning to interact with Hadoop – Sqoop and Zookeeper.

What is Sqoop?

Most businesses store their data in an RDBMS or other data warehouse solutions. They need a way to move data into the Hadoop system for various processing, and to return it from Hadoop back to the RDBMS. The data movement can happen in real time or in bulk at various intervals. We need a tool which can help us move this data from SQL to Hadoop and from Hadoop to SQL. Sqoop (SQL to Hadoop) is such a tool: it extracts data from non-Hadoop data sources, transforms it into a format Hadoop can use, and then loads it into HDFS. Essentially it is an ETL tool: it Extracts, Transforms and Loads from SQL to Hadoop. The best part is that it also extracts data from Hadoop and loads it back into relational (RDBMS) data stores. In short, Sqoop is a command line tool which does SQL to Hadoop and Hadoop to SQL. Behind the scenes it creates MapReduce jobs to import data from an external database into HDFS. It is a very effective and easy-to-learn tool for non-programmers.
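
For illustration, here is a minimal sketch of typical Sqoop commands; the connection string, table names and directories are hypothetical:

# Import a table from MySQL into HDFS (a MapReduce job runs behind the scenes)
sqoop import --connect jdbc:mysql://dbserver/sales --username dbuser -P \
  --table orders --target-dir /user/hadoop/orders

# Export processed results from HDFS back into the relational database
sqoop export --connect jdbc:mysql://dbserver/sales --username dbuser -P \
  --table order_summary --export-dir /user/hadoop/order_summary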

What is Zookeeper?

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In other words, Zookeeper is a replicated synchronization service with eventual consistency. In simpler words – in a Hadoop cluster there are many different nodes, and one node is the master. Let us assume that the master node fails for some reason. In that case, the role of the master node has to be transferred to a different node. The main role of the master node is managing the writers, as that task requires persistence in the order of writing. In this kind of scenario, Zookeeper will assign a new master node and make sure that the Hadoop cluster performs without any glitch. Zookeeper is Hadoop’s method of coordinating all the elements of these distributed systems. Here are a few of the tasks which Zookeeper is responsible for:

  • Zookeeper manages the entire workflow of starting and stopping various nodes in the Hadoop cluster.
  • When any process in the Hadoop cluster needs a certain configuration to complete a task, Zookeeper makes sure that the node gets the necessary configuration consistently.
  • In case the master node fails, Zookeeper can assign a new master node and make sure the cluster works as expected.

There are many other tasks Zookeeper performs when it comes to Hadoop cluster coordination and communication. Basically, without the help of Zookeeper it is very hard to design a new fault-tolerant distributed application.
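
For illustration, here is a minimal sketch using ZooKeeper’s bundled command line client (zkCli.sh); the znode paths and values are hypothetical. Ephemeral znodes disappear when their creator’s session dies, which is the building block for master election and failure detection:

# Store and read a piece of shared configuration
create /myapp/config "replication=3"
get /myapp/config

# A candidate master registers itself with an ephemeral znode; if its
# session dies, the znode vanishes and another node can take over
create -e /myapp/master "node1"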

Tomorrow

In tomorrow’s blog post we will discuss a very important component of the Big Data ecosystem – Big Data Analytics.

Reference: Pinal Dave (http://blog.sqlauthority.com)

Big Data – Interacting with Hadoop – What is PIG? – What is PIG Latin? – Day 16 of 21

In yesterday’s blog post we learned the importance of HIVE in the Big Data story. In this article we will understand what PIG and PIG Latin are in the Big Data story.

Yahoo started working on Pig for their application deployment on Hadoop. The goal of Yahoo was to manage their unstructured data.

What is Pig and What is Pig Latin?

Pig is a high-level platform for creating MapReduce programs used with Hadoop, and the language we use on this platform is called Pig Latin. Pig was designed to make Hadoop more user-friendly and approachable for power users and non-developers. PIG is an interactive execution environment supporting the Pig Latin language. Pig Latin supports loading input data and processing it with a series of transformations to produce the desired results. PIG has two different execution environments: 1) Local Mode – all the scripts run on a single machine; 2) Hadoop Mode – all the scripts run on a Hadoop cluster.
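
For illustration, here is a minimal Pig Latin sketch of the load-transform-store pattern; the file names and schema are hypothetical:

-- Load raw input, transform it through a series of steps, then store the result
movies    = LOAD 'movies.csv' USING PigStorage(',') AS (title:chararray, actor:chararray, length:int);
long_ones = FILTER movies BY length > 120;
by_actor  = GROUP long_ones BY actor;
counts    = FOREACH by_actor GENERATE group AS actor, COUNT(long_ones) AS movie_count;
STORE counts INTO 'long_movies_by_actor';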

Pig Latin vs SQL

Pig essentially creates a set of map and reduce jobs under the hood. Because of this, users do not have to write, compile and build their own MapReduce solutions for Big Data. Pig is very similar to SQL in many ways. The Pig Latin language provides an abstraction layer over the data; it focuses on the data and not on the structure underneath. Pig Latin is a very powerful language and it can do various operations such as loading and storing data, streaming data and filtering data, as well as various string operations. The major difference between SQL and Pig Latin is that Pig Latin is procedural and SQL is declarative. In simpler words, a Pig Latin script is very similar to a SQL execution plan, and that makes it much easier for programmers to build various processes. Whereas SQL naturally produces query trees, a Pig Latin script forms a directed acyclic graph (DAG). DAGs are used to model several different kinds of structures in mathematics and computer science.

Tomorrow

In tomorrow’s blog post we will discuss very important components of the Big Data ecosystem – Sqoop and Zookeeper.

Reference: Pinal Dave (http://blog.sqlauthority.com)

Big Data – Data Mining with Hive – What is Hive? – What is HiveQL (HQL)? – Day 15 of 21

In yesterday’s blog post we learned the importance of the operational database in the Big Data story. In this article we will understand what Hive and HQL are in the Big Data story.

Yahoo started working on PIG (we will understand that in the next blog post) for their application deployment on Hadoop; the goal of Yahoo was to manage their unstructured data. Similarly, Facebook started deploying their warehouse solutions on Hadoop, which resulted in HIVE. The reason for going with HIVE was that traditional warehousing solutions were getting very expensive.

What is HIVE?

Hive is a data warehousing infrastructure for Hadoop. Its primary responsibility is to provide data summarization, query and analysis. It supports analysis of large datasets stored in Hadoop’s HDFS as well as on the Amazon S3 filesystem. The best part of HIVE is that it supports SQL-like access to structured data, known as HiveQL (or HQL), as well as Big Data analysis with the help of MapReduce. Hive is not built to get quick responses to queries, but rather it is built for data mining applications. Data mining applications can take from several minutes to several hours to analyze the data, and HIVE is primarily used there.

HIVE Organization

The data is organized in three different formats in HIVE.

Tables: They are very similar to RDBMS tables and contain rows and columns. Since Hive is layered over the Hadoop Distributed File System (HDFS), tables are directly mapped to directories of the filesystem. Hive also supports tables stored in other native file systems.

Partitions: A Hive table can have more than one partition. Partitions are mapped to subdirectories in the underlying file system.

Buckets: In Hive, data may be divided into buckets. Buckets are stored as files within partitions in the underlying file system.
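
For illustration, here is a minimal HQL sketch showing all three levels of organization; the table, partition and bucket choices are hypothetical:

CREATE TABLE sales (name STRING, salesprice DOUBLE)
PARTITIONED BY (sale_year INT)        -- each year becomes its own subdirectory
CLUSTERED BY (name) INTO 8 BUCKETS;   -- each partition is split into 8 bucket files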

Hive also has a metastore which stores all the metadata. It is a relational database containing various information related to the Hive schema (column types, owners, key-value data, statistics, etc.). A MySQL database can be used here.

What is HiveQL (HQL)?

The Hive query language provides basic SQL-like operations. Here are a few of the tasks which HQL can do easily:

  • Create and manage tables and partitions
  • Support various Relational, Arithmetic and Logical Operators
  • Evaluate functions
  • Download the contents of a table to a local directory, or the results of queries to an HDFS directory

Here are examples of HQL queries:

SELECT upper(name), salesprice
FROM sales;
SELECT category, count(1) 
FROM products 
GROUP BY category;

When you look at the above queries, you can see that they are very similar to SQL queries.

Tomorrow

In tomorrow’s blog post we will discuss a very important component of the Big Data ecosystem – Pig.

Reference: Pinal Dave (http://blog.sqlauthority.com)

Big Data – Operational Databases Supporting Big Data – Columnar, Graph and Spatial Database – Day 14 of 21

In yesterday’s blog post we learned the importance of Key-Value Pair Databases and Document Databases in the Big Data story. In this article we will understand the role of Columnar, Graph and Spatial Databases in supporting the Big Data story.

Now we will see a few examples of the operational databases.

  • Relational Databases (The day before yesterday’s post)
  • NoSQL Databases (The day before yesterday’s post)
  • Key-Value Pair Databases (Yesterday’s post)
  • Document Databases (Yesterday’s post)
  • Columnar Databases (Today’s post)
  • Graph Databases (Today’s post)
  • Spatial Databases (Today’s post)

Columnar Databases 

A relational database is a row store database, or a row-oriented database. Columnar databases are column-oriented, or column store, databases. As we discussed earlier, in Big Data we have many different kinds of data which we need to store in the database. With a columnar database this is very easy to do, as we can simply add a new column to it. HBase is one of the most popular columnar databases. It uses the Hadoop file system and MapReduce for its core data storage. However, remember that this is not a good solution for every application; it is particularly good for databases where high-volume incremental data is gathered and processed.
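
For illustration, here is a minimal sketch in the HBase shell; the table, column family and column names are hypothetical. Notice how a new column is added simply by writing to it:

create 'users', 'profile'                          # table with one column family
put 'users', 'row1', 'profile:name', 'Pinal Dave'
put 'users', 'row1', 'profile:color', 'Blue'       # a new column, created on the fly
get 'users', 'row1'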

Graph Databases

For highly interconnected data it is suitable to use a Graph Database. This database has a node-relationship structure; nodes and relationships contain key-value pairs where the data is stored. The major advantage of this database is that it supports fast navigation among the various relationships. For example, Facebook uses a graph database to list and demonstrate the various relationships between users. Neo4J is one of the most popular open source graph databases. One major disadvantage of the graph database is that self-reference is not possible (self joins, in RDBMS terms), and there may be real-world scenarios where this is required but the graph database does not support it.
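
For illustration, here is a minimal sketch in Neo4J’s Cypher query language; the labels, relationship type and names are hypothetical:

// Create two users and a friendship relationship between them
CREATE (a:User {name: 'Pinal'})-[:FRIEND]->(b:User {name: 'Nupur'});

// Navigate relationships: find friends of friends of Pinal
MATCH (me:User {name: 'Pinal'})-[:FRIEND]->()-[:FRIEND]->(fof)
RETURN fof.name;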

Spatial Databases 

We all use Foursquare, Google+ and Facebook check-ins for location-aware check-ins. All location-aware applications figure out the position of the phone with the help of the Global Positioning System (GPS). Think about it: so many different users at different locations in the world, all checking in together. Additionally, the applications are now feature-rich and users are demanding more and more information from them, for example nearby movies, coffee shops or places to see. They are all running with the help of spatial databases. Spatial data is standardized by the Open Geospatial Consortium, known as OGC. Spatial data helps answer many interesting questions, like the distance between two locations, the area of interesting places, etc. When you think about it, it is very clear that handling spatial data and returning meaningful results is one big task when there are millions of users moving dynamically from one place to another and requesting various spatial information. The PostGIS/OpenGIS suite is a very popular spatial database. It runs as a layer implementation on the RDBMS PostgreSQL. This makes it quite unique, as it offers the best of both worlds.
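
For illustration, here is a minimal PostGIS sketch; the table, column and coordinates are hypothetical. ST_DWithin finds rows within a given distance of a point:

-- Find coffee shops within 500 meters of a user's current GPS position
SELECT name
FROM coffee_shops
WHERE ST_DWithin(
    location::geography,
    ST_SetSRID(ST_MakePoint(-122.33, 47.61), 4326)::geography,   -- longitude, latitude
    500                                                          -- meters
);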

Tomorrow

In tomorrow’s blog post we will discuss a very important component of the Big Data ecosystem – Hive.

Reference: Pinal Dave (http://blog.sqlauthority.com)

Big Data – Operational Databases Supporting Big Data – Key-Value Pair Databases and Document Databases – Day 13 of 21

In yesterday’s blog post we learned the importance of Relational Databases and NoSQL databases in the Big Data story. In this article we will understand the role of Key-Value Pair Databases and Document Databases in supporting the Big Data story.

Now we will see a few examples of the operational databases.

  • Relational Databases (Yesterday’s post)
  • NoSQL Databases (Yesterday’s post)
  • Key-Value Pair Databases (This post)
  • Document Databases (This post)
  • Columnar Databases (Tomorrow’s post)
  • Graph Databases (Tomorrow’s post)
  • Spatial Databases (Tomorrow’s post)

Key Value Pair Databases

Key Value Pair Databases are also known as KVP databases. A key is a field name, an attribute, an identifier. The content of that field is its value, the data that is being identified and stored.

KVP databases are a very simple implementation of NoSQL database concepts. They do not have a schema, hence they are very flexible as well as scalable. The disadvantage of Key Value Pair (KVP) databases is that they do not follow ACID (Atomicity, Consistency, Isolation, Durability) properties. Additionally, they require data architects to plan for data placement, replication and high availability. In KVP databases the data is stored as strings.

Here is a simple example of what a Key Value database looks like:

Key       Value
Name      Pinal Dave
Color     Blue
Twitter   @pinaldave
Name      Nupur Dave
Movie     The Hero

As the number of users grows in a Key Value Pair database, it starts getting difficult to manage the entire database. As there is no specific schema or set of rules associated with the database, there is a chance that the database grows exponentially as well. It is very crucial to select the right Key Value Pair database, one which offers an additional set of tools to manage the data and provides finer control over various business aspects of it.

Riak

Riak is one of the most popular Key Value databases. It is known for its scalability and performance with high volume and high velocity data. Additionally, it implements a mechanism for collecting keys and values which further helps in building a manageable system. We will discuss Riak further in future blog posts.

Key Value databases are a good choice for social media, communities and caching layers for connecting other databases. In simpler words, whenever we require flexibility in data storage while keeping scalability in mind, KVP databases are a good option to consider.

Document Database

There are two different kinds of document databases: 1) those storing full document content (web pages, Word documents, etc.), and 2) those storing document components. It is the second kind of document database we are talking about here. They use JavaScript Object Notation (JSON) and Binary JSON (BSON) for the structure of the documents. JSON is a very easy-to-understand notation and it is very easy to write for applications. There are two major JSON structures used in document databases: 1) name-value pairs and 2) ordered lists.
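
For illustration, here is a minimal, hypothetical JSON document showing both structures – the object itself is a set of name-value pairs, while favorite_movies is an ordered list:

{
  "name": "Pinal Dave",
  "color": "Blue",
  "twitter": "@pinaldave",
  "favorite_movies": ["The Hero", "Another Movie"]
}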

MongoDB and CouchDB are two of the most popular open source non-relational document databases.

MongoDB

MongoDB databases are called collections. Each collection is built of documents, and each document is composed of fields. MongoDB collections can be indexed for optimal performance. The MongoDB ecosystem is highly available and supports query services as well as MapReduce. It is often used in high-volume content management systems.
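
For illustration, here is a minimal sketch in the MongoDB shell; the collection and field names are hypothetical:

// Insert a document into the "users" collection (created on first insert)
db.users.insertOne({ name: "Pinal Dave", color: "Blue", twitter: "@pinaldave" })

// Index a field for optimal performance, then query by it
db.users.createIndex({ twitter: 1 })
db.users.find({ twitter: "@pinaldave" })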

CouchDB

CouchDB databases are composed of documents which consist of fields and attachments. It supports ACID properties. The main attraction of CouchDB is that it will continue to operate even when network connectivity is sketchy; due to this nature, CouchDB prefers local data storage.

A document database is a good choice when users have to generate dynamic reports from elements which change very frequently. A good example of document database usage is real-time analytics in social networking or content management systems.

Tomorrow

In tomorrow’s blog post we will discuss various other Operational Databases supporting Big Data.

Reference: Pinal Dave (http://blog.sqlauthority.com)

Big Data – Operational Databases Supporting Big Data – RDBMS and NoSQL – Day 12 of 21

In yesterday’s blog post we learned the importance of the Cloud in the Big Data story. In this article we will understand the role of Operational Databases in supporting the Big Data story.

Even though we keep on talking about Big Data architecture, it is extremely crucial to understand that a Big Data system cannot exist in isolation. There are many needs of the business which can only be fulfilled with the help of operational databases. Just having a system which can analyze big data may not solve every single data problem.

Real World Example

Think about it this way: you are using Facebook and you have just updated your current relationship status. In the next few seconds the same information is also reflected in the timeline of your partner as well as a few of your immediate friends. After a while you will notice that the same information is also available to your remote friends. Later on, when someone searches for all the relationship changes among their friends, your change of relationship status will also show up in the same list. Now here is the question – do you think Big Data architecture is doing every single one of these changes? Do you think that the immediate reflection of your relationship change to your family members is also because of the technology used in Big Data? Actually, the answer is that Facebook uses MySQL for the various updates in the timeline as well as the various events we perform on their homepage. It is really difficult to part with operational databases in any real-world business.

Now we will see a few examples of the operational databases.

  • Relational Databases (This blog post)
  • NoSQL Databases (This blog post)
  • Key-Value Pair Databases (Tomorrow’s post)
  • Document Databases (Tomorrow’s post)
  • Columnar Databases (The Day After’s post)
  • Graph Databases (The Day After’s post)
  • Spatial Databases (The Day After’s post)

Relational Databases

We have earlier discussed the RDBMS role in the Big Data story in detail, so we will not cover it extensively over here. Relational databases are pretty much everywhere in most businesses which have been around for many years. The importance and existence of relational databases will always be there as long as there is meaningful structured data around. There are many different kinds of relational databases, for example Oracle, SQL Server, MySQL and many others. If you are looking for an open source and widely accepted database, I suggest trying MySQL, as it has been very popular in the last few years. I also suggest you try out PostgreSQL as well. Besides many other essential qualities, PostgreSQL has a very interesting licensing policy. The PostgreSQL license allows modification and distribution of the application in open or closed (source) form. One can make any modifications and keep them private, or contribute them to the community. I believe this one quality makes it much more interesting to use, and it will play a very important role in the future.

Nonrelational Databases (NoSQL)

We have also covered nonrelational databases in earlier blog posts. NoSQL actually stands for Not Only SQL. There are plenty of NoSQL databases out in the market, and selecting the right one is always very challenging. Here are a few of the properties which are essential to consider when selecting the right NoSQL database for operational purposes:

  • Data and Query Model
  • Persistence of Data and Design
  • Eventual Consistency
  • Scalability

Though all of the above properties are important to have in any NoSQL database, the one which attracts me the most is Eventual Consistency.

Eventual Consistency

RDBMSs use ACID (Atomicity, Consistency, Isolation, Durability) as the key mechanism for ensuring data consistency, whereas nonrelational DBMSs use BASE for the same purpose. BASE stands for Basically Available, Soft state and Eventual consistency. Eventual consistency is widely deployed in distributed systems. It is a consistency model used in distributed computing which expects the unexpected, often. In a large distributed system there are always various nodes joining and various nodes being removed, as such systems often use commodity servers. This happens either intentionally or accidentally. Even when one or more nodes are down, it is expected that the entire system still functions normally. Applications should be able to perform various updates as well as retrieval of data successfully without any issue. Additionally, this also means that the system is expected to return the same updated data anytime from all the functioning nodes. Irrespective of when a node joins the system, if it is marked to hold some data it should eventually contain the same updated data.

As per Wikipedia – Eventual consistency is a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.

In other words –  Informally, if no additional updates are made to a given data item, all reads to that item will eventually return the same value.

Tomorrow

In tomorrow’s blog post we will discuss various other Operational Databases supporting Big Data.

Reference: Pinal Dave (http://blog.sqlauthority.com)

Big Data – Role of Cloud Computing in Big Data – Day 11 of 21

In yesterday’s blog post we learned the importance of NewSQL. In this article we will understand the role of the Cloud in the Big Data story.

What is Cloud?

Cloud has been the biggest buzzword around for the last few years. Everyone knows about the Cloud and it is extremely well defined online. In this article we will discuss the cloud in the context of Big Data. Cloud computing is a method of providing shared computing resources to applications which require dynamic resources. These resources include applications, computing, storage, networking, development and various deployment platforms. The fundamental idea of cloud computing is that it shares pretty much all the resources and delivers them to end users as a service.

Examples of Cloud Computing and Big Data together are Google and Amazon.com. Both have fantastic Big Data offerings with the help of the cloud. We will discuss these later in this blog post.

There are two different Cloud Deployment Models: 1) The Public Cloud and 2) The Private Cloud

Public Cloud

A Public Cloud is cloud infrastructure built by commercial providers (Amazon, Rackspace, etc.): a highly scalable data center that hides the complex infrastructure from the consumer and provides various services.

Private Cloud

A Private Cloud is cloud infrastructure built by a single organization, which manages a highly scalable data center internally.

Here is a quick comparison between Public Cloud and Private Cloud from Wikipedia:

                 Public Cloud                       Private Cloud
Initial cost     Typically zero                     Typically high
Running cost     Unpredictable                      Unpredictable
Customization    Impossible                         Possible
Privacy          No (host has access to the data)   Yes
Single sign-on   Impossible                         Possible
Scaling up       Easy while within defined limits   Laborious but no limits

Hybrid Cloud

A Hybrid Cloud is cloud infrastructure built from a composition of two or more clouds, such as a public and a private cloud. A hybrid cloud gives the best of both worlds as it combines multiple cloud deployment models together.

Cloud and Big Data – Common Characteristics

There are many characteristics of Cloud Architecture and Cloud Computing which are also essentially important for Big Data. They overlap heavily, and in many places it simply makes sense to use the power of both architectures and build a highly scalable framework.

Here is a list of the characteristics of cloud computing that are important in Big Data:

  • Scalability
  • Elasticity
  • Ad-hoc Resource Pooling
  • Low Cost to Setup Infrastructure
  • Pay on Use or Pay as you Go
  • Highly Available

Leading Big Data Cloud Providers

There are many players in the Big Data Cloud, but we will list a few of the known players here.

Amazon

Amazon is arguably the most popular Infrastructure as a Service (IaaS) provider. The history of how Amazon started in this business is very interesting. They started out with a massive infrastructure to support their own business. Gradually they figured out that their own resources were underutilized most of the time. They decided to get the maximum out of the resources they had, and hence they launched the Amazon Elastic Compute Cloud (Amazon EC2) service in 2006. Their products have evolved a lot recently, and now it is one of their primary businesses besides their retail selling.

Amazon also offers Big Data services under Amazon Web Services. Here is the list of the included services:

  • Amazon Elastic MapReduce – It processes very high volumes of data
  • Amazon DynamoDB – It is a fully managed NoSQL (Not Only SQL) database service
  • Amazon Simple Storage Service (S3) – A web-scale service designed to store and accommodate any amount of data
  • Amazon High Performance Computing – It provides low-latency, tuned, high performance computing clusters
  • Amazon Redshift – It is a petabyte-scale data warehousing service

Google

Though Google is known for its Search Engine, we all know that it is much more than that. Here are a few of Google’s Big Data offerings:

  • Google Compute Engine – It offers secure, flexible computing from energy efficient data centers
  • Google BigQuery – It allows SQL-like queries to be run against large datasets
  • Google Prediction API – It is a cloud based machine learning tool

Other Players

Besides Amazon and Google, we also have other players in the Big Data market. Microsoft is attempting Big Data in the cloud with Microsoft Azure. Additionally, Rackspace and NASA together have initiated OpenStack. The goal of OpenStack is to provide a massively scalable, multitenant cloud that can run on any hardware.

Things to Watch

Cloud-based solutions provide great integration with the Big Data story, and they are also very economical to implement. However, there are a few things one should be careful about when deploying Big Data on cloud solutions. Here is a list of a few things to watch:

  • Data Integrity
  • Initial Cost
  • Recurring Cost
  • Performance
  • Data Access Security
  • Location
  • Compliance

Every company has a different approach to Big Data and different rules and regulations. Based on various factors, one can implement one’s own custom Big Data solution on a cloud.

Tomorrow

In tomorrow’s blog post we will discuss various Operational Databases supporting Big Data.

Reference: Pinal Dave (http://blog.sqlauthority.com)

Big Data – Buzz Words: What is NewSQL – Day 10 of 21

In yesterday’s blog post we learned the importance of the relational database. In this article we will take a quick look at what NewSQL is.

What is NewSQL?

NewSQL stands for the new scalable and high performance SQL database vendors. The products sold by NewSQL vendors are horizontally scalable. NewSQL is not a kind of database, but rather a class of vendors who support emerging data products with relational database properties (like ACID, transactions, etc.) along with high performance. Products from NewSQL vendors usually keep data in memory for speedy access and offer immediate scalability.

The term NewSQL was coined by 451 Group analyst Matthew Aslett in a blog post.

On the definition of NewSQL, Aslett writes:

“NewSQL” is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as ‘ScalableSQL’ to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term ‘NewSQL’ in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.

In other words – NewSQL incorporates the concepts and principles of Structured Query Language (SQL) and NoSQL languages. It combines the reliability of SQL with the speed and performance of NoSQL.

Categories of NewSQL

There are three major categories of NewSQL:

New Architecture – In this framework each node owns a subset of the data, and queries are split into smaller queries which are sent to the nodes to process the data. E.g. NuoDB, Clustrix, VoltDB

MySQL Engines – Highly optimized storage engines for SQL with the interface of MySQL are examples of this category. E.g. InnoDB, Akiban

Transparent Sharding – These systems automatically split the database across multiple nodes. E.g. ScaleArc

Summary

In simple words – NewSQL is a kind of database which follows relational database principles and provides scalability like NoSQL.

Tomorrow

In tomorrow’s blog post we will discuss the Role of Cloud Computing in Big Data.

Reference: Pinal Dave (http://blog.sqlauthority.com)

Big Data – Buzz Words: Importance of Relational Database in Big Data World – Day 9 of 21

In yesterday’s blog post we learned what HDFS is. In this article we will take a quick look at the importance of the Relational Database in the Big Data world.

A Big Question?

Here are a few questions I have often received since the beginning of the Big Data series –

  • Does the relational database have no place in the Big Data story?
  • Is the relational database no longer relevant as Big Data evolves?
  • Is the relational database not capable of handling Big Data?
  • Is it true that one no longer has to learn about relational databases if Big Data is the final destination?

Well, every single time I hear that someone wants to learn about Big Data and is no longer interested in learning about relational databases, I find it a bit far-fetched.

I am not here to give the ambiguous answer of “It Depends”. I am personally very clear that anyone who aspires to become a Big Data scientist or Big Data expert should learn about relational databases.

NoSQL Movement

The reason for the NoSQL movement in recent times is the two important advantages of NoSQL databases:

  1. Performance
  2. Flexible Schema

In my personal experience I have found both of the above listed advantages when using NoSQL databases. There have been instances when I found relational databases too restrictive, because my data was unstructured or contained datatypes which my relational database did not support. In such cases I have also found NoSQL solutions performing much better than relational databases. I must say that I am a big fan of NoSQL solutions in recent times, but I have also seen occasions and situations where a relational database was still the perfect fit, even though the database was growing rapidly and had all the symptoms of big data.

Situations Where the Relational Database Outperforms

Ad-hoc reporting is one of the most common scenarios for which NoSQL does not have an optimal solution. For example, reporting queries often need to aggregate on columns which are not indexed and which are chosen only while the report is being built; in this kind of scenario, NoSQL databases (document stores, distributed key-value stores) often do not perform well. In the case of ad-hoc reporting I have often found it much easier to work with relational databases.
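
For illustration, here is a minimal sketch of the kind of ad-hoc reporting query a relational database handles comfortably, assuming a hypothetical orders table; the grouping columns are decided at report time and need not be indexed:

SELECT region, product_category, SUM(order_total) AS revenue
FROM orders
WHERE order_date >= '2013-01-01'
GROUP BY region, product_category
ORDER BY revenue DESC;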

SQL is one of the most popular computer languages of all time. I have been using it for over 10 years and I feel that I will be using it for a long time in the future. There are plenty of tools and connectors for the SQL language, and plenty of awareness of it in the industry. Pretty much every programming language has drivers written for SQL databases, and most developers have learned this language during their school/college time. In many cases, writing a query in SQL is much easier than writing a query in a NoSQL-supported language. I believe this is the current situation, but in the future this situation could reverse as NoSQL query languages become equally popular.

ACID (Atomicity, Consistency, Isolation, Durability) – Not all NoSQL solutions offer ACID compliance. There are always situations (for example banking transactions, eCommerce shopping carts, etc.) where without ACID the operations can be invalid and database integrity can be at risk. Even though the data volume may indeed qualify as Big Data, there are always operations in the application which absolutely need an ACID-compliant, mature system.

The Mixed Bag

I have often heard the argument that all the big social media sites have nowadays moved away from relational databases. Actually, this is not entirely true. While researching Big Data and relational databases, I have found that many of the popular social media sites use Big Data solutions along with relational databases. Many use relational databases to deliver results to the end user at run time, and many still use a relational database as their major backbone.

There are many prominent organizations running large-scale applications which use relational databases along with various Big Data frameworks to satisfy their various business needs.

Summary

I believe that RDBMS is like vanilla ice cream. Everybody loves it and everybody has it. NoSQL and other solutions are like chocolate ice cream or custom ice cream – there is a huge base which loves them and wants them, but not every ice cream maker can make them just right for everyone’s taste. No matter how fancy an ice cream store is, there is always plain vanilla ice cream available there. Just the same, there are always cases and situations in the Big Data story where the traditional relational database is part of the whole story. In real-world scenarios there will always be cases which need relational database concepts and ideology. It is extremely important to accept the relational database as one of the key components of Big Data instead of treating it as a substandard technology.

Ray of Hope – NewSQL

In this module we discussed that there are places where we need ACID compliance from our Big Data application, and NoSQL does not support that out of the box. A new term has been coined for the applications/tools which support most of the properties of a traditional RDBMS while supporting Big Data infrastructure – NewSQL.

Tomorrow

In tomorrow’s blog post we will discuss NewSQL.

Reference: Pinal Dave (http://blog.sqlauthority.com)