Big Data – Real-Time Analytics Performance with ClustrixDB

Note: The product used in comparison is ClustrixDB. It is available to download for FREE.

NewSQL databases provide the scale-out of NoSQL without giving up SQL or ACID transactions. While most NewSQL databases focus only on transactions, ClustrixDB also provides fast real-time analytics, which are becoming increasingly important to many businesses. ClustrixDB does this by bringing Massively Parallel Processing (MPP), a technique used in data warehouses, to the primary database.

So I decided to get a workload and try it out to see what kind of performance improvement one can get, if any. Since joins and aggregates are the workhorses of real-time analytics processing, they are a good place to start.

Configuration

I built a simple dataset with three tables: USERS (100K rows), USER_ADDRESSES (200K rows) and BIDS (10M rows). The dataset has 2GB of data (measured as a mysqldump). For the platform I used AWS and got ClustrixDB from the AWS Marketplace. For comparison, I decided to use MySQL 5.6, since the exact same data and queries can be run on both databases. For both databases, the instance types are m1.xlarge.
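To make the shape of the workload concrete, here is a minimal sketch of a scaled-down version of such a dataset, with one query of each kind discussed in this post. The column names, the city values and the query text are assumptions made for illustration (the post does not list the actual schema or queries), and it uses Python's built-in sqlite3 module rather than MySQL or ClustrixDB so that it runs anywhere.

```python
import random
import sqlite3
import time

# Hypothetical city values, used only to give the join something to group by.
CITIES = ["Austin", "Boston", "Chicago", "Denver"]


def build_dataset(conn, n_users=100, addresses_per_user=2, bids_per_user=100, seed=42):
    """Create a scaled-down USERS / USER_ADDRESSES / BIDS dataset.

    The ratios follow the post (2 addresses and 100 bids per user);
    the columns themselves are assumed, not taken from the original test.
    """
    rng = random.Random(seed)
    conn.executescript(
        """
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE user_addresses (
            address_id INTEGER PRIMARY KEY,
            user_id INTEGER REFERENCES users(user_id),
            city TEXT
        );
        CREATE TABLE bids (
            bid_id INTEGER PRIMARY KEY,
            user_id INTEGER REFERENCES users(user_id),
            amount REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO users (user_id, name) VALUES (?, ?)",
        [(u, f"user{u}") for u in range(n_users)],
    )
    conn.executemany(
        "INSERT INTO user_addresses (user_id, city) VALUES (?, ?)",
        [(u, rng.choice(CITIES)) for u in range(n_users) for _ in range(addresses_per_user)],
    )
    conn.executemany(
        "INSERT INTO bids (user_id, amount) VALUES (?, ?)",
        [(u, round(rng.uniform(1, 500), 2)) for u in range(n_users) for _ in range(bids_per_user)],
    )
    conn.commit()


# One illustrative query per category: simple counts, an aggregate, and a join.
QUERIES = {
    "count_users": "SELECT COUNT(*) FROM users",
    "count_bids": "SELECT COUNT(*) FROM bids",
    "aggregate": "SELECT user_id, COUNT(*), SUM(amount) FROM bids GROUP BY user_id",
    "join": (
        "SELECT a.city, COUNT(*) FROM bids b "
        "JOIN user_addresses a ON a.user_id = b.user_id GROUP BY a.city"
    ),
}


def run_queries(conn):
    """Run every query once and return {name: (rows, elapsed_seconds)}."""
    results = {}
    for name, sql in QUERIES.items():
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        results[name] = (rows, time.perf_counter() - start)
    return results


conn = sqlite3.connect(":memory:")
build_dataset(conn)
results = run_queries(conn)
for name, (rows, elapsed) in results.items():
    print(f"{name}: {len(rows)} row(s) in {elapsed:.4f}s")
```

The timings this prints say nothing about MySQL or ClustrixDB; the sketch only shows the kind of tables and queries being compared.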

MySQL does not scale beyond a single server and is usually deployed with a master and two read slaves. Since ClustrixDB provides horizontal scale-out within one cluster, rather than master-slave replication (with multiple copies of the data), the equivalent configuration is 3 servers. ClustrixDB's horizontal scaling allows all nodes to participate in all query types. For measuring performance, a single MySQL server is enough, because the performance of one query will be the same whether we use the master or a read slave.
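The general idea of all nodes participating in one query can be sketched in a few lines. This is only a generic illustration of MPP-style aggregation and not a description of how ClustrixDB is implemented: the modulo placement rule, the function names and the sample rows are all assumptions. Rows are spread over the nodes, each node aggregates its own slice, and the partial results are merged at the end.

```python
from collections import defaultdict


def partition(rows, n_nodes):
    """Spread rows across nodes by bid_id (a hypothetical placement rule)."""
    nodes = [[] for _ in range(n_nodes)]
    for bid_id, user_id, amount in rows:
        nodes[bid_id % n_nodes].append((bid_id, user_id, amount))
    return nodes


def local_aggregate(node_rows):
    """Work done on one node: COUNT(*) and SUM(amount) per user for its own rows."""
    partial = defaultdict(lambda: [0, 0])
    for _, user_id, amount in node_rows:
        partial[user_id][0] += 1
        partial[user_id][1] += amount
    return partial


def merge(partials):
    """Combine the per-node partial aggregates into the final result."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for user_id, (count, amount_sum) in partial.items():
            total[user_id][0] += count
            total[user_id][1] += amount_sum
    return {user_id: tuple(values) for user_id, values in total.items()}


def mpp_group_by(rows, n_nodes):
    """Equivalent of: SELECT user_id, COUNT(*), SUM(amount) FROM bids GROUP BY user_id."""
    return merge(local_aggregate(node_rows) for node_rows in partition(rows, n_nodes))


# Made-up sample rows: (bid_id, user_id, amount). Integer amounts keep the sums exact.
rows = [(bid_id, bid_id % 5, 10 + bid_id % 7) for bid_id in range(1000)]
single_node = mpp_group_by(rows, 1)
three_nodes = mpp_group_by(rows, 3)
six_nodes = mpp_group_by(rows, 6)
```

In this sketch the nodes run one after another; in a real cluster each local aggregation would run on its own server at the same time, which is where the speedup on large tables would come from.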

For ClustrixDB, I also tried out 6 servers to see if analytics get faster as you add servers.

Here is the resulting table:

Results

We see that some queries get significantly faster, but one query showed no performance improvement. The count query on users only counts 100K rows, so there is likely not enough work to distribute. The count query on the bids table (counting 10M rows) shows a speedup with 3 nodes, but with 6 nodes we don’t get as much additional improvement. This is still a very simple query. The queries with aggregates and joins get significantly faster (23x and 8.79x) on 3 nodes. These queries also get nearly twice as fast as you go from 3-node ClustrixDB to 6-node ClustrixDB; this is because of MPP in ClustrixDB.
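For readers who want to redo this arithmetic on their own measurements, here is a small sketch of how a speedup figure such as 23x and a "nearly twice as fast" scaling figure can be computed. It assumes the speedups are measured against the single MySQL server. The timings in the example are invented placeholders chosen only to illustrate the formulas; they are not the measured values from this test.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def scaling_efficiency(time_small, nodes_small, time_large, nodes_large):
    """Fraction of the ideal speedup achieved when growing the cluster.

    1.0 means perfectly linear scaling (doubling the nodes halves the time).
    """
    ideal = nodes_large / nodes_small
    return speedup(time_small, time_large) / ideal


# Placeholder timings in seconds (hypothetical, NOT the results of this test).
mysql_time = 46.0
clustrix_3_nodes = 2.0
clustrix_6_nodes = 1.1

vs_mysql = speedup(mysql_time, clustrix_3_nodes)
efficiency = scaling_efficiency(clustrix_3_nodes, 3, clustrix_6_nodes, 6)
print(f"3 nodes vs MySQL: {vs_mysql:.2f}x")
print(f"3 -> 6 nodes scaling efficiency: {efficiency:.0%}")
```

An efficiency close to 100% corresponds to the "nearly twice as fast" behavior described above, while a query with too little work to distribute would show an efficiency closer to 50% when the node count doubles.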

Overall, we see that ClustrixDB gains a significant advantage on more complex analytical queries. This means reports will get much faster with ClustrixDB. For some other queries, there is not enough work, or distributing the work does not offer much advantage, and there the performance is about the same. For real-time analytics requirements, ClustrixDB seems like a good solution.


Reference: Pinal Dave (http://blog.sqlauthority.com)


SQL – Biggest Concerns in a Data-Driven World

The ongoing chaos over a government agency’s snooping has ignited a heated debate on the privacy of personal data and its use by government and other institutions. It has created a feeling of disapproval and distrust among users. This incident is a lesson for companies that are looking to grow their business using a data-driven approach. According to analysts, the goal of gathering personal information should be to deliver benefits to both parties: the user as well as the data collector (government or business).

Using data the right way is crucial, and companies need to deploy the right software applications and systems to ensure that their efforts are well-directed. However, there are various issues plaguing analysts regarding available software, which are highlighted below.

According to an InformationWeek 2013 Survey of Analytics, Business Intelligence and Information Management, to which 541 business technology professionals responded, the biggest concern was the scarcity of expertise and the high cost associated with it. This concern was voiced by as many as 38% of the participants. A close second was the expense of data warehouse appliance platforms, with 33% of respondents considering it a huge roadblock.

Another revelation was that 31% of professionals weren’t even sure how Data Analytics can create business opportunities for them. Another 17% shared that they found data platform technologies such as Hadoop and NoSQL hard to learn. These results clearly point to awareness and expertise issues that also need attention. Unless the demand-supply gap for Business Intelligence professionals well versed in data analysis technologies is closed, this divide is going to affect how well companies make the most of their BI campaigns.

One of the key actions that can be taken to salvage the situation is to provide training on Data Analytics concepts. Koenig Solutions offers courses on many such technologies, including a course on MCSE SQL Server 2012: BI Platform. So it’s time to brush up your skills and get down to work in the data-driven world that awaits you.


Big Data – ClustrixDB – Extreme Scale SQL Database with Real-time Analytics, Releases Software Download – NewSQL

There are so many things to learn and so little time. Because time is short, we need to be selective about what we learn. I believe I know quite a lot about SQL, but I still do not know what is around SQL. I have recently started to learn about NewSQL. If you wonder what NewSQL is, I encourage you to read my blog post about it: Big Data – Buzz Words: What is NewSQL – Day 10 of 21. NewSQL databases are quickly becoming popular, providing the scale of NoSQL with SQL features and transactions.

As part of learning about NewSQL databases, I have recently started to learn about ClustrixDB. ClustrixDB has been the most mature NewSQL database, used by some of the largest internet sites in the world for over 3 years, with extensive SQL support. In addition to scale, it provides fast real-time analytics by bringing massively parallel processing (MPP), previously available only in warehousing databases, to the transactional database.

The reason I am more intrigued about learning ClustrixDB is their recent announcement on Oct 31. ClustrixDB was previously available only as an appliance, but with the software release on Oct 31, everyone can use it. It is now available forever free for up to 12 cores with community support, and there is a 45-day trial for unlimited cluster sizes. With the forever-free offering, I am indeed interested in ClustrixDB now. I know that a few of the leading eCommerce sites in the world use it for their transactional database.

Here are a few of the details I have quickly noted about ClustrixDB. ClustrixDB allows users to:

  • Scale by simply adding nodes to the cluster with a single command
  • Run billions of transactions a day
  • Run fast real-time analytics
  • Achieve high-availability with recovery from node failure
  • Let the database manage itself
  • Easily migrate from MySQL, as it is nearly plug-and-play compatible and works with MySQL drivers, tools and replication

While I was going through the documentation I realized that ClustrixDB also has extensive support for SQL features, including complex queries involving joins on a dozen or more tables, aggregates, sorts and sub-queries. It also supports stored procedures, triggers, foreign keys, partitioned and temporary tables, and fully online schema changes. It is indeed a very mature product and SQL solution.

Since ClustrixDB sounds like a very promising solution, I decided to dig a bit deeper to understand who Clustrix's current customers are, as the company has been in the industry for quite a few years. Their client list is indeed very interesting, and here is my quick research about them.

  1. Twoo.com – Europe’s largest social discovery (dating) site runs 4.4 Billion Transactions a day with table sizes over a Terabyte, on a 168 core cluster.
  2. EngageBDR – Top 3 in the online advertising category uses ClustrixDB to serve 6.9 billion ads a day through real-time bidding platform. Their reports went from 4 hours to 15 seconds.
  3. NoMoreRack – One of the top 2 fastest growing e-commerce companies in the US uses ClustrixDB for high availability and fast growth on the Amazon cloud.
  4. MakeMyTrip – India’s leading travel site runs on ClustrixDB with two clusters running as multi-master in Chennai and Bangalore.
  5. Many enterprises such as AOL, CSC, Rakuten, Symantec use ClustrixDB when their applications need scale.

I must accept that I am impressed with the information I have learned so far, and now is the time to get some hands-on experience with the product. I want to learn this technology so that in future, when the talk is about NewSQL, I know what I am talking about. Read Clustrix's explanation of why ClustrixDB might be the right database for you.

Download ClustrixDB with me today and install it on your machine so that in future, when we discuss its technical aspects, we are all on the same page.

The software can be downloaded here.


This blog post is written by Nupur Dave as a guest post. Nupur will share her new learning on this blog.


Big Data – Is Big Data Relevant to me? – Big Data Questionnaires – Guest Post by Vinod Kumar

This guest post is by Vinod Kumar. Vinod Kumar has worked with SQL Server extensively since joining the industry over a decade ago. Having worked on various versions of SQL Server 7.0, Oracle 7.3 and other database technologies, he now works with the Microsoft Technology Center (MTC) as a Technology Architect.

Let us read the blog post in Vinod’s own voice.


I think the series from Pinal is a good one for anyone planning to start on the Big Data journey from the basics. In my daily customer interactions this buzz of “Big Data” always comes up, and I generally react by saying, “Sir, do you really have a ‘Big Data’ problem or do you have a big Data problem?” Generally, there is silence in the air when I ask this question. Data is everywhere in organizations, be it big data, small data or all data, and for a few it is bad data, which is the same as no data :). Now, don’t discount me as someone who opposes “Big Data”; I am as much a big supporter as I am a critic of the abuse of this term.

In this post, I wanted to let my mind flow so that you can also think in the direction I want you to see these concepts. In any case, this is not an exhaustive dump of what is in my mind – but you will surely get the drift how I am going to question Big Data terms from customers!!!

Is Big Data Relevant to me?

Many of my customers come to me like a blank whiteboard, with no idea of “why Big Data”. They want to jump on the technology bandwagon and decipher insights from their unexplored data, a.k.a. unstructured data, together with their structured data. So what are the industry scenarios that come to mind? Here are some of them:

Financials

  • Fraud detection: Banks and Credit cards are monitoring your spending habits on real-time basis.
  • Customer Segmentation: applies in every industry from Banking to Retail to Aviation to Utility and others where they deal with end customer who consume their products and services.
  • Customer Sentiment Analysis: Responding to negative brand perception on social media or amplifying the positive perception.
  • Sales and Marketing Campaign: Understand the impact and get closer to customer delight.
  • Call Center Analysis: attempt to take unstructured voice recordings and analyze them for content and sentiment.

Medical

  • Reduce Re-admissions: How to build proactive follow-up engagements with patients.
  • Patient Monitoring: How to track Inpatient, Out-Patient, Emergency Visits, Intensive Care Units etc.
  • Preventive Care: Disease identification and Risk stratification are very crucial business functions for the medical field.
  • Claims fraud detection: There is no precise dollar figure that one can put here, but this is a big thing for the medical field.

Retail

  • Customer Sentiment Analysis, Customer Care Centers, Campaign Management.
  • Supply Chain Analysis: All sensor and RFID data can be tracked for warehouse space optimization.
  • Location based marketing: Based on where a check-in happens, retail stores can optimize their marketing.

Telecom

  • Price optimization and Plans, Finding Customer churn, Customer loyalty programs
  • Call Detail Record (CDR) Analysis, Network optimizations, User Location analysis
  • Customer Behavior Analysis

Insurance

  • Fraud Detection & Analysis, Pricing based on customer
  • Sentiment Analysis, Loyalty Management
  • Agents Analysis, Customer Value Management

This list can go on to other areas like Utility, Manufacturing, Travel, ITES etc. So as you can see, there are obviously interesting use cases for each of these industry verticals. This is just a representative list.

Where to start?

A lot of times I try to quiz customers on a number of dimensions before starting a Big Data conversation.

  • Are you getting the data you need the way you want it and in a timely manner?
  • Can you get in and analyze the data you need?
  • How quickly does IT respond to your BI requests?
  • How easily can you get at the data that you need to run your business/department/project?
  • How are you currently measuring your business?
  • Can you get the data you need to react WITHIN THE QUARTER to impact behaviors to meet your numbers or is it always “rear-view mirror?”
  • How are you measuring:
    • The Brand
    • Customer Sentiment
    • Your Competition
    • Your Pricing
    • Your performance
    • Supply Chain Efficiencies
    • Predictive product / service positioning
  • What are your key challenges in driving collaboration across your global business? What are the challenges in innovation?
  • What challenges are you facing in getting more information out of your data?

Note: Garbage-in is Garbage-out. This holds good for all reporting / analytics requirements.

Big Data POCs?

A number of customers get into the realm of setting up a small team to work on Big Data. It is a great start from an understanding point of view, but I tend to ask a number of other questions of such customers. Some of these common questions are:

  1. To what degree is your advanced analytics (natural language processing, sentiment analysis, predictive analytics and classification) paired with your Big Data efforts?
  2. Do you have dedicated resources exploring the possibilities of advanced analytics in Big Data for your business line?
  3. Do you plan to employ machine learning technology while doing Advanced Analytics?
  4. How is Social Media being monitored in your organization?
  5. What is your ability to scale in terms of storage and processing power?
  6. Do you have a system in place to sort incoming data in near real time by potential value, data quality, and use frequency?
  7. Do you use event-driven architecture to manage incoming data?
  8. Do you have specialized data services that can accommodate different formats, security, and the management requirements of multiple data sources?
  9. Is your organization currently using or considering in-memory analytics?
  10. To what degree are you able to correlate data from your Big Data infrastructure with that from your enterprise data warehouse?
  11. Have you extended the role of Data Stewards to include ownership of big data components?
  12. Do you prioritize data quality based on the source system (that is Facebook/Twitter data has lower quality thresholds than radio frequency identification (RFID) for a tracking system)?
  13. Do your retention policies consider the different legal responsibilities for storing Big Data for a specific amount of time?
  14. Do Data Scientists work in close collaboration with Data Stewards to ensure data quality?
  15. How is access to attributes of Big Data being given out in the organization?
  16. Are roles related to Big Data (Advanced Analyst, Data Scientist) clearly defined?
  17. How involved is risk management in the Big Data governance process?
  18. Is there a set of documented policies regarding Big Data governance?
  19. Is there an enforcement mechanism or approach to ensure that policies are followed?
  20. Who is the key sponsor for your Big Data governance program? (The CIO is best)
  21. Do you have defined policies surrounding the use of social media data for potential employees and customers, as well as the use of customer Geo-location data?
  22. How accessible are complex analytic routines to your user base?
  23. What is the level of involvement with outside vendors and third parties in regard to the planning and execution of Big Data projects?
  24. What programming technologies are utilized by your data warehouse/BI staff when working with Big Data?
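
Some of these questions are easier to discuss with a concrete picture in mind. As one example, question 12 (prioritizing data quality by source system) could look something like the sketch below. The source names, the threshold values and the quality score are all hypothetical and chosen only for illustration; a real project would define its own.

```python
from dataclasses import dataclass

# Hypothetical per-source quality thresholds: social media data is allowed a
# lower bar than RFID readings that feed a tracking system.
QUALITY_THRESHOLDS = {"social": 0.5, "rfid": 0.95}
DEFAULT_THRESHOLD = 0.8  # assumed fallback for sources not listed above


@dataclass
class Record:
    source: str
    quality: float  # score between 0 and 1 from some upstream check (assumed)
    payload: str


def accept(record):
    """Return True when the record meets the threshold for its source."""
    threshold = QUALITY_THRESHOLDS.get(record.source, DEFAULT_THRESHOLD)
    return record.quality >= threshold


def triage(records):
    """Split records into (accepted, rejected) using the per-source thresholds."""
    accepted, rejected = [], []
    for record in records:
        (accepted if accept(record) else rejected).append(record)
    return accepted, rejected


sample = [
    Record("social", 0.6, "tweet"),
    Record("rfid", 0.6, "tag read"),
    Record("crm", 0.9, "customer row"),
]
accepted, rejected = triage(sample)
```

Here the same quality score of 0.6 is good enough for a social media record but not for an RFID reading, which is the kind of policy decision the question is probing for.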

These are some of the important questions I ask each customer who is actively evaluating Big Data trends for their organizations. These questions give you a sense of direction where to start, what to use, how to secure, how to analyze and more.

Sign off

Any Big Data analysis is incomplete without a compelling story. The best way to understand this is to watch Hans Rosling’s Gapminder videos (2:17 to 6:06) about third world myths. Don’t get overwhelmed by the Big Data buzzword; what matters is the destination your data is pointing to.

In this blog post, we did not look at any particular Big Data technologies. This is a set of questions one needs to keep in mind when embarking on the Big Data journey. I did write about some of the basics in my blog: Big Data – Big Hype yet Big Opportunity. Do let me know if these questions make sense.


Big Data – Learning Basics of Big Data in 21 Days – Bookmark

Earlier this month I had a great time writing the Basics of Big Data series. The series received a great response and lots of good comments, and I am going to follow it up with a more in-depth series in the near future. Here is the consolidated blog post where you can find all 21 days of blog posts together. Bookmark this page for future reference.

Big Data – Beginning Big Data – Day 1 of 21

Big Data – What is Big Data – 3 Vs of Big Data – Volume, Velocity and Variety – Day 2 of 21

Big Data – Evolution of Big Data – Day 3 of 21

Big Data – Basics of Big Data Architecture – Day 4 of 21

Big Data – Buzz Words: What is NoSQL – Day 5 of 21

Big Data – Buzz Words: What is Hadoop – Day 6 of 21

Big Data – Buzz Words: What is MapReduce – Day 7 of 21

Big Data – Buzz Words: What is HDFS – Day 8 of 21

Big Data – Buzz Words: Importance of Relational Database in Big Data World – Day 9 of 21

Big Data – Buzz Words: What is NewSQL – Day 10 of 21

Big Data – Role of Cloud Computing in Big Data – Day 11 of 21

Big Data – Operational Databases Supporting Big Data – RDBMS and NoSQL – Day 12 of 21

Big Data – Operational Databases Supporting Big Data – Key-Value Pair Databases and Document Databases – Day 13 of 21

Big Data – Operational Databases Supporting Big Data – Columnar, Graph and Spatial Database – Day 14 of 21

Big Data – Data Mining with Hive – What is Hive? – What is HiveQL (HQL)? – Day 15 of 21

Big Data – Interacting with Hadoop – What is PIG? – What is PIG Latin? – Day 16 of 21

Big Data – Interacting with Hadoop – What is Sqoop? – What is Zookeeper? – Day 17 of 21

Big Data – Basics of Big Data Analytics – Day 18 of 21

Big Data – How to become a Data Scientist and Learn Data Science? – Day 19 of 21

Big Data – Various Learning Resources – How to Start with Big Data? – Day 20 of 21

Big Data – Final Wrap and What Next – Day 21 of 21


Big Data – Final Wrap and What Next – Day 21 of 21

In yesterday’s blog post we explored various resources related to learning Big Data and in this blog post we will wrap up this 21 day series on Big Data.

I have been exploring various terms and technologies related to Big Data this entire month. It was indeed fun to write about Big Data in 21 days, but the subject of Big Data is much bigger than anyone can cover in 21 days. My first goal was to write about the basics, and I think we have got that one covered pretty well. During these 21 days I have received many questions and answers related to Big Data. I have covered a few of the questions in this series, and I will be covering a few more in the coming months.

Now that we understand the basics of Big Data, here is the list of things I am personally going to do next. I thought I would share it with you, as it will give you a good idea of how to continue the Big Data journey.

  • Build a schedule to read the various Apache documentation
  • Watch all Pluralsight Courses
  • Explore HortonWorks Sandbox
  • Start building presentation about Big Data – this is a great way to learn something new
  • Present in User Groups Meetings on Big Data Topics
  • Write more blog posts about Big Data

I am going to continue learning about Big Data – I want you to continue learning Big Data. Please leave a comment how you are going to continue learning about Big Data. I will publish all the informative comments on this blog with due credit. I want to end this series with the infographic by UMUC.


Big Data – Various Learning Resources – How to Start with Big Data? – Day 20 of 21

In yesterday’s blog post we learned how to become a Data Scientist for Big Data. In this article we will go over various learning resources related to Big Data.

In this series we have covered many of the most essential details about Big Data. At the beginning of this series, I encouraged readers to send me questions. One of the most popular questions is -

“I want to learn more about Big Data. Where can I learn it?”

This is indeed a great question, as there are plenty of resources out there for learning about Big Data, and it is difficult to settle on one of them. Hence I decided to list here a few of the most important resources related to Big Data.

Learn from Pluralsight

Pluralsight is a global leader in high-quality online training for hardcore developers. It has fantastic Big Data courses, and I started to learn about Big Data with the help of Pluralsight. Here are a few of the courses which are directly related to Big Data.

I encourage all of you to start with these video courses, as they cover the fundamentals of Big Data fantastically well.

Learn from Apache

The resources at Apache are the single most authentic source of learning material. If you want to learn the fundamentals and go deep into every aspect of Big Data, I believe you must understand the various concepts in Apache’s library. I am pretty impressed with the documentation, and I personally reference it every single day when I work with Big Data. I strongly encourage all of you to bookmark all of the following links for authentic Big Data learning.

  • Hadoop: The Apache Hadoop® project develops open-source software for reliable, scalable, distributed computing.
  • Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which include support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heat maps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro: A data serialization system.
  • Cassandra: A scalable multi-master database with no single points of failure.
  • Chukwa: A data collection system for managing large distributed systems.
  • HBase: A scalable, distributed database that supports structured data storage for large tables.
  • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout: A scalable machine learning and data mining library.
  • Pig: A high-level data-flow language and execution framework for parallel computation.
  • ZooKeeper: A high-performance coordination service for distributed applications.

Learn from Vendors

One of the biggest issues with learning Big Data is setting up the environment. Every Big Data vendor has different environment requirements, and a lot of things are required to set up a Big Data framework. Many users do not start with Big Data because they are afraid of the resources and time commitment required to set up the framework. Here Hortonworks has created a fantastic learning environment. They have created a Sandbox with everything one needs to learn Big Data and have also provided excellent tutorials along with it. The Sandbox comes with a dozen hands-on tutorials that will guide you through the basics of Hadoop, and it also contains the Hortonworks Data Platform.

I think Hortonworks did a fantastic job building this Sandbox and its tutorials. Though there are plenty of different Big Data vendors, I have decided to list only Hortonworks due to their unique setup. Please leave a comment if there are any other such platforms for learning Big Data. I will include them here as well.

Learn from Books

There are indeed a few good books out there which one can refer to in order to learn Big Data. Here are a few good books which I have read. I will update the list as I learn more.

If you search on Amazon there are millions of books, but I think the above three books are a great set and will give you great ideas about Big Data. Once you go through them, you will have a clear idea of the next step you should take in this journey. You will be capable enough to make the right decision for yourself.

Tomorrow

In tomorrow’s blog post we will wrap up this series on Big Data.
