Big Data is one of the most popular subjects of recent times, and everybody wants to get started with it. In recent interviews there have been plenty of questions related to Big Data. Here is the most popular question I receive on this subject.
Question: How to get started with Big Data?
Answer: Early last year I wrote a timeless series on the subject of Big Data. Here is the link to the entire series.
Who can predict which customer is going to cancel a newspaper subscription before it actually happens? A soothsayer, right? Wrong! It’s a Data Scientist.
The New York Times recently hired a Chief Data Scientist for this very purpose, to save the publication from its dwindling position in the industry. The new Data Science team in the company is working to create a data model that will predict which customers are going to cancel their subscriptions, based on insights about what makes customers stay and how to retain them. It is not just the New York Times: many companies looking to manage Big Data are now hunting for data scientists to rescue them from this vast ocean of information.
What is Data Science?
Data Scientists are experts who possess multifarious skills to design and develop complex algorithms, models, and visualizations that allow enterprises to extract useful insights from large amounts of data.
Data Science is a discipline that incorporates theories and studies from various fields including:
Also, it’s not just data crunching. This field entails a deep understanding of business challenges, uncovering valuable insights, and communicating these to management for appropriate action.
Hottest Career Options 2014
This career option is gaining popularity and is listed as one of the most wanted skills for 2014. According to a study, 80% of open data scientist positions created in the past two years have not yet been filled. Koenig Solutions offers a one-of-a-kind course on this subject for those who wish to expand their professional horizons. Since it is a relatively new field, training boot camps offering a conventional curriculum are not readily available. However, Koenig has specifically designed this course to include all the essential skills required of a seasoned data scientist.
So, get set to take up the challenge and step forward on the path of an exciting career ahead.
Note: The product used in comparison is ClustrixDB. It is available to download for FREE.
NewSQL databases provide the scale-out of NoSQL without giving up SQL or ACID transactions. While most NewSQL databases focus only on transactions, ClustrixDB also provides fast real-time analytics, which are becoming increasingly important to many businesses. ClustrixDB does this by bringing Massively Parallel Processing (MPP), used in data warehouses, to the primary database.
So I decided to get a workload and try it out to see what kind of performance improvements one can get, if any. Since joins and aggregates are the workhorses of real-time analytics processing, they are a good place to start.
I built a simple dataset with three tables: USERS (100K rows), USER_ADDRESSES (200K rows) and BIDS (10M rows). This dataset has 2GB of data (measured as a mysqldump). For the platform I used AWS and got ClustrixDB from the AWS Marketplace. For comparison, I decided to use MySQL 5.6, since the exact same data and queries can be run on both databases. For both databases, the instance type is m1.xlarge.
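To make the join and aggregate workload a little more concrete, here is a minimal sketch of that kind of query. It uses Python's built-in sqlite3 module as a stand-in, not MySQL or ClustrixDB. The table names USERS, USER_ADDRESSES and BIDS come from the post, but the column names (user_id, city, amount) and the sample rows are assumptions made up for illustration, since the original schema and queries are not shown here.

```python
import sqlite3


def build_sample_db():
    """Create a tiny in-memory stand-in for the USERS / USER_ADDRESSES / BIDS dataset.

    Column names and rows are hypothetical; the real benchmark schema is not shown.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE user_addresses (
            address_id INTEGER PRIMARY KEY, user_id INTEGER, city TEXT);
        CREATE TABLE bids (bid_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
        """
    )
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    conn.executemany(
        "INSERT INTO user_addresses VALUES (?, ?, ?)",
        [(1, 1, "Pune"), (2, 2, "Austin")],
    )
    conn.executemany(
        "INSERT INTO bids VALUES (?, ?, ?)",
        [(1, 1, 10.0), (2, 1, 30.0), (3, 2, 5.0)],
    )
    return conn


def bids_per_city(conn):
    """Join all three tables and aggregate bid count and bid total per city."""
    query = """
        SELECT a.city, COUNT(*) AS bid_count, SUM(b.amount) AS total_amount
        FROM bids AS b
        JOIN users AS u ON u.user_id = b.user_id
        JOIN user_addresses AS a ON a.user_id = u.user_id
        GROUP BY a.city
        ORDER BY a.city
    """
    return conn.execute(query).fetchall()


conn = build_sample_db()
rows = bids_per_city(conn)
for city, bid_count, total_amount in rows:
    print(city, bid_count, total_amount)
```

A query of this shape has to touch every row of the largest table, which is why joins and aggregates are a reasonable place to look for gains from parallel execution.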
MySQL does not scale beyond a single server and is usually deployed with a master and two read slaves. Since ClustrixDB provides horizontal scale-out within one cluster, rather than master-slave (with multiple copies of data), the equivalent configuration is 3 servers. ClustrixDB horizontal scaling allows all nodes to participate in all query types. For measuring performance, a single MySQL server is enough, because the performance of one query will be the same whether we use the master or a read slave.
For ClustrixDB, I also tried out 6 servers to see if analytics get faster as you add servers.
Here is the resulting table:
We see that some queries get significantly faster; however, one query showed no performance improvement. The count query on users counts only 100K rows, so there is likely not enough work to distribute. The count query on the bids table (counting 10M rows) shows a speedup with 3 nodes, but with 6 nodes we don’t get as much additional improvement. This is still a very simple query. The queries with aggregates and joins get significantly faster (23x and 8.79x) on 3 nodes. These queries also get nearly twice as fast as you go from 3-node ClustrixDB to 6-node ClustrixDB, because of MPP in ClustrixDB.
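For readers who want to compute this kind of comparison on their own runs, here is a small sketch of the arithmetic behind a figure such as "23x": the speedup is the baseline time divided by the new time. The timings below are hypothetical placeholders chosen to give round numbers. They are not measurements from this test.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / candidate_seconds


# Hypothetical timings in seconds for one query (NOT results from the post).
single_server_s = 120.0
three_node_s = 8.0
six_node_s = 4.0

vs_single_server = speedup(single_server_s, three_node_s)  # 3 nodes vs. one server
three_to_six = speedup(three_node_s, six_node_s)            # gain from doubling nodes

print(f"3 nodes vs single server: {vs_single_server:.2f}x")
print(f"6 nodes vs 3 nodes: {three_to_six:.2f}x")
```

A three-to-six ratio close to 2 would correspond to the "nearly twice as fast" behaviour described above, while a ratio close to 1 suggests the query does not have enough work to benefit from more nodes.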
Overall, we see that for more complex analytical queries ClustrixDB gains a significant advantage. This means reports will get much faster with ClustrixDB. For some other queries, either there is not enough work or distribution does not offer much advantage, and there the performance is about the same. For real-time analytics requirements, ClustrixDB seems like a good solution.
The ongoing chaos over government agency snooping has ignited a heated debate on the privacy of personal data and its use by government and/or other institutions. It has created a feeling of disapproval and distrust among users. This incident is a lesson for companies that are looking to grow their business using a data-driven approach. According to analysts, the goal of gathering personal information should be to deliver benefits to both parties: the user as well as the data collector (government or business).
Using data the right way is crucial, and companies need to deploy the right software applications and systems to ensure that their efforts are well-directed. However, there are various issues plaguing analysts regarding available software, which are highlighted below.
According to the InformationWeek 2013 Survey of Analytics, Business Intelligence and Information Management, to which 541 business technology professionals responded, the biggest concern was the scarcity of expertise and the high cost associated with it. This concern was voiced by as many as 38% of the participants. A close second was the expense of data warehouse appliance platforms, with 33% of respondents considering it a huge roadblock.
Another revelation was that 31% of professionals weren’t even sure how Data Analytics can create business opportunities for them. Another 17% shared that they found data platform technologies such as Hadoop and NoSQL hard to learn. These results clearly point to awareness and expertise issues that also need much attention. Unless the demand-supply gap for Business Intelligence professionals well versed in data analysis technologies is closed, this divide is going to affect how well companies make the most of their BI campaigns.
One of the key actions that can be taken to salvage the situation is to provide training on Data Analytics concepts. Koenig Solutions offers courses on many such technologies, including a course on MCSE SQL Server 2012: BI Platform. So it’s time to brush up your skills and get down to work in the data-driven world that awaits you.
There are so many things to learn and so little time. Since time is short, we need to be selective about what we learn. I believe I know quite a lot about SQL, but I still do not know everything around SQL. I have recently started to learn about NewSQL. If you wonder what NewSQL is, I encourage you to read my blog post about it: Big Data – Buzz Words: What is NewSQL – Day 10 of 21. NewSQL databases are quickly becoming popular, providing the scale of NoSQL with SQL features and transactions.
As a part of learning about NewSQL databases, I have recently started to learn about ClustrixDB. ClustrixDB has been the most mature NewSQL database, used by some of the largest internet sites in the world for over 3 years, with extensive SQL support. In addition to scale, it provides fast real-time analytics by bringing massively parallel processing (MPP), otherwise available only in warehousing databases, to the transactional database.
The reason I am more intrigued about learning ClustrixDB is their recent announcement on Oct 31. ClustrixDB was previously available only as an appliance, but with the software release on Oct 31, everyone can use it. It is now free forever for up to 12 cores with community support, and there is a 45-day trial for unlimited cluster sizes. With the forever-free offering, I am indeed interested in ClustrixDB now. I know that a few of the leading eCommerce sites in the world use it for their transactional database.
Here are a few of the details I have quickly noted about ClustrixDB. ClustrixDB allows users to:
Scale by simply adding nodes to the cluster with a single command
Run billions of transactions a day
Run fast real-time analytics
Achieve high-availability with recovery from node failure
Easily migrate from MySQL, as it is nearly plug-and-play compatible and works with MySQL drivers, tools and replication.
While I was going through the documentation I realized that ClustrixDB also has extensive support for SQL features, including complex queries involving joins on a dozen or more tables, aggregates, sorts and sub-queries. It also supports stored procedures, triggers, foreign keys, partitioned and temporary tables, and fully online schema changes. It is indeed a very mature product and SQL solution.
ClustrixDB indeed sounds like a very promising solution, so I decided to dig a bit deeper to understand who the current customers of Clustrix are, as the company has existed in the industry for quite a few years. Their client list is indeed very interesting, and here is my quick research about them.
Twoo.com – Europe’s largest social discovery (dating) site runs 4.4 billion transactions a day with table sizes over a terabyte, on a 168-core cluster.
EngageBDR – Top 3 in the online advertising category, uses ClustrixDB to serve 6.9 billion ads a day through its real-time bidding platform. Their reports went from 4 hours to 15 seconds.
NoMoreRack – Top 2 fastest growing e-commerce company in the US, uses ClustrixDB for high availability and fast growth in the Amazon cloud.
MakeMyTrip – India’s leading travel site runs on ClustrixDB with two clusters running as multi-master in Chennai and Bangalore.
Many enterprises such as AOL, CSC, Rakuten, Symantec use ClustrixDB when their applications need scale.
I must admit that I am impressed with the information I have learned so far, and now is the time to get some hands-on experience with the product. I want to learn this technology so that in the future, when the conversation is about NewSQL, I know what I am talking about. Read more: Clustrix explains why ClustrixDB might be the right database for you.
Download ClustrixDB with me today and install it on your machine, so that when we discuss its technical aspects in the future, we are all on the same page.
This guest post is by Vinod Kumar. Vinod Kumar has worked with SQL Server extensively since joining the industry over a decade ago. Working on various versions of SQL Server 7.0, Oracle 7.3 and other database technologies – he now works with the Microsoft Technology Center (MTC) as a Technology Architect.
Let us read the blog post in Vinod’s own voice.
I think the series from Pinal is a good one for anyone planning to start on the Big Data journey from the basics. In my daily customer interactions this buzz of “Big Data” always comes up, and I generally react by saying, “Sir, do you really have a ‘Big Data’ problem or do you have a big Data problem?” Generally, there is a silence in the air when I ask this question. Data is everywhere in organizations, be it big data, small data, all data, and for a few it is bad data, which is the same as no data :). Now, don’t discount me as someone who opposes “Big Data”; I am as much a supporter as I am a critic of the abuse of this term.
In this post, I wanted to let my mind flow so that you can also think in the direction I want you to see these concepts. This is by no means an exhaustive dump of what is in my mind, but you will surely get the drift of how I question Big Data terms from customers!
Is Big Data Relevant to me?
Many of my customers come to me like a blank whiteboard, with no idea of “why Big Data”. They want to jump on the technology bandwagon and decipher insights from their unexplored data, a.k.a. unstructured data, combined with structured data. So what are the industry scenarios that come to mind? Here are some of them:
Fraud detection: Banks and Credit cards are monitoring your spending habits on real-time basis.
Customer Segmentation: applies in every industry from Banking to Retail to Aviation to Utility and others where they deal with end customer who consume their products and services.
Customer Sentiment Analysis: Responding to negative brand perception on social media or amplifying positive perception.
Sales and Marketing Campaign: Understand the impact and get closer to customer delight.
Call Center Analysis: Attempting to take unstructured voice recordings and analyze them for content and sentiment.
Reduce Re-admissions: How to build proactive follow-up engagements with patients.
Patient Monitoring: How to track Inpatient, Out-Patient, Emergency Visits, Intensive Care Units etc.
Preventive Care: Disease identification and risk stratification are crucial business functions in the medical field.
Claims fraud detection: There is no precise dollar figure one can put here, but this is a big thing for the medical field.
Customer Sentiment Analysis, Customer Care Centers, Campaign Management.
Supply Chain Analysis: Sensor and RFID data can be tracked for warehouse space optimization.
Location based marketing: Based on where a check-in happens, retail stores can optimize their marketing.
Price optimization and Plans, Finding Customer churn, Customer loyalty programs
Call Detail Record (CDR) Analysis, Network optimizations, User Location analysis
Customer Behavior Analysis
Fraud Detection & Analysis, Pricing based on customer
Sentiment Analysis, Loyalty Management
Agents Analysis, Customer Value Management
This list can go on to other areas like Utility, Manufacturing, Travel, ITES etc. As you can see, there are interesting use cases for each of these industry verticals. This is just a representative list.
Where to start?
A lot of times I try to quiz customers on a number of dimensions before starting a Big Data conversation.
Are you getting the data you need the way you want it and in a timely manner?
Can you get in and analyze the data you need?
How quickly does IT respond to your BI requests?
How easily can you get at the data that you need to run your business/department/project?
How are you currently measuring your business?
Can you get the data you need to react WITHIN THE QUARTER to impact behaviors to meet your numbers or is it always “rear-view mirror?”
How are you measuring:
Supply Chain Efficiencies
Predictive product / service positioning
What are your key challenges in driving collaboration across your global business? What are the challenges in innovation?
What challenges are you facing in getting more information out of your data?
Note: Garbage in is garbage out. This holds good for all reporting / analytics requirements.
Big Data POCs?
A number of customers get into the realm of setting up a small team to work on Big Data. It is a great start from an understanding point of view, but I tend to ask such customers a number of other questions. Some of these common questions are:
To what degree is your advanced analytics (natural language processing, sentiment analysis, predictive analytics and classification) paired with your Big Data efforts?
Do you have dedicated resources exploring the possibilities of advanced analytics in Big Data for your business line?
Do you plan to employ machine learning technology while doing Advanced Analytics?
How is Social Media being monitored in your organization?
What is your ability to scale in terms of storage and processing power?
Do you have a system in place to sort incoming data in near real time by potential value, data quality, and use frequency?
Do you use event-driven architecture to manage incoming data?
Do you have specialized data services that can accommodate different formats, security, and the management requirements of multiple data sources?
Is your organization currently using or considering in-memory analytics?
To what degree are you able to correlate data from your Big Data infrastructure with that from your enterprise data warehouse?
Have you extended the role of Data Stewards to include ownership of big data components?
Do you prioritize data quality based on the source system (for example, Facebook/Twitter data has lower quality thresholds than radio frequency identification (RFID) data for a tracking system)?
Do your retention policies consider the different legal responsibilities for storing Big Data for a specific amount of time?
Do Data Scientists work in close collaboration with Data Stewards to ensure data quality?
How is access to attributes of Big Data being given out in the organization?
Are roles related to Big Data (Advanced Analyst, Data Scientist) clearly defined?
How involved is risk management in the Big Data governance process?
Is there a set of documented policies regarding Big Data governance?
Is there an enforcement mechanism or approach to ensure that policies are followed?
Who is the key sponsor for your Big Data governance program? (The CIO is best)
Do you have defined policies surrounding the use of social media data for potential employees and customers, as well as the use of customer Geo-location data?
How accessible are complex analytic routines to your user base?
What is the level of involvement with outside vendors and third parties in regard to the planning and execution of Big Data projects?
What programming technologies are utilized by your data warehouse/BI staff when working with Big Data?
These are some of the important questions I ask each customer who is actively evaluating Big Data trends for their organization. These questions give you a sense of direction: where to start, what to use, how to secure, how to analyze and more.
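Two of the questions above, sorting incoming data by potential value, data quality and use frequency, and setting different quality thresholds per source system, can be sketched in a few lines. This is only an illustration of the idea. The Record fields, the 0.0 to 1.0 scoring scale, the threshold numbers and the source names are all hypothetical and would come from your own governance policies.

```python
from dataclasses import dataclass

# Hypothetical minimum quality scores per source (0.0 to 1.0).
# Social data is allowed a lower bar than RFID tracking data.
QUALITY_THRESHOLDS = {"social": 0.3, "rfid": 0.9}
DEFAULT_THRESHOLD = 0.5  # assumed fallback for sources not listed above


@dataclass
class Record:
    source: str
    quality: float          # assumed data quality score
    potential_value: float  # assumed business value score
    use_frequency: float    # assumed score for how often the data is used


def passes_quality(record):
    """Check a record against the quality threshold of its source system."""
    threshold = QUALITY_THRESHOLDS.get(record.source, DEFAULT_THRESHOLD)
    return record.quality >= threshold


def triage(records):
    """Drop records below their source threshold, then rank the rest.

    Ranking is by potential value first, then quality, then use frequency.
    """
    accepted = [r for r in records if passes_quality(r)]
    return sorted(
        accepted,
        key=lambda r: (r.potential_value, r.quality, r.use_frequency),
        reverse=True,
    )


incoming = [
    Record("social", quality=0.4, potential_value=0.2, use_frequency=0.1),
    Record("rfid", quality=0.8, potential_value=0.9, use_frequency=0.5),
    Record("rfid", quality=0.95, potential_value=0.7, use_frequency=0.3),
]
ranked = triage(incoming)
for record in ranked:
    print(record)
```

Note that the RFID record with the highest potential value is rejected because it misses the stricter RFID quality bar, while the lower quality social record is kept. That is exactly the trade-off the source-based threshold question is probing.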
Any Big Data analysis is incomplete without a compelling story. The best way to understand this is to watch the Hans Rosling – Gapminder videos (2:17 to 6:06) about third world myths. Don’t get overwhelmed by the Big Data buzzword; the destination, what your data says, is what is important.
In this blog post, we did not look at any particular Big Data technologies. This is a set of questions one needs to keep in mind when embarking on the Big Data journey. I did write about some of the basics on my blog: Big Data – Big Hype yet Big Opportunity. Do let me know if these questions make sense.
Earlier this month I had a great time writing the Basics of Big Data series. The series received a great response and lots of good comments, and I am going to follow up this basics series with a further in-depth series in the near future. Here is the consolidated blog post where you can find all 21 days of blog posts together. Bookmark this page for future reference.
In yesterday’s blog post we explored various resources related to learning Big Data and in this blog post we will wrap up this 21 day series on Big Data.
I have been exploring various terms and technologies related to Big Data this entire month. It was indeed fun to write about Big Data for 21 days, but the subject of Big Data is much bigger than anyone can cover in 21 days. My first goal was to write about the basics, and I think we have covered that pretty well. During these 21 days I have received many questions and answers related to Big Data. I have covered a few of the questions in this series, and I will cover a few more in the coming months.
Now that we understand the Big Data basics, I am personally going to do the following things next. I thought I would share the list with you, as it will give you a good idea of how to continue the Big Data journey.
Build a schedule to read the various Apache documentation
Watch all Pluralsight Courses
Explore HortonWorks Sandbox
Start building presentations about Big Data – this is a great way to learn something new
Present at User Group meetings on Big Data topics
Write more blog posts about Big Data
I am going to continue learning about Big Data, and I want you to continue learning about Big Data too. Please leave a comment on how you are going to continue learning about Big Data. I will publish all the informative comments on this blog with due credit. I want to end this series with the infographic by UMUC.
In yesterday’s blog post we learned how to become a Data Scientist for Big Data. In this article we will go over various learning resources related to Big Data.
In this series we have covered many of the most essential details about Big Data. At the beginning of this series, I encouraged readers to send me questions. One of the most popular questions is –
“I want to learn more about Big Data. Where can I learn it?”
This is indeed a great question, as there are plenty of resources out there to learn about Big Data, and it is indeed difficult to select just one. Hence I decided to list here a few of the most important resources related to Big Data.
Learn from Pluralsight
Pluralsight is a global leader in high-quality online training for hardcore developers. It has fantastic Big Data courses, and I started to learn about Big Data with the help of Pluralsight. Here are a few of the courses which are directly related to Big Data.
I encourage all of you to start with these video courses, as they cover fantastic fundamentals for learning Big Data.
Learn from Apache
Resources at Apache are the single most authentic learning resources. If you want to learn the fundamentals and go deep into every aspect of Big Data, I believe you must understand the various concepts in Apache’s library. I am pretty impressed with the documentation, and I personally reference it every single day when I work with Big Data. I strongly encourage all of you to bookmark all of the following links for authentic Big Data learning.
Hadoop: The Apache Hadoop® project develops open-source software for reliable, scalable, distributed computing.
Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which include support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heat maps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper: A high-performance coordination service for distributed applications.
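If you have never seen the MapReduce idea that Hadoop is known for, the following toy word count may help before you dive into the documentation. It is a minimal sketch in plain Python that runs on one machine. It does not use any Hadoop API, and the function names (map_phase, shuffle, reduce_phase) are my own illustrative choices. It only shows the shape of the map, shuffle and reduce steps.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle(pairs))


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

In a real cluster the map and reduce steps run in parallel on many machines and the framework handles the shuffle, but the logic you write keeps this same structure.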
Learn from Vendors
One of the biggest issues with learning Big Data is setting up the environment. Every Big Data vendor has different environment requirements, and a lot of things are required to set up a Big Data framework. Many users do not start with Big Data because they are afraid of the resources required to set up the framework, as well as the time commitment. Here Hortonworks has created a fantastic learning environment. They have created a Sandbox with everything one person needs to learn Big Data, and they also provide excellent tutoring along with it. The Sandbox comes with a dozen hands-on tutorials that will guide you through the basics of Hadoop, and it contains the Hortonworks Data Platform as well.
I think Hortonworks did a fantastic job building this Sandbox and Tutorial. Though there are plenty of different Big Data vendors, I have decided to list only Hortonworks due to their unique setup. Please leave a comment if there are any other such platforms to learn Big Data. I will include them here as well.
Learn from Books
There are indeed a few good books out there which one can refer to in order to learn Big Data. Here are a few good books which I have read. I will update the list as I learn more.
If you search on Amazon there are a great many books, but I think the above three books are a great set, and they will give you great ideas about Big Data. Once you go through them, you will have a clear idea about the next step you should follow. You will be capable enough to make the right decision for yourself.
In tomorrow’s blog post we will wrap up this series of Big Data.
In yesterday’s blog post we learned the importance of the analytics in Big Data Story. In this article we will understand how to become a Data Scientist for Big Data Story.
Data Scientist is a new buzzword, and everyone seems to want to become a Data Scientist. Let us go over a few key topics related to Data Scientists in this blog post. First of all, we will understand what a Data Scientist is.
In the new world of Big Data, I see that pretty much everyone wants to become a Data Scientist, and I have already met lots of people who claim that they are Data Scientists. When I ask what their role is, I get a wide variety of answers.
What is a Data Scientist?
Data scientists are the experts who understand various aspects of the business and know how to strategize data to achieve the business goals. They should have a solid foundation in various data algorithms, modeling and statistical methodology.
What do Data Scientists do?
Data scientists understand the data very well. They go beyond the regular data algorithms and build interesting trends from the available data. They innovate and draw entirely new meaning from the existing data. They are artists in the disguise of computer analysts. They look at the data traditionally as well as explore various new ways to look at it.
Data Scientists do not wait to build their solutions from existing data. They think creatively, and they think before the data has entered the system. Data Scientists are visionary experts who understand the business needs and plan ahead of time, which tremendously helps in building solutions at rapid speed.
Besides being data experts, the major quality of Data Scientists is “curiosity”. They always wonder about what more they can get from their existing data and how to get the maximum out of future incoming data.
Data Scientists do wonders with the data, which goes beyond the job descriptions of a Data Analyst or Business Analyst.
Skills Required for Data Scientists
Here are a few of the skills a Data Scientist must have.
Expert level skills with statistical tools like SAS, Excel, R etc.
Understanding Mathematical Models
Hands-on with Visualization Tools like Tableau, PowerPivot, D3.js etc.
Analytical skills to understand business needs
On the technology front, any Data Scientist should know the underlying technologies (like Hadoop and Cloudera) as well as their entire ecosystem (programming languages, analysis and visualization tools etc.).
Remember that to become a successful Data Scientist one requires excellent skills; just having a degree in a relevant field will not suffice.
Data Scientist is indeed a very exciting job profile. As per research, there are not enough Data Scientists in the world to handle the current data explosion. In the near future data is going to expand exponentially, and the need for Data Scientists will increase along with it. It is indeed the job to focus on if you like data and the science of statistics.
In tomorrow’s blog post we will discuss various Big Data learning resources.