SQL SERVER – Introduction to Big Data – Guest Post

BIG Data – such a big word – everybody talks about it nowadays. It is the word in the database world. In one of my conversations, I asked my friend Jasjeet Singh the same question – what is Big Data? He instantly came up with a very effective write-up. Jasjeet is working as a Technical Manager with Koenig Solutions. He leads the SQL domain and holds rich IT industry experience. Talking about Koenig, it is a 19-year-old IT training company that offers several certification choices. Some of its courses include SharePoint training, Project Management certifications, Microsoft training, Business Intelligence programs, Web Design and Development courses, etc.


Big Data, as the name suggests, is about data that is BIG in nature. The data is BIG in terms of size, and it is difficult to manage such enormous data with the relational database management systems that are quite popular these days.

Big Data is not just about being large in size; it is also about the variety of the data, which differs in form or type. Some examples of Big Data are given below:

  • Scientific data related to weather and atmosphere, genetics, etc.
  • Data collected by various medical procedures, such as Radiology, CT scan, MRI, etc.
  • Data related to the Global Positioning System
  • Pictures and videos
  • Radio frequency data
  • Data that changes very rapidly, like stock exchange information

Apart from the difficulties in managing and storing such data, it is also difficult to query, analyze and visualize it.

The characteristics of Big Data can be defined by four Vs:

  1. Volume: It simply means a large volume of data that may span petabytes, exabytes and so on. However, it also varies from organization to organization as to what volume of data they consider to be Big Data.
  2. Variety: As discussed above, Big Data is not limited to relational information or structured data. It can also include unstructured data like pictures, videos, text, audio, etc.
  3. Velocity: Velocity means the speed at which data changes. The higher the velocity, the more efficient the system must be to capture and analyze the data. Missing any important point may lead to wrong analysis or may even result in a loss.
  4. Veracity: It has recently been added as the fourth V, and generally means truthfulness or adherence to the truth. In terms of Big Data, it is more of a challenge than a characteristic. It is difficult to ascertain the truth out of data that is enormous in volume and arrives at high velocity, so there is always a chance of having imprecise and uncertain data. It is a challenging task to clean such data before it is analyzed.

Big Data can be considered the next big thing in the IT sector in terms of innovation and development. If appropriate technologies are developed to analyze and use the information, it can be the driving force for almost all industrial segments, including Retail, Manufacturing, Services, Finance, Healthcare, etc. This will help them automate business decisions, increase productivity, and innovate and develop new products.


Thanks Jasjeet Singh for an excellent write-up. Jasjeet Singh is working as a Technical Manager with Koenig Solutions.

Reference: Pinal Dave (http://blog.SQLAuthority.com)


SQL SERVER – What is Big Data – An Explanation in Simple Words

Decoding the human genome originally took 10 years to process; now it can be achieved in one week – The Economist.

This blog post is written in response to the T-SQL Tuesday post on Big Data. This is a very interesting subject. Data is growing every single day. I remember my first computer, which had a 1 GB hard drive. I had told my dad that I would never need any more hard drive space, and that we were good for the next 10 years. I bought a much larger hard drive after 2 years, and today I have a NAS at home which can hold 2 TB, plus a few file-hosting accounts in the cloud as well. Well, the point is, the amount of data any individual deals with has increased significantly.

There was a time of floppy drives. Today, some autocorrect software does not even recognize that word. However, USB drives, pen drives and jump drives are common names across the industry. It is a race – I really do not know where it will stop.

Big Data

In the same way, the amount of data has grown so wildly that a relational database is not able to handle the processing of this amount of data. Conventional RDBMSs face challenges in processing and analyzing data beyond a certain very large size. Big Data is a large amount of data which is difficult or impossible for a traditional relational database to handle. The current moving target limits for Big Data are terabytes, exabytes and zettabytes.

Hadoop

Hadoop is a software framework which supports data-intensive processes and enables applications to work with Big Data. Technically, it is inspired by MapReduce technology; however, there is a very interesting story behind its name. The creator of Hadoop named it Hadoop because his son's toy elephant was named Hadoop. For the same reason, the logo of Hadoop is a yellow toy elephant.

Two very famous companies use Hadoop to process their large data – Facebook and Yahoo. The Hadoop platform can solve problems where the deep analysis is complex and unstructured but needs to be done in a reasonable time.

Hadoop is architected to run on a large number of machines in a 'shared nothing' architecture. All the independent servers can be put to use by Hadoop technology. Hadoop maintains and manages the data among all the independent servers. An individual user cannot directly gain access to the data, as the data is divided among these servers. Additionally, a single piece of data can be replicated on multiple servers, which gives availability of the data in case of a disaster or a single machine failure. Hadoop uses the MapReduce software framework to return unified data.
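
To make the idea of dividing and replicating data across independent servers a bit more concrete, here is a minimal Python sketch. It is only a conceptual illustration – the node names, replication factor and hashing scheme are made up for this example, and this is not how Hadoop's HDFS is actually implemented.

```python
# Conceptual illustration of how records could be partitioned across
# independent "shared nothing" nodes, with each record replicated on
# more than one node for availability. NOT real HDFS code.

NODES = ["node1", "node2", "node3", "node4"]   # hypothetical servers
REPLICATION_FACTOR = 2                         # keep 2 copies of each record

def place(record_key):
    """Pick REPLICATION_FACTOR distinct nodes for a record."""
    start = hash(record_key) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

storage = {node: [] for node in NODES}
for key in ["order-1001", "order-1002", "order-1003", "order-1004"]:
    for node in place(key):
        storage[node].append(key)

for node, keys in storage.items():
    print(node, keys)   # each record shows up on two different nodes
```

Because every record lives on more than one node, losing any single machine does not lose the data – which is the availability point made above.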

MapReduce

Conceptually, this technology is much simpler, but it is very powerful when put together with the Hadoop framework. There are two major steps: 1) Map 2) Reduce.

In the Map step, the master node takes the input, divides it into smaller chunks and distributes them to the worker nodes. In the Reduce step, it collects all the small solutions to the problem and returns them as output in one unified answer. Both of these steps use functions which rely on key-value pairs. This process runs on the various nodes in parallel and brings faster results to the framework.
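
As a rough illustration of the key-value idea, here is a classic word count written as a plain Python sketch. This is not Hadoop's actual API – real Hadoop jobs are normally written in Java or through higher-level tools – it only simulates the Map and Reduce steps in a single process.

```python
from collections import defaultdict

# Map step: emit a (key, value) pair for every word in a chunk of text.
def map_words(chunk):
    return [(word.lower(), 1) for word in chunk.split()]

# Reduce step: combine all values that share the same key into one result.
def reduce_counts(pairs):
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# In a real cluster each chunk would be mapped on a different worker node
# in parallel; here we simply run the chunks one after another.
chunks = ["big data is big", "data grows every single day"]
mapped = [pair for chunk in chunks for pair in map_words(chunk)]
print(reduce_counts(mapped))   # {'big': 2, 'data': 2, 'is': 1, ...}
```

In a real cluster, the framework also groups the emitted pairs by key (the shuffle) before handing them to the Reduce step, so each reducer sees all the values for the keys it owns.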

Pigs and Hives

Pig is a high-level platform for creating MapReduce programs to be used with Hadoop. Hive is a data warehouse infrastructure built on top of Hadoop for analysis and aggregation (summarization) of the data. Both of these are compiled down into MapReduce jobs. Pig is a procedural language where one describes the procedures to apply to data on Hadoop, whereas Hive is a SQL-like declarative language. Yahoo uses both Pig and Hive in their Hadoop toolkit. Here is an excellent resource from Lars George where he has compared both of them in detail.

Microsoft and Big Data

Microsoft is committed to making Hadoop accessible to a broader class of end users, developers and IT professionals. You can accelerate your Hadoop deployment through the simplicity of Hadoop on Windows and the use of familiar Microsoft products.

  • Apache Hadoop connector for Microsoft SQL Server
  • Apache Hadoop connector for Microsoft Parallel Data Warehouse

Here is the link for further reading.

Most Important

I cannot end this blog post without talking about the one man from whom I heard about Big Data for the very first time.

Pinal Dave with Dr. David DeWitt

… and of course – Happy Valentine's Day!

Reference: Pinal Dave (http://blog.sqlauthority.com)