Let us start with a very interesting quote about Big Data.
Decoding the human genome originally took 10 years; now it can be achieved in one week. – The Economist
This blog post is written in response to the T-SQL Tuesday post on Big Data. This is a very interesting subject. Data is growing every single day. I remember my first computer, which had a 1 GB hard drive. I told my dad that I would never need more storage and that we were good for the next 10 years. Within two years I had bought a much larger hard drive, and today I have a NAS at home which can hold 2 TB, along with a few file hosting accounts in the cloud as well. The point is, the amount of data any individual deals with has increased significantly.
There was a time of floppy drives. Today, some autocorrect software does not even recognize that word, while USB drives, pen drives, and jump drives are common names across the industry. It is a race, and I really do not know where it will stop.
Big Data
In the same way, the amount of data has grown so wildly that a relational database is often unable to handle its processing. A conventional RDBMS faces challenges in processing and analyzing data beyond a certain very large size. Big Data is an amount of data that is difficult or impossible for a traditional relational database to handle. The limit is a moving target, currently measured in terabytes, exabytes, and zettabytes.
Hadoop
Hadoop is a software framework which supports data-intensive processes and enables applications to work with Big Data. Technically, it is inspired by the MapReduce technology; however, there is a very interesting story behind its name. The creator of Hadoop named it after his son’s toy elephant, which was called Hadoop. For the same reason, the logo of Hadoop is a yellow toy elephant.
Two very famous companies use Hadoop to process their large data: Facebook and Yahoo. The Hadoop platform can solve problems where a deeper analysis of complex, unstructured data is needed, but it has to be done in a reasonable time.
Hadoop is architected to run on a large number of machines in a ‘shared nothing’ architecture. Any set of independent servers can be put to use by Hadoop, and the Hadoop technology maintains and manages the data among all of them. Individual users cannot directly access the data, as it is divided among these servers. Additionally, a single piece of data can be replicated on multiple servers, which keeps the data available in case of a disaster or a single machine failure. Hadoop uses the MapReduce software framework to return unified results.
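On the availability point, the number of copies that HDFS (Hadoop’s distributed file system) keeps of each block of data is a simple configuration setting. Here is a minimal sketch of hdfs-site.xml with an illustrative value; 3 is the usual default:

```xml
<!-- hdfs-site.xml (sketch): dfs.replication sets how many independent
     servers keep a copy of each block of data. With a value of 3, any
     single machine failure still leaves two live copies. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```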
MapReduce
Conceptually, this technology is quite simple, but it is very powerful when put together with the Hadoop framework. There are two major steps: 1) Map and 2) Reduce.
In the Map step, the master node takes the input, divides it into smaller chunks, and distributes them to the worker nodes. In the Reduce step, it collects all the small solutions to the problem and returns them as output in one unified answer. Both of these steps use functions which rely on key-value pairs. This process runs on the various nodes in parallel, which brings faster results for the framework.
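To make the two steps concrete, here is a minimal sketch of the classic word-count example using the Hadoop Java MapReduce API. The class and variable names are my own illustration, and the job driver (input/output paths, job submission) is omitted:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: each worker node turns its chunk of the input into (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit one key-value pair per word
            }
        }
    }
}

// Reduce step: all counts for the same word arrive together and are summed
// into one unified answer per key.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```

The framework handles the grouping in between: every pair emitted by the mappers is sorted by key, so each reduce call sees all the counts for one word and can sum them into the unified answer.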
Pig and Hive
Pig is a high-level platform for creating MapReduce programs to be used with Hadoop. Hive is a data warehouse infrastructure built on top of Hadoop for the analysis and aggregation (summarizing) of data. Both of them compile down to MapReduce jobs. Pig is a procedural language in which one describes the procedures to apply to the data in Hadoop, whereas Hive is an SQL-like declarative language. Yahoo uses both Pig and Hive in its Hadoop toolkit. Here is an excellent resource from Lars George where he has compared both of these in detail.
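To illustrate the procedural-versus-declarative difference, here is the same word-count aggregation sketched in both languages; the table/relation name docs and the column word are hypothetical:

```
-- Hive (declarative, SQL-like): state the result you want
SELECT word, COUNT(*) AS total
FROM docs
GROUP BY word;

-- Pig (procedural): state the steps that produce it
A = LOAD 'docs' AS (word:chararray);
B = GROUP A BY word;
C = FOREACH B GENERATE group AS word, COUNT(A) AS total;
DUMP C;
```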
Microsoft and Big Data
Microsoft is committed to making Hadoop accessible to a broader class of end users, developers, and IT professionals. You can accelerate your Hadoop deployment through the simplicity of Hadoop on Windows and the use of familiar Microsoft products:
- Apache Hadoop connector for Microsoft SQL Server
- Apache Hadoop connector for Microsoft Parallel Data Warehouse
Here is the link for further reading.
Most Important
I cannot end this blog post without talking about the man from whom I heard about Big Data for the very first time.
Pinal Dave with Dr. David DeWitt
… and of course, Happy Valentine’s Day!
Reference: Pinal Dave (https://blog.sqlauthority.com)
14 Comments
Hi Pinal,
Good morning,
Thanks, and Happy Valentine’s Day to you too!
“I love who loves SQL…..!!!!!”
Happy Valentine’s Day to you too!
Pinal, good insight into how big data has evolved! Have you looked at HPCC Systems, a superior alternative to Hadoop? Also based on a shared-nothing distributed architecture, the HPCC Systems platform provides an excellent low-cost, one-stop solution for BI and analytics needs. HPCC Systems is a mature, enterprise-ready, data-intensive processing and delivery platform, architected from the ground up as a cohesive and consistent environment to cover big data extraction, transformation and loading (ETL), data processing, linking, and real-time querying. Powered by ECL, a data-oriented declarative domain-specific language for big data, the HPCC Systems platform enables data scientists and analysts to directly express their data transformations and queries. Read more about how it compares to Hadoop at
I want to learn SQL Server. Please provide material.
My first hard drive was a 10 MB (not a typo – MB!) “hard card” that I dropped into the expansion slot of my IBM PC (5150) some time in 1985. I had gone from 1982 until 1985 using the dual 5 1/4 inch 360K floppy discs and never thought I would fill up that 10 MB drive!
Hi,
I am having a date issue. For example, if the date is ‘2010-10-10 22:10:00.000’, it is valid and is inserted into my database table. But if the date is ‘2010-10-10 25:10:00.000’, it is invalid. I need to check whether the number of hours is greater than 24; if it is, add one day to the date and subtract 24 hours from the timestamp.
Please advise.
Thanks in advance.
What is the datatype of the column that stores date values?
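If the value arrives as a string (an hour of 25 can never be stored in a datetime column directly), a hedged T-SQL sketch of the normalization described above might look like this; the variable names are my own illustration:

```sql
-- Sketch only: assumes the raw value is a varchar in 'yyyy-MM-dd hh:mi:ss.mmm' form.
DECLARE @raw varchar(23) = '2010-10-10 25:10:00.000';
DECLARE @hh int = CAST(SUBSTRING(@raw, 12, 2) AS int); -- hour portion

SELECT CASE
         WHEN @hh >= 24
           -- replace the hour with (hour - 24), then add one day
           THEN DATEADD(DAY, 1, CAST(STUFF(@raw, 12, 2,
                    RIGHT('0' + CAST(@hh - 24 AS varchar(2)), 2)) AS datetime))
         ELSE CAST(@raw AS datetime)
       END AS normalized_date; -- returns 2010-10-11 01:10:00.000
```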
Thanks Pinal,
I was searching the web to learn about Big Data and Hadoop, and I finally ended up at your blog. Simple and clear…
-Subbu
Thanks Pinal,
Big Data is an amount of data that is difficult or impossible for a traditional relational database to handle.
– a simple definition.
These words are enough to clear up my big question about Big Data.
Hi Pinal, Microsoft provided a SQL Connector for Apache Hadoop (Linux) for SQL Server 2008 R2. Do we have a similar connector for SQL Server 2012 too? How does SQL Server 2012 connect to Apache Hadoop on Linux?
Four key characteristics that define big data:
>> Volume
>> Velocity
>> Variety
>> Value
Nonsense. These describe the capabilities of “Big Data” platforms, not the data itself.
For example, let’s say you have a workload that does not include “Volume”. By not meeting this metric, would you rule out using a Big Data platform?
These 4 keys don’t define Big Data. If anything, big data is data that is too large to be processed quickly, so Volume goes hand in hand with big data, but Variety and Value do not equate to Big Data.
Hi Pinal Dave,
I have already run a one-to-one export, from one HDFS table to one SQL Server table, using sqoop export.
In Hadoop, is there any option to export the data from:
one HDFS table to multiple SQL Server tables,
many HDFS tables to one single SQL Server table,
many HDFS tables to many SQL Server tables?
If it is possible to do these operations, please reply to my question.
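For what it’s worth, a single sqoop export invocation maps one HDFS directory to one table, so the one-to-many and many-to-many cases are usually handled by running one export per table pair, often from a small script. A minimal sketch, with a hypothetical host, database, table, and path:

```
# one sqoop export per (HDFS directory, SQL Server table) pair
sqoop export \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=SalesDB" \
  --username sqoop_user -P \
  --table Orders \
  --export-dir /user/hive/warehouse/orders
```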