Power View, a feature of the SQL Server 2012 Reporting Services Add-in for Microsoft SharePoint Server 2010 Enterprise Edition, is an interactive data exploration, visualization, and presentation experience. It provides intuitive ad-hoc reporting for business users such as data analysts, business decision makers, and information workers. Microsoft has recently released a very interesting whitepaper which covers a sample scenario validating the connectivity of Power View reports to both PowerPivot workbooks and tabular models. This whitepaper covers the following important concepts about Power View:
Understanding the hardware and software requirements and their download locations
Installing and configuring the required infrastructure when Power View and its data models are on the same computer and on different computers
Installing and configuring a computer used for client access to Power View reports, models, SharePoint, and Power View in a workgroup
Configuring single sign-on access for double-hop scenarios with and without Kerberos
This whitepaper covers many interesting scenarios. It would be really interesting to know if you are using Power View in your production environment. If yes, please share your experience here.
Data Quality Services is a very important feature of SQL Server. I have recently started exploring it and I am learning some good concepts. Here are two very important blog posts which one should go over before continuing with this post.
This article is an introduction to Data Quality Services for beginners. We will be using Excel files throughout this example.
In the first article we learned how to install DQS. In this article we will learn how to build a Knowledge Base and use it to identify the quality of the data as well as to correct bad-quality data.
Here are the two very important steps we will be learning in this tutorial.
Building a New Knowledge Base
Creating a New Data Quality Project
Let us start building the Knowledge Base. Click on New Knowledge Base.
In our project we will use an Excel file as the knowledge base. Here is the Excel file we will be using. There are two columns: one is Colors and the other is Shade. They are independent columns, not related to each other. The point I am trying to show is that Column A contains unique data and Column B contains duplicate records.
Clicking on New Knowledge Base will bring up the following screen. Enter the name of the new knowledge base.
Clicking NEXT brings up the following screen, where you can select the Excel file and the source columns. I have selected both Colors and Shade as source columns.
Creating a domain is very important. Here you can create individual domains or a composite domain built from Colors and Shade.
As this is the first example, I will create individual domains: for Colors I will create the domain Colors, and for Shade I will create the domain Shade.
Here is how the screen looks after creating the domains.
Clicking NEXT brings you to the following screen, where you can do data discovery. Clicking START starts the processing of the source data provided.
The pre-processed data shows various information about the source data. In our case it shows that the Colors column has unique data, whereas the Shade column has non-unique data with only two unique values.
In the next screen you can add more rows, as well as see the frequency of the data, as the unique values are listed.
Clicking NEXT publishes the knowledge base we just created.
Now the knowledge base is created. We will take some random data and attempt a DQS implementation over it. I am using another Excel sheet here for simplicity; in reality you can easily use a SQL Server table instead.
Click on New Data Quality Project to start a DQS project.
The next screen asks which knowledge base to use. We will use the Colors knowledge base we just created.
In the Colors knowledge base we had two columns: 1) Colors and 2) Shade. In our case we will use both mappings here. Users can select one or multiple column mappings.
Now comes the most important phase of the whole project. Click on Start; it runs the cleansing process and shows various results.
In our case there were two columns to be processed, and DQS completed the task with the necessary information. It shows that it did not correct any value in the Colors column by itself, but it does have a suggestion for the Shade column. We can train DQS to correct values, but let us keep that subject for future blog posts.
Now click NEXT and keep the Colors domain selected on the left side. The screen shows two incorrect values that need to be corrected. This is the place where, once a value is corrected, it will be auto-corrected in the future.
I manually corrected the values here and clicked the Approve radio buttons. As soon as I clicked Approve, the rows disappeared from this tab and moved to the Corrected tab. If I had clicked Reject, the rows would have moved to the Invalid tab instead.
In this screen you can see the two corrected rows. You can click on the Corrected tab and see the six previously validated rows which passed the DQS process.
Now let us click on the Shade domain on the left side of the screen. This domain shows very interesting details: the DQS system guessed the correct answer, Dark, with a confidence level of 77%. That is quite a high confidence level, and manual observation also confirms that Dark is the correct answer. I clicked on Approve and the row moved to the Corrected tab.
On the next screen DQS shows a summary of all the activities. It also demonstrates how the correction of the quality of the data was performed. The user can export the data to a SQL Server table, a CSV file, or Excel.
The user also has the option to export either the data along with all the associated cleansing info, or the data only. I will select Data only for demonstration purposes.
Clicking Export generates the file.
Let us open the generated file. It looks as follows, and it is pretty complete and corrected.
Well, we have successfully completed the DQS process. The process is indeed very easy. I suggest you try it out yourself; you will find it very easy to learn. In the future we will go over advanced concepts.
Are you using this feature on your production server? If yes, please leave a comment with your environment and business need. It will indeed be interesting to see where it is implemented.
Decoding the human genome originally took 10 years to process; now it can be achieved in one week – The Economist.
This blog post is written in response to the T-SQL Tuesday post on Big Data. This is a very interesting subject. Data is growing every single day. I remember my first computer, which had a 1 GB hard drive. I told my dad that I would never need any more hard drive space; we were good for the next 10 years. I bought a much larger hard drive after two years, and today I have a NAS at home which can hold 2 TB, plus a few file-hosting accounts in the cloud as well. The point is, the amount of data any individual deals with has increased significantly.
There was a time of floppy drives. Today some autocorrect software does not even recognize that term. However, USB drives, pen drives, and jump drives are common names across the industry. It is a race, and I really do not know where it will stop.
In the same way, the amount of data has grown so wildly that relational databases are not able to handle its processing. A conventional RDBMS faces challenges in processing and analyzing data beyond a certain very large size. Big Data is a volume of data that is difficult or impossible for a traditional relational database to handle. The current moving targets for Big Data are terabytes, exabytes, and zettabytes.
Hadoop is a software framework which supports data-intensive processes and enables applications to work with Big Data. Technically it is inspired by MapReduce technology; however, there is a very interesting story behind its name. The creator of Hadoop named it after his son's toy elephant, which was called Hadoop. For the same reason, the Hadoop logo is a yellow toy elephant.
Two very famous companies use Hadoop to process their large data sets: Facebook and Yahoo. The Hadoop platform can solve problems where the deep analysis is complex and unstructured but needs to be done in a reasonable time.
Hadoop is architected to run on a large number of machines in a 'shared nothing' architecture. Any independent servers can be put to use by Hadoop. Hadoop maintains and manages the data among all the independent servers. An individual user cannot directly gain access to the data, as the data is divided among these servers. Additionally, a single piece of data can be replicated on multiple servers, which keeps the data available in case of a disaster or a single machine failure. Hadoop uses the MapReduce software framework to return unified data.
This technology is conceptually simple but very powerful when put to work within the Hadoop framework. There are two major steps: 1) Map 2) Reduce.
In the Map step, a master node takes the input, divides it into smaller chunks, and distributes them to worker nodes. In the Reduce step, it collects all the partial solutions to the problem and returns one unified answer as output. Both of these steps use functions which rely on key-value pairs. This process runs on various nodes in parallel, bringing faster results to the framework.
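To make the two steps concrete, consider the classic word-count job described in relational terms: the Map step emits a (word, 1) key-value pair for every word occurrence, and the Reduce step sums the values for each key. The following T-SQL sketch (assuming a hypothetical #Words table with one row per word occurrence) computes the same result that a word-count MapReduce job would:

```sql
-- Hypothetical input: one row per word occurrence, as a mapper would see it
CREATE TABLE #Words (Word VARCHAR(100));
INSERT INTO #Words VALUES ('hadoop'), ('data'), ('hadoop'), ('big'), ('data'), ('hadoop');

-- Map step: emit a (Word, 1) key-value pair for each occurrence
-- Reduce step: group by key and sum the values into one unified answer
SELECT Word AS [Key], COUNT(*) AS [Value]
FROM #Words
GROUP BY Word;
-- hadoop -> 3, data -> 2, big -> 1
```

The difference, of course, is that Hadoop runs the map and reduce functions in parallel across many independent servers rather than inside a single database engine.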
Pig and Hive
Pig is a high-level platform for creating MapReduce programs on Hadoop. Hive is a data warehouse infrastructure built on Hadoop for analysis and aggregation (summarization) of data. Both compile down to MapReduce jobs. Pig is a procedural language in which one describes the procedures to apply to the data; Hive is a SQL-like declarative language. Yahoo uses both Pig and Hive in its Hadoop toolkit. Here is an excellent resource from Lars George where he compares both of them in detail.
Microsoft and Big Data
Microsoft is committed to making Hadoop accessible to a broader class of end users, developers, and IT professionals. Accelerate your Hadoop deployment through the simplicity of Hadoop on Windows and the use of familiar Microsoft products.
Apache Hadoop connector for Microsoft SQL Server
Apache Hadoop connector for Microsoft Parallel DataWarehouse
I have received lots of interesting comments on the blog about what is in it. Many of you got it absolutely wrong and many got close to the right answer, but no one got it 100% correct. Well, no issue at all; I am going to give away the prize to whoever had the closest answer first, via personal email.
Here is the answer to the question about what is in the box. The box has my new book. In fact, I should say our new book, as I co-authored it with my very good friend Vinod Kumar. We had a real blast writing this book together and had lots of interesting conversations while we were writing it. This book has one simple goal: "master the basics."
This book is not only for people who are preparing for interviews. It is for everyone who wants to revisit the basics and prepare themselves in the technology. One always needs practical knowledge to do one's duty efficiently. This book talks about more than the basics. There are multiple ways to present learning: we could either create a simple book or make it interesting. We decided the learning should be interactive and opted for an interview question-and-answer format.
Here is a quick interview which we did together.
The core concept of this book will continue to evolve over time. I am sure many of you will come along with us on this journey and submit your suggestions to us to make this book a key reference for anybody who wants to start with SQL Server. Today we want to acknowledge the fact that you will help us keep this book alive with the latest updates. We want to thank everyone who participates in this journey with us.
With the release of expressor 3.3, expressor software has added a significant new feature to the expressor Studio tool – the ability to easily extend functionality through the incorporation of reusable script files. A developer using expressor Studio may write these scripts and add them to any number of projects, or you can integrate scripts written by other developers. Let’s see how this works.
Suppose you want to execute a one-to-many application in which each incoming record needs to be parsed into multiple output records. For example, a record containing monthly data over a year period needs to be reworked so that each emitted record contains data from a single month. The following fragment shows two representative data records, where the first field is the year and the following fields are the monthly data, January through December.
Think of each emitted record as comprising three fields: Year, Month, and Value.
Now you can write generic code within the Datascript Module that can handle any incoming record in a one-to-many application.
Since you want to use the Datascript Module in multiple applications, create this artifact within an expressor library.
In this example, the library is named modules_library.0 and the Datascript Module is named OneToMany.
Within the OneToMany module, include the following scripting.
Note that line 4, which defines the fully qualified name of the module, is entered by Studio when you first create the module. It is a concatenation of the library name and module file name.
The Datascript Module includes a single function named oneToMany, which takes as arguments the name of the attribute holding a value that uniquely identifies the incoming record and the incoming record that needs to be divided into multiple output records.
Observe how the code uses the generic field names GroupAttribute, Atr, and Val to populate each output record. It’s not until the collection of output records is returned to the calling code that the coding uses the actual attribute names (Year, Month, Value) of the output record.
Consequently, the actual attribute names used by the incoming and outgoing records are only referenced within the calling code, which is specific to an application.
All coding within the Datascript Module uses generic names applicable to any incoming data.
Now, let’s turn our attention to the code that uses this Datascript Module. In this case, the application is quite simple. It reads a file that contains the yearly summary records, uses a Transform operator to process each record into multiple output records, and then writes each monthly output record to a file.
Line 2 cross-references the Datascript Module. This statement is equivalent to simply copying and pasting the scripting from the module into the operator’s coding.
Line 5 invokes the oneToMany function defined in the module.
Lines 10 through 20 iterate through the collection of output records returned by the module’s oneToMany function, emitting each record individually.
To emit more than one record from the Transform operator, the scripting uses an iterator function, another new feature introduced in expressor 3.3.
Using modules written by other developers is no more involved. In fact, expressor 3.3 ships with many modules that your code can utilize. These include a cURL module that you can use to invoke Web services or FTP servers, a SAX XML module that you can use to parse an XML document, a socket module that you can use to interact with an external application, a SQL module that allows your code to execute a SQL statement against a remote database, as well as some expressor Datascript Modules that include utility functions useful for parsing CSV files and serializing and de-serializing expressor Datascript tables.
To use any of these modules, simply include the necessary require statement at the beginning of your scripting. The expressor Community Knowledge Base includes entries that describe use of these modules. The expressor Studio desktop ETL tool is freely downloadable from the expressor website. I encourage you to give it a try!
I have recently been learning various new things, and I would like to share some very interesting resources here today. Microsoft SQL Server Migration Assistant (SSMA) is a toolkit that dramatically cuts the effort, cost, and risk of migrating from other data platforms to SQL Server 2005, SQL Server 2008, SQL Server 2008 R2, and SQL Azure.
On the very front page, I noticed a very interesting diagram which displays four database products. One can click on any of them to go to a detailed description of it. Here is an interesting video regarding this tool.
Many of us tend to overlook simple things even if we are capable of doing complex work. In SQL Server 2008, inline variable assignment is available. This feature has existed for the last three years, but I hardly see it utilized. One common argument is that when a project is migrated from an earlier version, the feature goes unused. I totally accept and acknowledge this argument. However, my point is that this new feature should be used in all new coding. What is your opinion?
The code which we used in SQL Server 2005 and the earlier version is as follows:
DECLARE @iVariable INT, @vVariable VARCHAR(100), @dDateTime DATETIME
SET @iVariable = 1
SET @vVariable = 'myvar'
SET @dDateTime = GETDATE()
SELECT @iVariable iVar, @vVariable vVar, @dDateTime dDT
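With SQL Server 2008's inline variable assignment, the same variables can be declared and initialized in a single statement, removing the separate SET statements:

```sql
DECLARE @iVariable INT = 1,
        @vVariable VARCHAR(100) = 'myvar',
        @dDateTime DATETIME = GETDATE()
SELECT @iVariable iVar, @vVariable vVar, @dDateTime dDT
```

Both versions return the same result; the inline form simply keeps each declaration and its initial value together.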
Niraj Bhatt works as an Enterprise Architect for a Fortune 500 company and has an innate passion for building / studying software systems. He is a top rated speaker at various technical forums including Tech·Ed, MCT Summit, Developer Summit, and Virtual Tech Days, among others. Having run a successful startup for four years Niraj enjoys working on – IT innovations that can impact an enterprise bottom line, streamlining IT budgets through IT consolidation, architecture and integration of systems, performance tuning, and review of enterprise applications. He has received Microsoft MVP award for ASP.NET, Connected Systems and most recently on Windows Azure. When he is away from his laptop, you will find him taking deep dives in automobiles, pottery, rafting, photography, cooking and financial statements though not necessarily in that order. He is also a manager/speaker at BDOTNET, Asia’s largest .NET user group.
As the data in your applications grows, it's the database that usually becomes a bottleneck. It's hard to scale a relational DB, and the preferred approach for large-scale applications is to create separate databases for writes and reads. These databases are referred to as the transactional database and the reporting database. Though there are tools and techniques which can allow you to create a snapshot of your transactional database for reporting purposes, sometimes they don't quite fit the reporting requirements of an enterprise. These requirements typically are data analytics, an effective schema (for an information worker to self-service herself), historical data, better performance (flat data, no joins), etc. This is where the need for a data warehouse or an OLAP system arises.
A key point to remember is that a data warehouse is mostly a relational database. It's built on top of the same concepts: tables, rows, columns, primary keys, foreign keys, etc. Before we talk about how data warehouses are typically structured, let's understand the key components that create a data flow between OLTP systems and OLAP systems. There are three major areas:
a) The OLTP system should be capable of tracking its changes, as all these changes should go back to the data warehouse for historical recording. For example, if an OLTP transaction moves a customer from the silver to the gold category, the OLTP system needs to ensure that this change is tracked and sent to the data warehouse for reporting purposes. A report in this context could be how many customers, divided by geography, moved from the silver to the gold category. In data warehouse terminology this process is called Change Data Capture. There are quite a few systems that leverage database triggers to move these changes to corresponding tracking tables. There are also out-of-the-box features provided by some databases; e.g. SQL Server 2008 offers Change Data Capture and Change Tracking for addressing such requirements.
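As a sketch of the out-of-the-box route in SQL Server 2008, Change Data Capture is enabled first at the database level and then per table. The dbo.Customer table name here is only an illustration for the silver-to-gold example:

```sql
-- Enable CDC for the current database (requires sysadmin)
EXEC sys.sp_cdc_enable_db;

-- Enable CDC on the (hypothetical) table whose category changes we want to track
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Customer',
    @role_name     = NULL;   -- no gating role; any reader may query the changes
```

Once enabled, SQL Server's capture job records every insert, update, and delete on the table into a change table without any trigger code of your own.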
b) After we make the OLTP system capable of tracking its changes, we need to provision a batch process that runs periodically, takes these changes from the OLTP system, and loads them into the data warehouse. There are many tools out there that can help you fill this gap; SQL Server Integration Services happens to be one of them.
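Such a batch extract can read just the changed rows from the CDC change table. A minimal T-SQL sketch, assuming CDC was enabled on a hypothetical dbo.Customer table (the function name cdc.fn_cdc_get_all_changes_dbo_Customer is generated by SQL Server from the schema and table name):

```sql
DECLARE @from_lsn BINARY(10), @to_lsn BINARY(10);

-- Window of changes: from the start of the capture instance up to now
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Customer');
SET @to_lsn   = sys.fn_cdc_get_max_lsn();

-- All inserts, updates, and deletes in that window, ready to load into the warehouse
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');
```

In a real job you would persist the last LSN processed and use it as the lower bound of the next run, so each batch picks up only new changes.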
c) So we have an OLTP system that knows how to track its changes, and we have jobs that run periodically to move these changes to the warehouse. The question that remains is how the warehouse will record these changes. This structural pattern in the data warehouse arena is often covered under something called Slowly Changing Dimension (SCD). While we will talk about dimensions in a while, SCD can be applied to pure relational tables too. SCD enables a database structure to capture historical data. This creates multiple records for a given entity, which is why data warehouses prefer having their own primary key, often known as a surrogate key.
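A minimal sketch of a Type 2 slowly changing dimension for the customer-category example above (table and column names are illustrative): the surrogate key identifies each historical version, while the business key from the OLTP system repeats across versions.

```sql
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key owned by the warehouse
    CustomerID   INT          NOT NULL,         -- business key from OLTP (repeats across versions)
    CustomerName VARCHAR(100) NOT NULL,
    Category     VARCHAR(10)  NOT NULL,         -- e.g. 'Silver', 'Gold'
    ValidFrom    DATETIME     NOT NULL,
    ValidTo      DATETIME     NULL              -- NULL marks the current version
);

-- Moving customer 42 from Silver to Gold closes the old row and adds a new one
UPDATE dbo.DimCustomer
SET ValidTo = GETDATE()
WHERE CustomerID = 42 AND ValidTo IS NULL;

INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, Category, ValidFrom, ValidTo)
VALUES (42, 'John Doe', 'Gold', GETDATE(), NULL);
```

Both rows for customer 42 survive, so a historical report can still count him among the silver customers for the earlier period.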
As I mentioned, a data warehouse is just a relational database, but the industry often attributes a specific schema style to data warehouses. These styles are the Star schema and the Snowflake schema. The motivation behind these styles is to create a flat database structure (as opposed to a normalized one), which is easy to understand and use, easy to query, and easy to slice and dice. A star schema is a database structure made up of dimensions and facts. Facts are generally the numbers (sales, quantity, etc.) that you want to slice and dice. Fact tables hold these numbers and have references (foreign keys) to a set of tables that provide context around those facts. E.g. if you have recorded 10,000 USD in sales, that number would go in a sales fact table and could have foreign keys attached to it that refer to the sales agent responsible for the sale and to a time table which contains the dates between which that sale was made. These agent and time tables are called dimensions, which provide context to the numbers stored in fact tables. This schema structure, with the fact at the center surrounded by dimensions, is called a Star schema. A similar structure, with the difference that the dimension tables are normalized, is called a Snowflake schema.
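The sales example above could be sketched as the following star schema, with the fact table at the center referencing the agent and time dimensions (table and column names are illustrative):

```sql
CREATE TABLE dbo.DimAgent
(
    AgentKey  INT IDENTITY(1,1) PRIMARY KEY,
    AgentName VARCHAR(100) NOT NULL
);

CREATE TABLE dbo.DimDate
(
    DateKey      INT PRIMARY KEY,   -- e.g. 20120315
    CalendarDate DATE     NOT NULL,
    [Month]      TINYINT  NOT NULL,
    [Year]       SMALLINT NOT NULL
);

-- Fact table: the numbers to slice and dice, plus foreign keys for context
CREATE TABLE dbo.FactSales
(
    AgentKey    INT   NOT NULL REFERENCES dbo.DimAgent(AgentKey),
    DateKey     INT   NOT NULL REFERENCES dbo.DimDate(DateKey),
    SalesAmount MONEY NOT NULL,     -- e.g. 10000.00 USD
    Quantity    INT   NOT NULL
);
```

In a snowflake variant, DimDate might be further normalized into separate month and year tables; the fact table stays the same.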
This relational structure of facts and dimensions serves as an input for another analysis structure called a Cube. Though physically a Cube is a special structure supported by commercial products like SQL Server Analysis Services, logically it's a multidimensional structure where dimensions define the sides of the cube and facts define the content. Facts are often called Measures inside a cube. Dimensions often tend to form a hierarchy. E.g. Product may be broken into categories, and categories in turn into individual items. Category and Items are often referred to as Levels and their constituents as Members, with the overall structure called a Hierarchy. Measures are rolled up as per the dimensional hierarchy. These rolled-up measures are called Aggregates. This may seem like an overwhelming vocabulary to deal with, but don't worry, it will sink in as you start working with Cubes.
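The roll-up of measures along a hierarchy has a relational analogue. This sketch (an illustrative dbo.Sales table, not an Analysis Services feature) uses GROUP BY ROLLUP to aggregate a sales measure at the Item level, the Category level, and the grand total, much as a cube pre-computes aggregates along the Product hierarchy:

```sql
-- Aggregates at every level of the Category > Item hierarchy.
-- Rows with Item = NULL are category subtotals; the row with both
-- columns NULL is the grand total.
SELECT Category, Item, SUM(SalesAmount) AS TotalSales
FROM dbo.Sales
GROUP BY ROLLUP (Category, Item);
```

A cube differs in that these aggregates are computed ahead of time and stored, so browsing up and down the hierarchy does not re-scan the detail rows.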
Let’s see few other terms that we would run into while talking about data warehouses.
ODS, or Operational Data Store, is a frequently misused term. There are a few users in your organization who want to report on the most current data and can't afford to miss a single transaction in their report. Then there is another set of users who typically don't care how current the data is: mostly senior-level executives who are interested in trending, mining, forecasting, strategizing, etc., and don't care about one specific transaction. This is where an ODS can come in handy. An ODS can use the same star schema and the OLAP cubes we saw earlier. The only difference is that the data inside an ODS is short-lived, i.e. kept for a few months, and the ODS syncs with the OLTP system every few minutes. The data warehouse can periodically sync with the ODS, either daily or weekly, depending on business drivers.
Data marts are another frequently talked-about topic in data warehousing. They are subject-specific data warehouses. Data warehouses that try to span an entire enterprise are normally too big to scope, build, manage, track, etc. Hence they are often scaled down to something called a data mart that supports a specific segment of business like sales, marketing, or support. Data marts, too, are often designed using the star schema model discussed earlier. The industry is divided when it comes to the use of data marts. Some experts prefer having data marts along with a central data warehouse. The data warehouse here acts as an information staging and distribution hub, with the spokes being data marts connected via data feeds serving summarized data. Others eliminate the need for a centralized data warehouse, citing that most users want to report on detailed data.
I am very excited that the Fast Track Data Warehouse 3.0 reference guide has been announced. As a consultant I have always enjoyed working on Fast Track Data Warehouse projects, as they truly express the potential of the SQL Server engine. Here are a few details of the enhancements in the Fast Track Data Warehouse 3.0 reference architecture.
The SQL Server Fast Track Data Warehouse initiative provides a basic methodology and concrete examples for the deployment of balanced hardware and database configuration for a data warehousing workload. Balance is measured across the key components of a SQL Server installation; storage, server, application settings, and configuration settings for each component are evaluated.
FTDW 3.0 Architecture
Basic component architecture for FT 3.0 based systems.
New Memory Guidelines
Minimum and maximum tested memory configurations by server socket count.
Additional Startup Options
Notes for T-834 and setting for Lock Pages in Memory.
RAID1+0 now standard (RAID1 was used in FT 2.0).
Query provided for evaluating logical fragmentation.
Additional options for CI table loads.
Additional detail and explanation of FTDW MCR Rating.
An integral part of any BI system is the data warehouse—a central repository of data that is regularly refreshed from the source systems. The new data is transferred at regular intervals by extract, transform, and load (ETL) processes.
This whitepaper covers best practices for data warehousing. It discusses ETL, analysis, and reporting, as well as the relational database. Its main focus is on 'architecture' and 'performance'.
The traditional form of storing digital data has been disk storage. However, huge advances in technology mean that data storage has needed to evolve to keep up with fast-changing times. Microsoft SQL Server has gone through a major overhaul in order to keep up with the amount of data storage that is necessary, and that is where data warehousing comes into play.
For many online applications, there is a need to access not only small amounts of information from disk storage, but large amounts in the form of sets. SQL Server allows access to these sets of data in a sequential manner that optimizes performance. This technology makes SQL Server the go-to system to fit these computing needs.
SQL Server has also solved the problem of moving from "data marts" to data warehouses. As companies grow, their digital storage needs change and expand as well. Previously, transferring a system over to SQL Server meant having to destroy old data during the transfer. This was the source of a lot of data loss and server problems. However, the newer versions of SQL Server have built-in systems that help organizations transfer to the larger data storage system.
The costs of data storage and data warehousing are becoming a huge concern for small businesses and corporations alike. It is a known rule of thumb that the larger the main server, the more expensive it is. SQL Server also addresses this problem by allowing data warehousing to be run on multiple processors, in the end rendering it much cheaper than having one main data warehousing server. SQL Server uses the "hub-and-spoke" system, which allows data warehousing to be broken up across multiple systems without losing any data.
A huge question is how SQL Server can store so much data. As with any storage system, some data is compressed before it is stored. Other systems undergo lengthy processes to retrieve this compressed data due to updates and possible loss of information, but SQL Server forgoes this updating process, which allows for quick and easy access as well as compression.
The issue of compressing so much data is also addressed with parallel processing. In this system, multiple processors are "enlisted" to help store the data. SQL Server has the technology to divide the data appropriately and then easily access it on command. Rather than slowing one server down, the data can be divided up and easily handled.
One of the beauties of computing is that the computer can calculate probabilities much faster than a human; indeed, this was one of the main functions of early computer development. SQL Server uses this capacity to store data, using "Probability of Access" technology. SQL Server can determine which data is "hot" and likely to be accessed the most, which is moderate, and which is cold, that is, has a low probability of access. With these designations, SQL Server can store hot data in "high performance" storage so it can be quickly retrieved, while cold data is stored elsewhere, freeing up the processors to work more efficiently.
SQL Server has not only optimized the storage of "static" data, or individual points of information, but has also created technology to store "streaming" data. Streaming data is a way to record multiple "events" over a period of time, like measurements of temperature thousands of times per minute. This kind of data storage would completely bog down another server, but SQL Server has developed the technology to store this data efficiently without slowing down processors. And, of course, the data remains easily accessible through the data warehousing systems already discussed.
Streaming data is difficult for most servers because of storage constraints. Streaming data overwhelms servers because they cannot store and retrieve this much data efficiently, and it bogs down the system. Other companies have attempted to create programs that allow access to streaming data, but it often comes at the cost of processor speed. The usual process is to store streaming data as historical data and then retrieve it as static data, which slows down all the processors. SQL Server has created a system that allows analysis of this data by presenting it interactively.
In summary, SQL Server is at the very forefront of future data warehousing – DW 2.0. SQL Server is able to store large amounts of data at a low cost and with maximum efficiency. Microsoft has taken into account the need for multiple processors and an easy way to transfer old data into this new system. The technology behind SQL Server is sure to make data access much more efficient.