Power View, a feature of the SQL Server 2012 Reporting Services Add-in for Microsoft SharePoint Server 2010 Enterprise Edition, is an interactive data exploration, visualization, and presentation experience. It provides intuitive ad-hoc reporting for business users such as data analysts, business decision makers, and information workers. Microsoft has recently released a very interesting whitepaper which covers a sample scenario that validates the connectivity of Power View reports to both PowerPivot workbooks and tabular models. This whitepaper covers the following important concepts about Power View:
Understanding the hardware and software requirements and their download locations
Installing and configuring the required infrastructure when Power View and its data models are on the same computer and on different computers
Installing and configuring a computer used for client access to Power View reports, models, SharePoint 2012, and Power View in a workgroup
Configuring single sign-on access for double-hop scenarios with and without Kerberos
This whitepaper covers many interesting scenarios. It would be really interesting to know if you are using Power View in your production environment. If yes, please share your experience here.
Data Quality Services is a very important feature of SQL Server. I have recently started exploring it and I am learning some good concepts. Here are two very important blog posts which one should go over before continuing with this post.
This article is an introduction to Data Quality Services for beginners. We will be using an Excel file throughout the example.
In the first article we learned how to install DQS. In this article we will see how to build a Knowledge Base and use it to identify data quality issues as well as correct bad data.
Here are the two very important steps we will be learning in this tutorial.
Building a New Knowledge Base
Creating a New Data Quality Project
Let us start building the Knowledge Base. Click on New Knowledge Base.
In our project we will be using Excel as the knowledge base source. Here is the Excel file we will be using. There are two columns: one is Colors and the other is Shade. They are independent columns and not related to each other. The point I am trying to show is that Column A contains unique data while Column B contains duplicate records.
Clicking on New Knowledge Base will bring up the following screen. Enter the name of the new knowledge base.
Clicking NEXT will bring up the following screen, where you can select the Excel file and the source columns. I have selected both Colors and Shade as source columns.
Creating a domain is very important. Here you can create individual domains or a composite domain built from Colors and Shade.
As this is the first example, I will create individual domains – a Colors domain for Colors and a Shade domain for Shade.
Here is how the screen looks after both domains have been created.
Clicking NEXT brings you to the following screen, where you can perform data discovery. Clicking START begins processing the source data.
The pre-processed data shows various information about the source data. In our case it shows that the Colors column has unique data whereas the Shade column has non-unique data, with only two unique rows.
On the next screen you can add more rows as well as see the frequency of the data, since the values are listed uniquely.
Clicking NEXT will publish the knowledge base we have just created.
Now that the knowledge base is created, we will take some random data and attempt a DQS implementation over it. I am using another Excel sheet here for simplicity. In reality you can easily use a SQL Server table for the same, as sketched below.
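If you choose a SQL Server table instead of Excel as the cleansing source, a minimal sketch could look like the following (the table name and values are hypothetical, mirroring the Colors and Shade columns used above):

-- Hypothetical source table with a couple of deliberately bad values
CREATE TABLE dbo.ColorsSource
(
    Colors VARCHAR(50),
    Shade  VARCHAR(50)
);

INSERT INTO dbo.ColorsSource (Colors, Shade)
VALUES ('Red',   'Dark'),
       ('Blu',   'Light'),   -- misspelled value for DQS to flag
       ('Green', 'Drak');    -- misspelled value for DQS to suggest a correction

You would then map the Data Quality Project to this table instead of the Excel sheet; the rest of the cleansing steps stay the same.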
Click on New Data Quality Project to start a DQS project.
The next screen asks which knowledge base to use. We will be using the Colors knowledge base we just created.
In the Colors knowledge base we had two columns – 1) Colors and 2) Shade. In our case we will be using both of the mappings here. A user can select one or multiple column mappings here.
Now comes the most important phase of the project. Click on Start and it will run the cleansing process and show various results.
In our case there were two columns to be processed, and it completed the task and reported the necessary information. It shows that in the Colors column it has not corrected any value by itself, but for the Shade column it has a suggestion. We can train DQS to correct values, but let us keep that subject for future blog posts.
Now click NEXT and keep the Colors domain selected on the left side. It shows that there are two incorrect values which need to be corrected. This is the place where, once a value is corrected, it will be auto-corrected in the future.
I manually corrected the values here and clicked on the Approve radio buttons. As soon as I clicked the Approve buttons, the rows disappeared from this tab and moved to the Corrected tab. If I had clicked Reject, the rows would have moved to the Invalid tab instead.
On this screen you can see the two corrected rows. You can click on the Correct tab and see the previously validated six rows which passed the DQS process.
Now let us click on the Shade domain on the left side of the screen. This domain shows very interesting details, as the DQS system guessed the correct answer as Dark with a confidence level of 77%. That is quite a high confidence level, and manual observation also confirms that Dark is the correct answer. I clicked on Approve and the row moved to the Corrected tab.
On the next screen DQS shows the summary of all the activities. It also demonstrates how the quality of the data was corrected. The user can export their data to a SQL Server table, CSV file, or Excel.
The user also has an option to export either the data along with all the associated cleansing info or the data only. I will select Data only for demonstration purposes.
Clicking Export will generate the file.
Let us open the generated file. It looks like the following – complete and corrected.
Well, we have successfully completed the DQS process. The process is indeed very easy. I suggest you try this out yourself and you will find it very easy to learn. In future posts we will go over advanced concepts.
Are you using this feature on your production server? If yes, please leave a comment with your environment and business need. It will be interesting to see where it is implemented.
Data modeling is the most important task for any BI professional. As a matter of fact, the biggest challenge is organizing disparate data into an analytic model that effectively and efficiently supports reporting and analysis. SQL Server 2012 introduces the BI Semantic Model (BISM), a single model that can support a broad range of reporting and analysis while blending two Analysis Services modeling experiences behind the scenes.
Multidimensional modeling – enables BI professionals to create sophisticated multidimensional cubes using traditional online analytical processing (OLAP).
Tabular modeling – provides self-service data modeling capabilities to business and data analysts.
As data modeling evolves and business needs grow, new technologies and tools are emerging to help end users make the necessary adjustments to their reporting and analysis needs. This white paper provides practical guidance to help you decide which SQL Server 2012 Analysis Services modeling experience – tabular or multidimensional – fits your needs.
Do let me know your opinion in a comment. In simple words – when would you use Tabular modeling and when Multidimensional modeling?
I am in front of a computer 16+ hours a day. However, I accept that I am a bit lazy when it comes to doing installations. I always prefer that the IT department install my computer so I can use it right away. I get the same feeling when I have to install a Beta, RC, or any other pre-release version on my computer – I prefer virtual machines for that. Then again, I do not like to prepare virtual machines either. So the following news is very exciting: Microsoft has prepared a virtual machine for all of us with all the essential Business Intelligence features of SQL Server 2012 RC0. This virtual machine comes ready with Power View reports and PowerPivot Excel documents. Additionally, it has all the content packs which help you explore various business intelligence capabilities.
Power View provides intuitive ad-hoc reporting for business users such as data analysts, business decision makers, and information workers. They can easily create and interact with views of data from models based on PowerPivot workbooks or tabular models deployed to SQL Server 2012 Analysis Services (SSAS) Release Candidate 0 (RC0) instances. Power View is a browser-based Silverlight application launched from SharePoint Server 2010.
PowerPivot for Microsoft Excel 2010 provides ground-breaking technology, such as fast manipulation of large data sets (often millions of rows), streamlined integration of data, and the ability to effortlessly share your analysis through Microsoft SharePoint 2010.
I was presenting a SQL Server session at one of the Tech Ed On Road events in India. I was asked a very interesting question during the 'Stump the Speaker' session. I am sharing it with all of you here.
Question: Can you tell me in simple words what dimension, member, and the other terms of multidimensional space mean? There is no simple example for them.
This is an extremely fundamental question if you know Analysis Services. Those who have had no exposure to it and have not yet started on this subject may find it a bit difficult. I really liked the question, so I decided to answer it there as well as blog about it here.
Answer: Here are the most important terms of multidimensional space – dimension, member, value, attribute and size.
Dimension – It describes the points of interest for analysis.
Member – It is one of the points of interest within a dimension.
Value – It uniquely describes the member.
Attribute – It is a collection of multiple members.
Size – It is the total extent of a dimension.
Let us understand this in further detail by taking the example of a space. I am going to use distance as the space in our example.
Dimension – Distance is our dimension.
Member – Kilometer – We can measure distance in kilometers.
Value – 4 – We can measure distance in the kilometer unit, and the value of the unit can be 4.
Attribute – Kilometer, Miles, Meter – The complete set of members is called an attribute.
Size – 100 KM – The maximum size decided for the dimension is called the size.
The same example can also be defined using the time space. Here is the example using time; a small relational sketch follows the list.
Dimension – Time
Member – Date
Value – 25
Attribute – 1, 2, 3…31
Size – 31
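To make the time example a little more concrete, here is a minimal relational sketch (the table and column names are hypothetical); in an Analysis Services project a table like this would back the Time dimension:

-- Each row is a member of the Time dimension; DayOfMonth carries the value
-- (e.g. 25), and the dimension holds at most 31 such members.
CREATE TABLE dbo.DimTime
(
    DateKey    INT PRIMARY KEY,    -- surrogate key for the member
    FullDate   DATE NOT NULL,      -- the member itself (a date)
    DayOfMonth TINYINT NOT NULL    -- the value: 1, 2, 3 ... 31
);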
I hope it is now clear what multidimensional space is and what its terms mean.
My friend Jacob Sebastian runs the SQL BI Quiz competition, with 30 different questions – one for each day of the month. Participants get the opportunity to learn something new and win great awards.
Working with huge amounts of data is very common in data warehousing. It is necessary to create cubes on the data to make it meaningful and consumable. There are cases when retrieving data from a cube takes a lot of time.
Let us assume that your cube has been returning data very quickly. Suddenly, one day, it starts returning data very slowly. What are the three things you would check in order to diagnose this? After diagnosing the issue, what would you do to resolve the performance problem?
Niraj Bhatt works as an Enterprise Architect for a Fortune 500 company and has an innate passion for building and studying software systems. He is a top-rated speaker at various technical forums including Tech·Ed, MCT Summit, Developer Summit, and Virtual Tech Days, among others. Having run a successful startup for four years, Niraj enjoys working on IT innovations that can impact an enterprise bottom line, streamlining IT budgets through IT consolidation, architecture and integration of systems, performance tuning, and review of enterprise applications. He has received the Microsoft MVP award for ASP.NET, Connected Systems, and most recently for Windows Azure. When he is away from his laptop, you will find him taking deep dives into automobiles, pottery, rafting, photography, cooking, and financial statements, though not necessarily in that order. He is also a manager/speaker at BDOTNET, Asia's largest .NET user group.
As the data in your applications grows, it's the database that usually becomes a bottleneck. It's hard to scale a relational DB, and the preferred approach for large scale applications is to create separate databases for writes and reads. These databases are referred to as the transactional database and the reporting database. Though there are tools / techniques which can allow you to create a snapshot of your transactional database for reporting purposes, sometimes they don't quite fit the reporting requirements of an enterprise. These requirements typically are data analytics, an effective schema (for an information worker to self-service herself), historical data, better performance (flat data, no joins), etc. This is where the need for a data warehouse or an OLAP system arises.
A key point to remember is that a data warehouse is mostly a relational database. It's built on top of the same concepts: tables, rows, columns, primary keys, foreign keys, etc. Before we talk about how data warehouses are typically structured, let's understand the key components that create a data flow between OLTP systems and OLAP systems. There are 3 major areas to it:
a) The OLTP system should be capable of tracking its changes, as all these changes should flow to the data warehouse for historical recording. For example, if an OLTP transaction moves a customer from the silver to the gold category, the OLTP system needs to ensure that this change is tracked and sent to the data warehouse for reporting purposes. A report in this context could be how many customers, divided by geography, moved from the silver to the gold category. In data warehouse terminology this process is called Change Data Capture. There are quite a few systems that leverage database triggers to move these changes to corresponding tracking tables. There are also out-of-the-box features provided by some databases; for example, SQL Server 2008 offers Change Data Capture and Change Tracking for addressing such requirements.
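As a minimal sketch, assuming a hypothetical dbo.Customer table and the necessary permissions, enabling SQL Server's built-in Change Data Capture looks roughly like this:

-- Enable CDC at the database level (run in the OLTP database; requires sysadmin)
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for the hypothetical dbo.Customer table; changes are written
-- to an automatically created change table (capture instance dbo_Customer)
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Customer',
    @role_name     = NULL;    -- no gating role required to read the changes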
b) After we make the OLTP system capable of tracking its changes, we need to provision a batch process that runs periodically, takes these changes from the OLTP system, and loads them into the data warehouse. There are many tools out there that can help you fill this gap – SQL Server Integration Services happens to be one of them.
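Continuing the hypothetical dbo.Customer capture instance from the previous sketch, the periodic pull could be as simple as the query below; in practice an SSIS package would wrap this and write the rows into the warehouse, remembering the last LSN it processed:

-- Sketch: read all changes captured for dbo.Customer
-- (a real job would start from a stored watermark LSN, not the minimum)
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn(N'dbo_Customer');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');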
c) So we have an OLTP system that knows how to track its changes, and we have jobs that run periodically to move these changes to the warehouse. The question that remains is: how will the warehouse record these changes? This structural change in the data warehouse arena is often covered under something called a Slowly Changing Dimension (SCD). While we will talk about dimensions in a while, SCD can be applied to pure relational tables too. SCD enables a database structure to capture historical data. This creates multiple records for a given entity in the relational database, which is why data warehouses prefer having their own primary key, often known as a surrogate key.
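Here is a minimal sketch of a Type 2 slowly changing dimension, with hypothetical names; when a customer moves from silver to gold, the current row is closed out and a new row with a new surrogate key is inserted:

-- Type 2 SCD: every change creates a new row identified by a surrogate key
CREATE TABLE dbo.DimCustomer
(
    CustomerSK    INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key
    CustomerID    INT           NOT NULL,        -- business (natural) key from OLTP
    CustomerName  NVARCHAR(100) NOT NULL,
    Category      NVARCHAR(20)  NOT NULL,        -- e.g. 'Silver', 'Gold'
    EffectiveFrom DATE          NOT NULL,
    EffectiveTo   DATE          NULL,            -- NULL means the current version
    IsCurrent     BIT           NOT NULL
);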
As I mentioned, a data warehouse is just a relational database, but the industry often attributes a specific schema style to data warehouses. These styles are the Star Schema and the Snowflake Schema. The motivation behind these styles is to create a flat database structure (as opposed to a normalized one), which is easy to understand / use, easy to query, and easy to slice / dice. A star schema is a database structure made up of dimensions and facts. Facts are generally the numbers (sales, quantity, etc.) that you want to slice and dice. Fact tables hold these numbers and have references (foreign keys) to a set of tables that provide context around those facts. E.g. if you have recorded 10,000 USD in sales, that number would go in a sales fact table and could have foreign keys attached to it that refer to the sales agent responsible for the sale and to a time table which contains the dates between which that sale was made. These agent and time tables are called dimensions, which provide context to the numbers stored in fact tables. This schema structure of a fact at the center surrounded by dimensions is called a Star schema. A similar structure, with the difference that the dimension tables are normalized, is called a Snowflake schema.
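A minimal star schema sketch along these lines, with all names hypothetical, might look like this; the fact table holds the measure and the dimension tables supply the context:

CREATE TABLE dbo.DimAgent
(
    AgentSK   INT PRIMARY KEY,
    AgentName NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.DimDate
(
    DateSK   INT PRIMARY KEY,
    FullDate DATE NOT NULL
);

CREATE TABLE dbo.FactSales
(
    AgentSK     INT   NOT NULL REFERENCES dbo.DimAgent (AgentSK), -- who sold
    DateSK      INT   NOT NULL REFERENCES dbo.DimDate (DateSK),   -- when it was sold
    SalesAmount MONEY NOT NULL                                    -- the measure, e.g. 10,000 USD
);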
This relational structure of facts and dimensions serves as the input for another analysis structure called a Cube. Though physically a cube is a special structure supported by commercial engines like SQL Server Analysis Services, logically it's a multidimensional structure where dimensions define the sides of the cube and facts define the content. Facts are often called Measures inside a cube. Dimensions often tend to form a hierarchy; e.g. Product may be broken into categories, and categories in turn into individual items. Category and Item are often referred to as Levels and their constituents as Members, with the overall structure called a Hierarchy. Measures are rolled up as per the dimensional hierarchy, and these rolled-up measures are called Aggregates. This may seem like an overwhelming vocabulary to deal with, but don't worry, it will sink in as you start working with cubes.
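The cube itself is built and queried through Analysis Services rather than T-SQL, but as a rough relational analogy of rolling a measure up a hierarchy (assuming a hypothetical DimProduct dimension with Category and Item levels and a ProductSK key on the fact table), the aggregates at each level look like this:

-- Totals per item, per category, and a grand total: the relational
-- equivalent of the aggregates a cube pre-computes along the hierarchy
SELECT p.Category,
       p.Item,
       SUM(f.SalesAmount) AS SalesAmount
FROM dbo.FactSales AS f
JOIN dbo.DimProduct AS p
  ON p.ProductSK = f.ProductSK
GROUP BY ROLLUP (p.Category, p.Item);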
Let's look at a few other terms that we run into while talking about data warehouses.
ODS or an Operational Data Store is a frequently misused term. There will be a few users in your organization who want to report on the most current data and can't afford to miss a single transaction in their report. Then there is another set of users who typically don't care how current the data is; mostly senior-level executives who are interested in trending, mining, forecasting, strategizing, etc. and don't care about one specific transaction. This is where an ODS comes in handy. An ODS can use the same star schema and OLAP cubes we saw earlier. The only difference is that the data inside an ODS is short lived, i.e. kept for a few months, and the ODS syncs with the OLTP system every few minutes. The data warehouse can periodically sync with the ODS, either daily or weekly, depending on business drivers.
Data marts are another frequently talked about topic in data warehousing. They are subject-specific data warehouses. Data warehouses that try to span an entire enterprise are normally too big to scope, build, manage, track, etc. Hence they are often scaled down to something called a data mart that supports a specific segment of business like sales, marketing, or support. Data marts, too, are often designed using the star schema model discussed earlier. The industry is divided when it comes to the use of data marts. Some experts prefer having data marts along with a central data warehouse: the data warehouse acts as an information staging and distribution hub, with the spokes being data marts connected via data feeds serving summarized data. Others eliminate the need for a centralized data warehouse, citing that most users want to report on detailed data.
I am very excited that the Fast Track Data Warehouse 3.0 reference guide has been announced. As a consultant I have always enjoyed working on Fast Track Data Warehouse projects, as they truly express the potential of the SQL Server engine. Here are a few details of the enhancements in the Fast Track Data Warehouse 3.0 reference architecture.
The SQL Server Fast Track Data Warehouse initiative provides a basic methodology and concrete examples for the deployment of balanced hardware and database configuration for a data warehousing workload. Balance is measured across the key components of a SQL Server installation; storage, server, application settings, and configuration settings for each component are evaluated.
FTDW 3.0 Architecture
Basic component architecture for FT 3.0 based systems.
New Memory Guidelines
Minimum and maximum tested memory configurations by server socket count.
Additional Startup Options
Notes for trace flag -T834 and the setting for Lock Pages in Memory.
RAID1+0 now standard (RAID1 was used in FT 2.0).
Query provided for evaluating logical fragmentation (a rough sketch of the idea follows this list).
Additional options for CI table loads.
Additional detail and explanation of FTDW MCR Rating.
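The exact query ships with the reference guide; as a rough sketch of the same idea, logical fragmentation can be evaluated with the sys.dm_db_index_physical_stats DMF (run it in the warehouse database; the output below is for the current database):

-- Rough sketch: logical fragmentation and page counts for indexes in the current database
SELECT OBJECT_NAME(ps.object_id)        AS table_name,
       i.name                           AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;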
I am often reminded that BI / data warehousing infrastructure is very brittle and not very adaptive to change. There are lots of basic use cases where data needs to be frequently loaded into SQL Server or another database. What I have found is that, as long as the sources and targets stay the same, SSIS or any other ETL tool for that matter does a pretty good job handling these types of scenarios.
But what happens when you are faced with more challenging scenarios, where the data formats and possibly the data types of the source data are changing from customer to customer? Let’s examine a real life situation where a health management company receives claims data from their customers in various source formats. Even though this company supplied all their customers with the same claims forms, they ended up building one-off ETL applications to process the claims for each customer.
Why, you ask? Well, it turned out that the claims data from the various regional hospitals they needed to process had slightly different data formats, e.g. "integer" versus "string" data field definitions. Moreover, the data itself was represented with slight nuances, e.g. "0001124" or "1124" or "0000001124" to represent a particular account number, which forced them, as I alluded to above, to build new ETL processes for each customer in order to overcome the inconsistencies in the various claims forms. As a result, they experienced a lot of redundancy in these ETL processes and recognized quickly that their system would become more difficult to maintain over time.
So imagine for a moment that you could use an ETL tool that helps you abstract the data formats so that your ETL transformation process becomes more reusable. Imagine that one claims form represents a data item as a string – acc_no(varchar) – while a second claims form represents the same data item as an integer – account_no(integer). This would break your traditional ETL process as the data mappings are hard-wired. But in a world of abstracted definitions, all you need to do is create parallel data mappings to a common data representation used within your ETL application; that is, map both external data fields to a common attribute whose name and type remain unchanged within the application.
acc_no(varchar) is mapped to account_number(integer)
expressor Studio first claim form schema mapping
account_no(integer) is also mapped to account_number(integer)
expressor Studio second claim form schema mapping
All the data processing logic that follows manipulates the data as an integer value named account_number.
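Outside of expressor, the same idea can be sketched in plain T-SQL (the staging table and column names below are hypothetical): both source representations are converted to the one canonical attribute before any downstream logic runs.

-- Normalize two differently typed source feeds to a single canonical column
SELECT CAST(acc_no AS INT) AS account_number,  -- stored as varchar, e.g. '0001124'
       claim_amount
FROM staging.ClaimsFeedA
UNION ALL
SELECT account_no AS account_number,           -- already an integer
       claim_amount
FROM staging.ClaimsFeedB;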
Well, these are the kind of problems that the expressor data integration solution automates for you. I've been following them since last year and encourage you to check them out by downloading their free expressor Studio ETL software.