SQL SERVER – Guest Post – Architecting Data Warehouse – Niraj Bhatt

Niraj Bhatt works as an Enterprise Architect for a Fortune 500 company and has an innate passion for building and studying software systems. He is a top-rated speaker at various technical forums, including Tech·Ed, MCT Summit, Developer Summit, and Virtual Tech Days. Having run a successful startup for four years, Niraj enjoys working on IT innovations that can impact an enterprise's bottom line, streamlining IT budgets through IT consolidation, architecture and integration of systems, performance tuning, and reviews of enterprise applications. He has received the Microsoft MVP award for ASP.NET, Connected Systems, and most recently Windows Azure. When he is away from his laptop, you will find him taking deep dives into automobiles, pottery, rafting, photography, cooking, and financial statements, though not necessarily in that order. He is also a manager and speaker at BDOTNET, Asia's largest .NET user group.

Here is the guest post by Niraj Bhatt.

As data in your applications grows, it is usually the database that becomes the bottleneck. It is hard to scale a relational database, so the preferred approach for large-scale applications is to create separate databases for writes and reads, referred to as the transactional database and the reporting database. Though there are tools and techniques that allow you to create a snapshot of your transactional database for reporting purposes, sometimes they don't quite fit the reporting requirements of an enterprise. These requirements typically include data analytics, an effective schema (so an information worker can self-service), historical data, and better performance (flat data, no joins). This is where the need for a data warehouse or an OLAP system arises.

A key point to remember is that a data warehouse is mostly a relational database. It is built on top of the same concepts: tables, rows, columns, primary keys, foreign keys, and so on. Before we talk about how data warehouses are typically structured, let's understand the key components that create a data flow between OLTP systems and OLAP systems. There are three major areas to it:

a) The OLTP system should be capable of tracking its changes, as all these changes should go back to the data warehouse for historical recording. For example, if an OLTP transaction moves a customer from the silver to the gold category, the OLTP system needs to ensure that this change is tracked and sent to the data warehouse for reporting purposes. A report in this context could be how many customers, divided by geography, moved from the silver to the gold category. In data warehouse terminology this process is called Change Data Capture. Quite a few systems leverage database triggers to move these changes to corresponding tracking tables. There are also out-of-the-box features provided by some databases; for example, SQL Server 2008 offers Change Data Capture and Change Tracking for addressing such requirements.
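As an illustration of that out-of-the-box route, enabling Change Data Capture in SQL Server 2008 takes only a couple of calls; the schema and table names below are hypothetical.

-- Enable CDC at the database level (requires sysadmin)
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for a hypothetical Customer table; SQL Server then creates
-- a change table and table-valued functions for reading those changes
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Customer',
    @role_name     = NULL;  -- NULL means no gating role is required to read changes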

b) After we make the OLTP system capable of tracking its changes, we need to provision a batch process that runs periodically, picks up these changes from the OLTP system, and loads them into the data warehouse. There are many tools out there that can help you fill this gap; SQL Server Integration Services happens to be one of them.
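A minimal sketch of such a periodic pull, reading changes through the functions SQL Server generates for the hypothetical dbo_Customer capture instance above (a real job would persist the last LSN it processed):

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Customer');  -- or the LSN saved by the last run
SET @to_lsn   = sys.fn_cdc_get_max_lsn();

-- All changes captured for dbo.Customer in that LSN range,
-- ready to be transformed and loaded into the warehouse
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');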

c) So we have an OLTP system that knows how to track its changes, and we have jobs that run periodically to move these changes to the warehouse. The question that remains is how the warehouse will record these changes. This structural concern in the data warehouse arena is usually covered under something called a Slowly Changing Dimension (SCD). While we will talk about dimensions in a while, SCD can be applied to pure relational tables too. SCD enables a database structure to capture historical data. This creates multiple records for a given entity, which is why data warehouses prefer having their own primary key, often known as a surrogate key.
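A sketch of what a Type 2 slowly changing dimension for the customer example could look like; the table and column names are purely illustrative.

CREATE TABLE dbo.DimCustomer
(
    CustomerKey     int IDENTITY(1,1) PRIMARY KEY, -- surrogate key owned by the warehouse
    CustomerID      int           NOT NULL,        -- natural (business) key from the OLTP system
    CustomerName    nvarchar(100) NOT NULL,
    LoyaltyCategory nvarchar(20)  NOT NULL,        -- e.g. Silver, Gold
    EffectiveFrom   date          NOT NULL,
    EffectiveTo     date          NULL,            -- NULL marks the current version
    IsCurrent       bit           NOT NULL DEFAULT (1)
);
-- Moving a customer from Silver to Gold closes the current row (sets EffectiveTo and IsCurrent = 0)
-- and inserts a new row, so history is preserved rather than overwritten.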

As I mentioned, a data warehouse is just a relational database, but the industry often attributes a specific schema style to data warehouses. These styles are the Star schema and the Snowflake schema. The motivation behind these styles is to create a flat database structure (as opposed to a normalized one) that is easy to understand and use, easy to query, and easy to slice and dice. A Star schema is a database structure made up of dimensions and facts. Facts are generally the numbers (sales, quantity, etc.) that you want to slice and dice. Fact tables hold these numbers and have references (foreign keys) to a set of tables that provide context around those facts. For example, if you have recorded 10,000 USD in sales, that number would go into a sales fact table and could have foreign keys attached to it referring to the sales agent responsible for the sale and to a time table that contains the dates between which that sale was made. These agent and time tables are called dimensions, and they provide context to the numbers stored in the fact tables. This structure, with the fact at the center surrounded by dimensions, is called a Star schema. A similar structure, with the difference that the dimension tables are normalized, is called a Snowflake schema.
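To make that concrete, a minimal star schema for the sales example might be sketched as follows; the table and column names are illustrative, not part of any standard.

CREATE TABLE dbo.DimSalesAgent
(
    SalesAgentKey int IDENTITY(1,1) PRIMARY KEY,
    AgentName     nvarchar(100) NOT NULL,
    Region        nvarchar(50)  NOT NULL
);

CREATE TABLE dbo.DimDate
(
    DateKey  int  PRIMARY KEY,      -- e.g. 20110315
    FullDate date NOT NULL,
    [Month]  tinyint  NOT NULL,
    [Year]   smallint NOT NULL
);

CREATE TABLE dbo.FactSales
(
    SalesAgentKey int   NOT NULL REFERENCES dbo.DimSalesAgent (SalesAgentKey),
    DateKey       int   NOT NULL REFERENCES dbo.DimDate (DateKey),
    SalesAmount   money NOT NULL   -- the number to slice and dice, e.g. 10,000 USD
);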

This relational structure of facts and dimensions serves as an input for another analysis structure called a Cube. Though physically a cube is a special structure supported by commercial products like SQL Server Analysis Services, logically it is a multidimensional structure where dimensions define the sides of the cube and facts define its content. Facts are often called Measures inside a cube. Dimensions often form a hierarchy; for example, Product may be broken into categories, and categories in turn into individual items. Category and Item are referred to as Levels, their constituents as Members, and the overall structure as a Hierarchy. Measures are rolled up as per the dimensional hierarchy, and these rolled-up measures are called Aggregates. This may seem like an overwhelming vocabulary to deal with, but don't worry; it will sink in as you start working with cubes.
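The roll-up itself normally happens inside the cube, but the idea is easy to see even in plain T-SQL; a hypothetical query over a flattened sales view could roll measures up the Category > Item hierarchy like this:

-- Produces aggregates at the Item level, at the Category level, and a grand total
SELECT Category, Item, SUM(SalesAmount) AS TotalSales
FROM dbo.vSalesByProduct           -- hypothetical view joining the fact and product dimension
GROUP BY ROLLUP (Category, Item);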

Let's look at a few other terms we run into while talking about data warehouses.

ODS, or an Operational Data Store, is a frequently misused term. There will be a few users in your organization who want to report on the most current data and cannot afford to miss a single transaction in their reports. Then there is another set of users, typically senior-level executives interested in trending, mining, forecasting, strategizing, and so on, who don't care how current the data is or about that one specific transaction. This is where an ODS comes in handy. An ODS can use the same star schema and OLAP cubes we saw earlier. The only difference is that the data inside an ODS is short-lived, i.e. kept for a few months, and the ODS syncs with the OLTP system every few minutes. The data warehouse can then sync with the ODS periodically, daily or weekly, depending on business drivers.

Data marts are another frequently discussed topic in data warehousing. They are subject-specific data warehouses. Data warehouses that try to span an entire enterprise are normally too big to scope, build, manage, and track, so they are often scaled down to something called a data mart that supports a specific segment of the business, such as sales, marketing, or support. Data marts, too, are often designed using the star schema model discussed earlier. The industry is divided when it comes to the use of data marts. Some experts prefer having data marts along with a central data warehouse; the data warehouse then acts as an information staging and distribution hub, with the spokes being data marts connected via data feeds that serve summarized data. Others eliminate the need for a centralized data warehouse, citing that most users want to report on detailed data.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQLAuthority News – Fast Track Data Warehouse 3.0 Reference Guide

http://msdn.microsoft.com/en-us/library/gg605238.aspx

I am very excited that the Fast Track Data Warehouse 3.0 reference guide has been announced. As a consultant, I have always enjoyed working on Fast Track Data Warehouse projects, as they truly express the potential of the SQL Server engine. Here are a few details of the enhancements in the Fast Track Data Warehouse 3.0 reference architecture.

The SQL Server Fast Track Data Warehouse initiative provides a basic methodology and concrete examples for the deployment of a balanced hardware and database configuration for a data warehousing workload. Balance is measured across the key components of a SQL Server installation (storage, server, and application settings), and the configuration settings for each component are evaluated.

• FTDW 3.0 Architecture: Basic component architecture for FT 3.0 based systems.
• New Memory Guidelines: Minimum and maximum tested memory configurations by server socket count.
• Additional Startup Options: Notes for T-834 and the Lock Pages in Memory setting.
• Storage Configuration: RAID1+0 is now standard (RAID1 was used in FT 2.0).
• Evaluating Fragmentation: A query is provided for evaluating logical fragmentation (an illustrative example follows this list).
• Loading Data: Additional options for clustered index (CI) table loads.
• MCR: Additional detail and explanation of the FTDW MCR Rating.
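The actual fragmentation query ships with the reference guide; purely as an illustration of the kind of check it performs, logical fragmentation can be read from the sys.dm_db_index_physical_stats dynamic management function, for example:

-- Logical fragmentation of heaps and clustered indexes in the current database
SELECT OBJECT_NAME(ps.object_id) AS table_name,
       ps.index_id,
       ps.avg_fragmentation_in_percent,
       ps.fragment_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
WHERE ps.index_id IN (0, 1);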

Read the white paper on Fast Track data warehousing.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

 

SQL SERVER – Introduction to Adaptive ETL Tool – How adaptive is your ETL?

I am often reminded of the fact that BI/data warehousing infrastructure is very brittle and not very adaptive to change. There are lots of basic use cases where data needs to be loaded frequently into SQL Server or another database. What I have found is that as long as the sources and targets stay the same, SSIS, or any other ETL tool for that matter, does a pretty good job handling these types of scenarios.

But what happens when you are faced with more challenging scenarios, where the data formats and possibly the data types of the source data change from customer to customer? Let's examine a real-life situation where a health management company receives claims data from its customers in various source formats. Even though this company supplied all its customers with the same claims forms, it ended up building one-off ETL applications to process the claims for each customer.

Why, you ask? Well, it turned out that the claims data from the various regional hospitals they needed to process had slightly different data formats, e.g. "integer" versus "string" data field definitions. Moreover, the data itself was represented with slight nuances, e.g. "0001124" or "1124" or "0000001124" to represent a particular account number, which forced them, as I alluded to above, to build new ETL processes for each customer in order to overcome the inconsistencies in the various claims forms. As a result, they experienced a lot of redundancy in these ETL processes and quickly recognized that their system would become more difficult to maintain over time.

So imagine for a moment that you could use an ETL tool that helps you abstract the data formats so that your ETL transformation process becomes more reusable. Imagine that one claims form represents a data item as a string – acc_no(varchar) – while a second claims form represents the same data item as an integer – account_no(integer). This would break your traditional ETL process as the data mappings are hard-wired.  But in a world of abstracted definitions, all you need to do is create parallel data mappings to a common data representation used within your ETL application; that is, map both external data fields to a common attribute whose name and type remain unchanged within the application.

acc_no(varchar) is mapped to account_number(integer)

expressor Studio first claim form schema mapping

account_no(integer) is also mapped to account_number(integer)

expressor Studio second claim form schema mapping

All the data processing logic that follows manipulates the data as an integer value named account_number.
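Outside of a dedicated tool, the same common-representation idea can be sketched in plain T-SQL; the staging table and column names below are hypothetical:

-- Both '0001124' (varchar with leading zeros) and 1124 (integer) collapse
-- into the same common attribute, account_number(int)
SELECT CAST(acc_no AS int) AS account_number, claim_amount
FROM staging.ClaimsFormA           -- acc_no defined as varchar
UNION ALL
SELECT account_no AS account_number, claim_amount
FROM staging.ClaimsFormB;          -- account_no already defined as int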

Well, these are the kinds of problems that the expressor data integration solution automates for you. I've been following them since last year and encourage you to check them out by downloading their free expressor Studio ETL software.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – Fast Track Data Warehouse for SQL Server 2008

I recently attended a wonderful training session organized by Microsoft on Fast Track Data Warehouse reference architectures. If you are a regular reader of my blog, you will be well aware of the fact that I am more of a relational guy than a Business Intelligence professional. I was initially a bit skeptical about this training. However, once I started learning about it, to my surprise, I realized that I was the perfect guy to learn this. In fact, a few of the tricks this course suggests had already been implemented in my earlier consulting assignments.

Fast Track Data Warehouse is a unique effort by Microsoft in which a set of reference architectures offering scalability and reliability for huge databases are proposed. The Fast Track datasheet provides some excellent points about this reference architecture; I am listing a few of them here.

  • Scale from 4 up to 48 Terabytes using compression capabilities in SQL Server 2008 Enterprise
  • Choose from industry-standard hardware from Dell, HP, Bull, IBM, EMC and other leading vendors
  • Pre-configured servers, storage and networking, specifically balanced and optimized for warehousing
  • Implement an enterprise-class solution for less than one-third the price of a comparable Oracle system
  • Fast Track Data Warehouse has a primary emphasis on eliminating potential performance bottlenecks.
  • Optimized for sequential IO rather than random IO, each Fast Track Data Warehouse is designed to provide up to 200 MB/s per CPU core

The Fast Track approach derives its strength from the advanced data warehouse enhancements included in SQL Server 2008 Enterprise, such as compression, parallel partitioning, and star join query optimization.
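As a small example of the compression piece, in SQL Server 2008 Enterprise a large fact table can be rebuilt with page compression in a single statement; the table name here is illustrative:

-- Rebuild the fact table with page compression to cut the I/O of large sequential scans
ALTER TABLE dbo.FactSales
REBUILD WITH (DATA_COMPRESSION = PAGE);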

Well, I think it is a great effort by Microsoft, and I must express deep gratitude to Microsoft for giving me the opportunity to learn about this unique initiative. I clearly see that in the near future lots of data warehousing solutions will be switching to this approach. You can read more about this reference architecture over here. The image used in this blog post is taken from the official Microsoft Fast Track site.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQLAuthority News – Free eBook Download – Introducing Microsoft SQL Server 2008 R2

Microsoft Press has published a FREE eBook on the much-awaited release of SQL Server 2008 R2. The book is written by Ross Mistry (@rossmistry) and Stacia Misner. Ross is my personal friend and one of the most active book writers in the SQL Server domain. When I see his name on any book, I am sure that it will be a high-quality and easy-to-read book.

Ross has written many other books, of which I am a big fan as well.

The details about the book are here:

Introducing Microsoft SQL Server 2008 R2, by Ross Mistry and Stacia Misner

The book contains 10 chapters and 216 pages.

PART I   Database Administration

CHAPTER 1   SQL Server 2008 R2 Editions and Enhancements
CHAPTER 2   Multi-Server Administration
CHAPTER 3   Data-Tier Applications
CHAPTER 4   High Availability and Virtualization Enhancements
CHAPTER 5   Consolidation and Monitoring

PART II   Business Intelligence Development

CHAPTER 6   Scalable Data Warehousing
CHAPTER 7   Master Data Services
CHAPTER 8   Complex Event Processing with StreamInsight
CHAPTER 9   Reporting Services Enhancements
CHAPTER 10   Self-Service Analysis with PowerPivot

More details about the book are listed here.

You can download the ebook in XPS format here and in PDF format here.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – What is MDS? – Master Data Services in Microsoft SQL Server 2008 R2

What is MDS?

Master Data Services helps enterprises standardize the data people rely on to make critical business decisions. With Master Data Services, IT organizations can centrally manage critical data assets company-wide and across diverse systems, enable more people to securely manage master data directly, and ensure the integrity of information over time. (Source: Microsoft)

Today I will be talking about this same subject at Microsoft TechEd India. If you want to learn how to standardize your data and apply business rules to validate it, you must attend my session. MDS is a very interesting concept; I will cover it in a super short but very interesting set of 10 quick slides. I will make sure that in the very first 20 minutes you understand the following topics:

• Introduction to Master Data Management
  • What is Master Data and Challenges
  • MDM Challenges and Advantage
• Microsoft Master Data Services
  • Benefits and Key Features
  • Uses of MDS
  • Capabilities
• Key Features of MDS

The slide deck will be followed by around 30 minutes of demos telling the story of entities, hierarchies, versions, security, consolidation, and collections. I will tell this story keeping business rules at the center: we will take one business rule, a simple validation rule, and make it much more complex and yet very useful to the product.

I will also demonstrate a few real-life scenarios where I will be talking about MDS and its usage.

Do not miss this session. At the end of the session a book will be awarded to the best participant.

My session details:

Session: Master Data Services in Microsoft SQL Server 2008 R2
Date: April 12, 2010  Time: 2:30pm-3:30pm

SQL Server Master Data Services will ship with SQL Server 2008 R2 and will improve the appeal of Microsoft's platform. This session provides an in-depth demonstration of MDS features and highlights important usage scenarios. Master Data Services enables consistent decision making by allowing you to create, manage, and propagate changes from a single master view of your business entities. The Master Data hub, a vital component of MDS, helps ensure reporting consistency across systems and delivers faster, more accurate results across the enterprise. We will talk about establishing the basis for a centralized approach to defining, deploying, and managing master data in the enterprise.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQLAuthority News – Interesting Whitepaper – We Loaded 1TB in 30 Minutes with SSIS, and So Can You

We Loaded 1TB in 30 Minutes with SSIS, and So Can You
SQL Server Technical Article
Writers: Len Wyatt, Tim Shea, David Powell
Published: March 2009

In February 2008, Microsoft announced a record-breaking data load using Microsoft SQL Server Integration Services (SSIS): 1 TB of data in less than 30 minutes. That data load, using SQL Server Integration Services, was 30% faster than the previous best time using a commercial ETL tool. This paper outlines what it took: the software, hardware, and configuration used. We will describe what we did to achieve that result, and offer suggestions for how to relate these techniques to typical scenarios. Even for customers who don't have needs quite like this benchmark, such efforts can teach a lot about getting optimal performance.

Read the white paper here.

Abstract courtesy of Microsoft.

Reference: Pinal Dave (http://blog.SQLAuthority.com)