Niraj Bhatt works as an Enterprise Architect for a Fortune 500 company and has an innate passion for building and studying software systems. He is a top-rated speaker at various technical forums including Tech·Ed, MCT Summit, Developer Summit, and Virtual Tech Days, among others. Having run a successful startup for four years, Niraj enjoys working on IT innovations that can impact an enterprise's bottom line, streamlining IT budgets through IT consolidation, architecture and integration of systems, performance tuning, and review of enterprise applications. He has received the Microsoft MVP award for ASP.NET, Connected Systems, and most recently Windows Azure. When he is away from his laptop, you will find him taking deep dives into automobiles, pottery, rafting, photography, cooking, and financial statements, though not necessarily in that order. He is also a manager/speaker at BDOTNET, Asia's largest .NET user group.
Here is the guest post by Niraj Bhatt.
As the data in your applications grows, it is usually the database that becomes the bottleneck. It is hard to scale a relational DB, so the preferred approach for large-scale applications is to create separate databases for writes and reads. These databases are referred to as the transactional database and the reporting database. Though there are tools and techniques that let you create a snapshot of your transactional database for reporting purposes, sometimes they don't quite fit the reporting requirements of an enterprise. These requirements typically include data analytics, an effective schema (so an information worker can self-service), historical data, and better performance (flat data, no joins). This is where the need for a data warehouse, or an OLAP system, arises.
A key point to remember is that a data warehouse is mostly a relational database. It is built on top of the same concepts: tables, rows, columns, primary keys, foreign keys, etc. Before we talk about how data warehouses are typically structured, let's understand the key components that create a data flow between OLTP systems and OLAP systems. There are three major areas to it:
a) The OLTP system should be capable of tracking its changes, as all these changes should flow back to the data warehouse for historical recording. For example, if an OLTP transaction moves a customer from the silver to the gold category, the OLTP system needs to ensure that this change is tracked and sent to the data warehouse for reporting purposes. A report in this context could be how many customers, divided by geography, moved from the silver to the gold category. In data warehouse terminology this process is called Change Data Capture. Quite a few systems leverage database triggers to move these changes to corresponding tracking tables. Some databases also provide out-of-the-box features; for example, SQL Server 2008 offers Change Data Capture and Change Tracking for addressing such requirements.
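As a minimal sketch of the SQL Server 2008 feature just mentioned, here is how Change Data Capture could be switched on (the SalesDB database and dbo.Customer table are illustrative names, not from the article):

USE SalesDB;
GO
-- enable CDC at the database level
EXEC sys.sp_cdc_enable_db;
GO
-- start capturing changes made to the Customer table
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Customer',
    @role_name     = NULL; -- NULL = no gating role required to read changes
GO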
b) After we make the OLTP system capable of tracking its changes, we need to provision a batch process that runs periodically, takes these changes from the OLTP system, and loads them into the data warehouse. There are many tools out there that can help you fill this gap; SQL Server Integration Services happens to be one of them.
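Sticking with the illustrative CDC setup above, a periodic extract job (hand-rolled or wrapped in an SSIS package) could pull only the rows that changed since its last run. A sketch:

-- determine the LSN window to read (here: everything captured so far)
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Customer');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
-- fetch the inserts, updates, and deletes in that window for the warehouse load
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');

A real job would persist the last LSN it processed and use that as @from_lsn on the next run.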
c) So we have an OLTP system that knows how to track its changes, and we have jobs that run periodically to move these changes to the warehouse. The question that remains is how the warehouse will record these changes. In the data warehouse arena, this structural concern is usually covered under something called Slowly Changing Dimensions (SCD). While we will talk about dimensions in a while, SCD can be applied to pure relational tables too. SCD enables a database structure to capture historical data. Since this creates multiple records for a given entity, data warehouses prefer to have their own primary key, often known as a surrogate key.
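For instance, a Type 2 SCD keeps one row per version of an entity, so the silver-to-gold move described earlier ends up as two rows distinguished by a surrogate key. A minimal sketch with illustrative names:

CREATE TABLE DimCustomer (
    CustomerKey   INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key
    CustomerID    INT          NOT NULL,         -- business key from the OLTP system
    CustomerName  VARCHAR(100) NOT NULL,
    Category      VARCHAR(10)  NOT NULL,         -- 'Silver', 'Gold', ...
    EffectiveFrom DATETIME     NOT NULL,
    EffectiveTo   DATETIME     NULL              -- NULL marks the current row
);
-- moving customer 42 from silver to gold closes the old row and opens a new one
UPDATE DimCustomer SET EffectiveTo = GETDATE()
WHERE  CustomerID = 42 AND EffectiveTo IS NULL;
INSERT DimCustomer (CustomerID, CustomerName, Category, EffectiveFrom)
VALUES (42, 'John Doe', 'Gold', GETDATE());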
As I mentioned, a data warehouse is just a relational database, but the industry often attributes a specific schema style to data warehouses. These styles are the Star schema and the Snowflake schema. The motivation behind these styles is to create a flat database structure (as opposed to a normalized one) which is easy to understand and use, easy to query, and easy to slice and dice. A Star schema is a database structure made up of dimensions and facts. Facts are generally the numbers (sales, quantity, etc.) that you want to slice and dice. Fact tables hold these numbers and have references (foreign keys) to a set of tables that provide context around those facts. For example, if you have recorded 10,000 USD in sales, that number would go into a sales fact table and could have foreign keys attached to it that refer to the sales agent responsible for the sale and to a time table containing the dates between which that sale was made. These agent and time tables are called dimensions, and they provide context to the numbers stored in fact tables. This structure of a fact table at the center surrounded by dimensions is called a Star schema. A similar structure in which the dimension tables are normalized is called a Snowflake schema.
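A minimal star schema for that sales example could look like this (all names are illustrative):

CREATE TABLE DimAgent (
    AgentKey  INT IDENTITY(1,1) PRIMARY KEY,
    AgentName VARCHAR(100) NOT NULL
);
CREATE TABLE DimDate (
    DateKey  INT PRIMARY KEY, -- e.g. 20100315
    FullDate DATE NOT NULL
);
CREATE TABLE FactSales (
    AgentKey    INT   NOT NULL REFERENCES DimAgent (AgentKey),
    DateKey     INT   NOT NULL REFERENCES DimDate (DateKey),
    SalesAmount MONEY NOT NULL -- the 10,000 USD goes here
);

In the snowflake variant, DimAgent itself would be normalized further, for example referencing a separate region table instead of carrying region columns inline.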
This relational structure of facts and dimensions serves as the input for another analysis structure called a Cube. Though physically a Cube is a special structure supported by commercial products like SQL Server Analysis Services, logically it is a multidimensional structure where dimensions define the sides of the cube and facts define its content. Facts are often called Measures inside a cube. Dimensions often tend to form a hierarchy. For example, Product may be broken into categories, and categories in turn into individual items. Category and Item are referred to as Levels, their constituents as Members, and the overall structure as a Hierarchy. Measures are rolled up as per the dimensional hierarchy, and these rolled-up measures are called Aggregates. This may seem like an overwhelming vocabulary to deal with, but don't worry, it will sink in as you start working with cubes.
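A cube precomputes such aggregates, but plain T-SQL can illustrate the idea of measures being rolled up along a Category > Item hierarchy. In this sketch, DimProduct and its ProductKey link to FactSales are assumed purely for illustration:

SELECT   p.Category, p.Item, SUM(f.SalesAmount) AS SalesAmount
FROM     FactSales f
JOIN     DimProduct p ON p.ProductKey = f.ProductKey
GROUP BY ROLLUP (p.Category, p.Item); -- per item, per category, and grand total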
Let's look at a few other terms that we run into while talking about data warehouses.
ODS, or Operational Data Store, is a frequently misused term. There will be a few users in your organization who want to report on the most current data and can't afford to miss a single transaction in their report. Then there is another set of users who typically don't care how current the data is: mostly senior-level executives who are interested in trending, mining, forecasting, strategizing, etc. and don't care about one specific transaction. This is where an ODS comes in handy. An ODS can use the same star schema and OLAP cubes we saw earlier. The only difference is that the data inside an ODS is short-lived, kept for a few months, and the ODS syncs with the OLTP system every few minutes. The data warehouse can periodically sync with the ODS, either daily or weekly, depending on business drivers.
Data marts are another frequently discussed topic in data warehousing. They are subject-specific data warehouses. Data warehouses that try to span an entire enterprise are normally too big to scope, build, manage, track, etc. Hence they are often scaled down to something called a data mart that supports a specific segment of the business like sales, marketing, or support. Data marts, too, are often designed using the star schema model discussed earlier. The industry is divided when it comes to the use of data marts. Some experts prefer having data marts along with a central data warehouse; the data warehouse then acts as an information staging and distribution hub, with the spokes being data marts connected via data feeds serving summarized data. Others eliminate the need for a centralized data warehouse, citing that most users want to report on detailed data.
Reference: Pinal Dave (https://blog.sqlauthority.com)
10 Comments
Great post, very useful for people who are interested in DW like me. Thanks a lot. Can you give me some more detailed information about Change Data Capture?
I need details on the Operational Data Store too.
Thanks a lot.
I have a situation with these tables:
customer
vehicle
service
sale
I have to load all the corresponding data always. So if any change occurs in any one of the tables, how would I detect the changes and load the corresponding four tables' data using CDC?
Very good, detailed article. Good explanation of the concept of an ODS and how to sync the ODS with OLTP/OLAP.
Thank you Niraj.
Enable proactive caching mode with ROLAP. Now your server can listen for data change notifications and update dimensions and measures in real time, just like an autopilot mode.
Hi Pinal,
For the last 2 weeks I have been stuck with the issue of not being able to retrieve column information in the OLE DB source task of the data flow in SSIS.
I am using Microsoft SQL Server 2008. I tried all the solutions available on different SSIS forums, but every time I get the same error:
Error at Data Flow Task [OLEDB source [449]]: No column information was returned by the SQL command
I am using the following batch of SQL statements to retrieve the server-level configuration of all the servers in my company. The table variable @tb1_SvrStng has 83 columns and is populated from different sources.
I have summarized the SQL script below. I cannot use it as a stored procedure because this script is going to run against 14 servers (once for each server), so if I store the procedure on one server, the other servers cannot execute that procedure in their context.
Please help me solve my problem. I will highly appreciate your help. I am not using any temporary tables in my script.
declare @tb1_SvrStng table
(
srvProp_MachineName varchar(50),
srvProp_BldClrVer varchar(50),
srvProp_Collation varchar(50),
srvProp_CNPNB varchar(100),
…
xpmsver_ProdVer varchar(50),
….. .
syscnfg_UsrCon_cnfgVal int,
…..
);
insert into @tb1_SvrStng
(
srvProp_BldClrVer,
srvProp_Collation,
srvProp_CNPNB , ……..
…….. .
)
select convert(varchar, serverproperty('BuildClrVer')),
convert(varchar, serverproperty('Collation'))
……..
…….
declare @temp_msver1 table
(
id int, name varchar(100),
………..
);
insert into @temp_msver1 exec xp_msver
Update @tb1_SvrStng
set xpmsver_ProdVer =
(
select value from @temp_msver1 where name = 'ProductVersion'
),
xpmsver_Platform =
(
select value from @temp_msver1 where name = 'Platform'
),
…..
……
select
srvProp_SerName as srvProp_SerName,
getdate() as reportDateTime,
srvProp_BldClrVer as srvProp_BldClrVer,
srvProp_Collation as srvProp_Collation,
…..
…..
from @tb1_SvrStng
I will highly appreciate your help.
Thanks
Jasdeep
Jasdeep,
Is your issue resolved?
Regards.
Krishna.
Hello Niraj,
Thanks for the nice article. I also want one suggestion.
I have been given a task to merge three databases into one. We have three versions of the product in three databases, and I am planning to merge these three databases and partition the database on the basis of version. This will again create three ndf files.
But..
Can I use the concept of a data mart instead of data warehousing? I want to create one database on top of the three databases, and this newly created database can act as an interface for the three databases.
This parent database should present one logical database containing these three version databases.
Example: I have tableA in all three databases; when I query the parent database (select * from tableA), it should show a UNION ALL of the three tableA tables with a version column (like CMS).
Each tableA contains 120 million rows, and I also want good performance.
In other words, my mdf file can act as the equivalent of an ndf file. I hope you can give a solution for this.
Thanks in advance,
Naveen
Nice write-up; you have good knowledge of data warehousing and its various related traits. It was truly a knowledge booster.
Hi Pinal,
Are transactional reports taken only from the ODS and not from the DWH? If so, why can't we pull transactional reports from the DWH?