SQL SERVER – Advanced Data Quality Services with Melissa Data – Azure Data Market

There has been much fanfare over the new SQL Server 2012, and especially around its new companion product Data Quality Services (DQS). Among the many new features is the addition of this integrated knowledge-driven product that enables data stewards everywhere to profile, match, and cleanse data. In addition to the homegrown rules that data stewards can design and implement, there are also connectors to third party providers that are hosted in the Azure Datamarket marketplace.  In this review, I leverage SQL Server 2012 Data Quality Services, and proceed to subscribe to a third party data cleansing product through the Datamarket to showcase this unique capability.

Crucial Questions

For the purposes of the review, I used a database I had in an Excel spreadsheet with name and address information. Upon a cursory inspection, there are miscellaneous problems with these records; some addresses are missing ZIP codes, others missing a city, and some records are slightly misspelled or have unparsed suites. With DQS, I can easily add a knowledge base to help standardize my values, such as for state abbreviations. But how do I know that my address is correct? And if my address is not correct, what should it be corrected to? The answer lies in a third party knowledge base by the acknowledged USPS certified address accuracy experts at Melissa Data.

Reference Data Services

Within DQS there is a handy feature to actually add reference data from many different third-party Reference Data Services (RDS) vendors. DQS simplifies the processes of cleansing, standardizing, and enriching data through custom rules and through service providers from the Azure Datamarket. A quick jump over to the Datamarket site shows me that there are a handful of providers that offer data directly through Data Quality Services. Upon subscribing to these services, one can attach a DQS domain or composite domain (fields in a record) to a reference data service provider, and begin using it to cleanse, standardize, and enrich that data. Besides what I am looking for (address correction and enrichment), it is possible to subscribe to a host of other services including geocoding, IP address reference, phone checking and enrichment, as well as name parsing, standardization, and genderization.  These capabilities extend the data quality that DQS has natively by quite a bit.

For my current address correction review, I needed to first sign up to a reference data provider on the Azure Data Market site. For this example, I used Melissa Data’s Address Check Service. They offer free one-month trials, so if you wish to follow along, or need to add address quality to your own data, I encourage you to sign up with them.

Once I subscribed to the desired Reference Data Provider, I navigated my browser to the Account Keys within My Account to view the generated account key, which I then inserted into the DQS Client – Configuration under the Administration area.

Step by Step to Guide

That was all it took to hook in the subscribed provider -Melissa Data- directly to my DQS Client. The next step was for me to attach and map in my Reference Data from the newly acquired reference data provider, to a domain in my knowledge base.

On the DQS Client home screen, I selected “New Knowledge Base” under Knowledge Base Management on the left-hand side of the home screen.

Under New Knowledge Base, I typed a Name and description of my new knowledge base, then proceeded to the Domain Management screen.

Here I established a series of domains (fields) and then linked them all together as a composite domain (record set). Using the Create Domain button, I created the following domains according to the fields in my incoming data:

  1. Name
  2. Address
  3. Suite
  4. City
  5. State
  6. Zip

I added a Suite column in my domain because Melissa Data has the ability to return missing Suites based on last name or company. And that’s a great benefit of using these third party providers, as they have data that the data steward would not normally have access to. The bottom line is, with these third party data providers, I can actually improve my data.

Next, I created a composite domain (fulladdress) and added the (field) domains into the composite domain. This essentially groups our address fields together in a record to facilitate the full address cleansing they perform.

I then selected my newly created composite domain and under the Reference Data tab, added my third party reference data provider –Melissa Data’s Address Check- and mapped in each domain that I had to the provider’s Schema.

Now that my composite domain has been married to the Reference Data service, I can take the newly published knowledge base and create a project to cleanse and enrich my data.

My next task was to create a new Data Quality project, mapping in my data source and matching it to the appropriate domain column, and then kick off the verification process. It took just a few minutes with some progress indicators indicating that it was working.

When the process concluded, there was a helpful set of tabs that place the response records into categories: suggested; new; invalid; corrected (automatically); and correct. Accepting the suggestions provided by  Melissa Data allowed me to clean up all the records and flag the invalid ones. It is very apparent that DQS makes address data quality simplistic for any IT professional.

Final Note

As I have shown, DQS makes data quality very easy. Within minutes I was able to set up a data cleansing and enrichment routine within my data quality project, and ensure that my address data was clean, verified, and standardized against real reference data. As reviewed here, it’s easy to see how both SQL Server 2012 and DQS work to take what used to require a highly skilled developer, and empower an average business or database person to consume external services and clean data.

Reference: Pinal Dave (http://blog.sqlauthority.com)

About these ads

SQL SERVER – Why Do We Need Data Quality Services – Importance and Significance of Data Quality Services (DQS)

Databases are awesome.  I’m sure my readers know my opinion about this – I have made SQL Server my life’s work after all!  I love technology and all things computer-related.  Of course, even with my love for technology, I have to admit that it has its limits.  For example, it takes a human brain to notice that data has been input incorrectly.  Computer “brains” might be faster than humans, but human brains are still better at pattern recognition.  For example, a human brain will notice that “300” is a ridiculous age for a human to be, but to a computer it is just a number.  A human will also notice similarities between “P. Dave” and “Pinal Dave,” but this would stump most computers.

In a database, these sorts of anomalies are incredibly important.  Databases are often used by multiple people who rely on this data to be true and accurate, so data quality is key.  That is why the improved SQL Server features Master Data Management talks about Data Quality Services.  This service has the ability to recognize and flag anomalies like out of range numbers and similarities between data.  This allows a human brain with its pattern recognition abilities to double-check and ensure that P. Dave is the same as Pinal Dave.

A nice feature of Data Quality Services is that once you set the rules for the program to follow, it will not only keep your data organized in the future, but go to the past and “fix up” any data that has already been entered.  It also allows you do combine data from multiple places and it will apply these rules across the board, so that you don’t have any weird issues that crop up when trying to fit a round peg into a square hole.

There are two parts of Data Quality Services that help you accomplish all these neat things.  The first part is DQL Server, which you can think of as the hardware component of the system.  It is installed on the side of (it needs to install separately after SQL Server is installed) SQL Server and runs quietly in the background, performing all its cleanup services.

DQS Client is the user interface that you can interact with to set the rules and check over your data.  There are three main aspects of Client: knowledge base management, data quality projects and administration.  Knowledge base management is the part of the system that allows you to set the rules, or program the “knowledge base,” so that your database is clean and consistent.

Data Quality projects are what run in the background and clean up the data that is already present.  The administration allows you to check out what DQS Client is doing, change rules, and generally oversee the entire process.  The whole process is user-friendly and a pleasure to use.  I highly recommend implementing Data Quality Services in your database.

Here are few of my blog posts which are related to Data Quality Services and I encourage you to try this out.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – Configuring Interactive Cleansing Suggestion Min Score for Suggestions in Data Quality Services (DQS) – Sensitivity of Suggestion

Earlier I talked about what kind of questions, I do not like when I get asked. Today we will go over the question which I like when I get asked the same. One of the reader practices various steps in my earlier blog post Step by Step Guide to Beginning Data Quality Services in SQL Server 2012 – Introduction to DQS. While reading the blog post he noticed that Data Quality Services is not providing very helpful suggestions. He wrote an email to me about it. Let us go over his email.

“Pinal,

I noticed in one of your images that DQS is not providing very helpful suggestions. First of all DQS should be able to make intelligent guesses and make the necessary correction by itself. If it cannot do the same, in that case, it should give us intelligent suggestions but in the image included here, I see the suggestions are not there as well.

Why is it so? Would you please tell me how to increase the numbers of suggestion? I do understand this may not be preferable solution in many case but all the business cases go on it depends. There are cases when the high sensitivity required and there are cases when higher sensitivities are not required. I would like to seek your help here.

–Sriram MD”

This is indeed a great question. I see that Sriram understands that every system is different and every application has a different need. I will not have to tell him this most important concept. The question is about how to change the sensitivity of suggestions for correction in DQS. Well, this option is available under the configuration tab in the DQS client.

Once you click on Configuration you will see the following screen. Click the Tab of General Settings. You will see the section of Interactive Cleansing. Under this second there is the first option of “Min score for suggestions”. As this is set to 0.7 every suggestion which matches 0.7 probabilities or higher probability are displayed under the suggestion tab.


You can see in the following image that there is no suggestion as the min score for suggestions is set to 0.7 and there is no record which qualifies to that much confidence.

Now let us change the value of Min Score for suggestion to 0.5.


The lower value increased the confidence of DQS to give further suggestion to values which are over 0.5. However, in our case the suggestions which it provides are also accurate. This may not be true for your sample. Every sample is different so you should manually review it before approving them.

I guess, this is a simple blog post to demonstrate how to change the confidence value for the suggestions which Data Quality Services provides. Use this feature with care and always tune it according to your datasets and record diversity.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

 

 

 

SQL SERVER – Unable to DELETE Project in Data Quality Projects (DQS)

Here is the email which made me write this blog post. When I write a blog post I write keeping in mind that if the developer is not familiar with the concept he will attempt this on the development server. If due to any reason you attempt it on any other server than your personal server, developer should make sure to have complete confidence on his own expertise and understand the risk behind it.  Well, let us read the email which I received. I have modified it a bit to remove information related to organizational and individual.

“I just read your blog post on Beginning DQS. I went ahead and followed every single screenshot and it worked fine. I was able to execute the DQS project successfully. However, the same blog post got me in trouble – a serious trouble.

After first successful deployment I went ahead and created a few of my own knowledge base and projects. I played around a bit and then decided to get back to real work. Now we had deployed DQS on production server only, so experiment on production server. Now, when I got back to my work, I forgot to close all the windows. My manager found the window open and have seen my test projects. He has asked me to delete my experiments immediately and have said words which I cannot write to you.

Here is the problem. I am not able to delete the project which I have created earlier. I am able to open it and play with it but the delete option is disabled and grayed out (see attached image). Now I believe there is nothing wrong with this project as it was just a test project. Would you please write to my manager that it is not harmful to leave that project there as it is? It is also not using any resources. I think he will believe you.”

As I said this kind of email makes me uncomfortable. I do not want someone to execute anything on production server. I often write notes and disclaimer on my post when something is dangerous to execute on production server. However, if someone is not expert with SQL Server and attempts something new on production server, I think the major issue is here with the person (admin) who gave new developer permission to production server. This has to be carefully avoided. Here was my response to the individual.

“I cannot write to your manager anything as he has not asked me anything. Honestly I believe he is correct in his behavior as you should have not executed anything on the production server without prior approval and testing on the development server. Any R&D must be done on local box or development box. I suggest you request your manager to prevent access to users who does not need access. If he is a good manager, he might have already implemented by now recent event.

I also see your screenshot. Here is the issue: While you were playing with project, you might have closed the project half the way, without completing it. Due to the same reason it is locked. You can open and continue from the same place where you have left the project. If you do not need the project any more. Right click on it, click on unlock the project. This will enable the DELETE option and now you can delete the project.

Next time, be safe out there. It may be dangerous to have admin access to production server when not needed.

I have yet not heard from him but I believe he will take my words positively.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – Step by Step Guide to Beginning Data Quality Services in SQL Server 2012 – Introduction to DQS

Data Quality Services is a very important concept of SQL Server. I have recently started to explore the same and I am really learning some good concepts. Here are two very important blog posts which one should go over before continuing this blog post.

This article is introduction to Data Quality Services for beginners. We will be using an Excel file

Click on the image to enlarge the it.

In the first article we learned to install DQS. In this article we will see how we can learn about building Knowledge Base and using it to help us identify the quality of the data as well help correct the bad quality of the data.

Here are the two very important steps we will be learning in this tutorial.

  • Building a New Knowledge Base 
  • Creating a New Data Quality Project

Let us start the building the Knowledge Base. Click on New Knowledge Base.

In our project we will be using the Excel as a knowledge base. Here is the Excel which we will be using. There are two columns. One is Colors and another is Shade. They are independent columns and not related to each other. The point which I am trying to show is that in Column A there are unique data and in Column B there are duplicate records.


Clicking on New Knowledge Base will bring up the following screen. Enter the name of the new knowledge base.

Clicking NEXT will bring up following screen where it will allow to select the EXCE file and it will also let users select the source column. I have selected Colors and Shade both as a source column.

Creating a domain is very important. Here you can create a unique domain or domain which is compositely build from Colors and Shade.

As this is the first example, I will create unique domain – for Colors I will create domain Colors and for Shade I will create domain Shade.

Here is the screen which will demonstrate how the screen will look after creating domains.

Clicking NEXT it will bring you to following screen where you can do the data discovery. Clicking on the START will start the processing of the source data provided.

Pre-processed data will show various information related to the source data. In our case it shows that Colors column have unique data whereas Shade have non-unique data and unique data rows are only two.

In the next screen you can actually add more rows as well see the frequency of the data as the values are listed unique.

Clicking next will publish the knowledge base which is just created.

Now the knowledge base is created. We will try to take any random data and attempt to do DQS implementation over it. I am using another excel sheet here for simplicity purpose. In reality you can easily use SQL Server table for the same.

Click on New Data Quality Project to see start DQS Project.

In the next screen it will ask which knowledge base to use. We will be using our Colors knowledge base which we have recently created.

In the Colors knowledge base we had two columns – 1) Colors and 2) Shade. In our case we will be using both of the mappings here. User can select one or multiple column mapping over here.

Now the most important phase of the complete project. Click on Start and it will make the cleaning process and shows various results.

In our case there were two columns to be processed and it completed the task with necessary information. It demonstrated that in Colors columns it has not corrected any value by itself but in Shade value there is a suggestion it has. We can train the DQS to correct values but let us keep that subject for future blog posts.

Now click next and keep the domain Colors selected left side. It will demonstrate that there are two incorrect columns which it needs to be corrected. Here is the place where once corrected value will be auto-corrected in future.

I manually corrected the value here and clicked on Approve radio buttons. As soon as I click on Approve buttons the rows will be disappeared from this tab and will move to Corrected Tab. If I had rejected tab it would have moved the rows to Invalid tab as well.

In this screen you can see how the corrected 2 rows are demonstrated. You can click on Correct tab and see previously validated 6 rows which passed the DQS process.

Now let us click on the Shade domain on the left side of the screen. This domain shows very interesting details as there DQS system guessed the correct answer as Dark with the confidence level of 77%. It is quite a high confidence level and manual observation also demonstrate that Dark is the correct answer. I clicked on Approve and the row moved to corrected tab.

On the next screen DQS shows the summary of all the activities. It also demonstrates how the correction of the quality of the data was performed. The user can explore their data to a SQL Server Table, CSV file or Excel.

The user also has an option to either explore data and all the associated cleansing info or data only. I will select Data only for demonstration purpose.

Clicking explore will generate the files.

Let us open the generated file. It will look as following and it looks pretty complete and corrected.

Well, we have successfully completed DQS Process. The process is indeed very easy. I suggest you try this out yourself and you will find it very easy to learn. In future we will go over advanced concepts.

Are you using this feature on your production server? If yes, would you please leave a comment with your environment and business need. It will be indeed interesting to see where it is implemented.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – DQS Error – Cannot connect to server – A .NET Framework error occurred during execution of user-defined routine or aggregate “SetDataQualitySessions” – SetDataQualitySessionPhaseTwo

Earlier I wrote a blog post about how to install DQS in SQL Server 2012. Today I decided to write a second part of this series where I explain how to use DQS, however, as soon as I started the DQS client, I encountered an error that will not let me pass through and connect with DQS client. It was a bit strange to me as everything was functioning very well when I left it last time.  The error was very big but here are the first few words of it.

Cannot connect to server.
A .NET Framework error occurred during execution of user-defined routine or aggregate “SetDataQualitySessions”:
System.Data.SqlClient.SqlException (0x80131904): A .NET Framework error occurred during execution of user-defined routine or aggregate “SetDataQualitySessionPhaseTwo”:

The error continues – here is the quick screenshot of the error.

As my initial attempts could not fix the error I decided to search online and I finally received a wonderful solution from Microsoft Site. The error has happened due to latest update I had installed on .NET Framework 4. There was a  mismatch between the Module Version IDs (MVIDs) of the SQL Common Language Runtime (SQLCLR) assemblies in the SQL Server 2012 database and the Global Assembly Cache (GAC). This mismatch was to be resolved for the DQS to work properly.

The workaround is specified here in detail. Scroll to subtopic 4.23 Some .NET Framework 4 Updates Might Cause DQS to Fail.

The script was very much straight forward.

Here are the few things to not to miss while applying workaround.

  • Make sure DQS client is properly closed
  • The NETAssemblies is based on your OS.
  • NETAssemblies for 64 bit machine – which is my machine is “c:\windows\Microsoft.NET\Framework64\v4.0.30319″. If you have Winodws installed on any other drive other than c:\windows do not forget to change that in the above path. Additionally if you have 32 bit version installed on c:\windows you should use path as “c:\windows\Microsoft.NET\Framework\v4.0.30319″
  • Make sure that you execute the script specified in 4.23 sections in this article in the database DQS_MAIN. Do not run this in the master database as this will not fix your error.
  • Do not forget to restart your SQL Services once above script has been executed.
  • Once you open the client it will work this time.

Here is the script which I have bit modified from original script. I strongly suggest that you use original script mentioned 4.23 sections. However, this one is customized my own machine.

/*
Original source: http://bit.ly/PXX4NE (Technet)
Modifications:
-- Added Database context
-- Added environment variable @NETAssemblies
-- Main script modified to use @NETAssemblies
*/
USE DQS_MAIN
GO
BEGIN
-- Set your environment variable
-- assumption - Windows is installed in c:\windows folder
DECLARE @NETAssemblies NVARCHAR(200)
-- For 64 bit uncomment following line
SET @NETAssemblies = 'c:\windows\Microsoft.NET\Framework64\v4.0.30319\'
-- For 32 bit uncomment following line
-- SET @NETAssemblies = 'c:\windows\Microsoft.NET\Framework\v4.0.30319\'
DECLARE @AssemblyName NVARCHAR(200), @RefreshCmd NVARCHAR(200), @ErrMsg NVARCHAR(200)
DECLARE ASSEMBLY_CURSOR CURSOR FOR
SELECT
name AS NAME
FROM sys.assemblies
WHERE name NOT LIKE '%ssdqs%'
AND name NOT LIKE '%microsoft.sqlserver.types%'
AND name NOT LIKE '%practices%'
AND name NOT LIKE '%office%'
AND name NOT LIKE '%stdole%'
AND name NOT LIKE '%Microsoft.Vbe.Interop%'
OPEN ASSEMBLY_CURSOR
FETCH NEXT FROM ASSEMBLY_CURSOR
INTO @AssemblyName
WHILE @@FETCH_STATUS = 0
BEGIN
BEGIN
TRY
SET @RefreshCmd = 'ALTER ASSEMBLY [' + @AssemblyName + '] FROM ''' +
@NETAssemblies + @AssemblyName + '.dll' +
''' WITH PERMISSION_SET = UNSAFE'
EXEC sp_executesql @RefreshCmd
PRINT 'Successfully upgraded assembly ''' + @AssemblyName + ''''
END TRY
BEGIN CATCH
IF ERROR_NUMBER() != 6285
BEGIN
SET
@ErrMsg = ERROR_MESSAGE()
PRINT 'Failed refreshing assembly ' + @AssemblyName + '. Error message: ' + @ErrMsg
END
END
CATCH
FETCH NEXT FROM ASSEMBLY_CURSOR
INTO @AssemblyName
END
CLOSE
ASSEMBLY_CURSOR
DEALLOCATE ASSEMBLY_CURSOR
END
GO

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – Installing Data Quality Services (DQS) on SQL Server 2012

Data Quality Services is very interesting enhancements in SQL Server 2012. My friend and SQL Server Expert Govind Kanshi have written an excellent article on this subject earlier on his blog. Yesterday I stumbled upon his blog one more time and decided to experiment myself with DQS. I have basic understanding of DQS and MDS so I knew I need to start with DQS Client. However, when I tried to find DQS Client I was not able to find it under SQL Server 2012 installation.

I quickly realized that I needed to separately install the DQS client. You will find the DQS installer under SQL Server 2012 >> Data Quality Services directory.

The pre-requisite of DQS is Master Data Services (MDS) and IIS. If you have not installed IIS, you can follow the simple steps and install IIS in your machine.

Once the pre-requisites are installed, click on MDS installer once again and it will install DQS just fine.

Be patient with the installer as it can take a bit longer time if your machine is low on configurations.

Once the installation is over you will be able to expand SQL Server 2012 >> Data Quality Services directory and you will notice that it will have a new item called Data Quality Client.  Click on it and it will open the client.

Well, in future blog post we will go over more details about DQS and detailed practical examples.

Reference: Pinal Dave (http://blog.sqlauthority.com)