SQL SERVER – Advanced Data Quality Services with Melissa Data – Azure Data Market

There has been much fanfare over the new SQL Server 2012, and especially around its new companion product Data Quality Services (DQS). Among the many new features is the addition of this integrated knowledge-driven product that enables data stewards everywhere to profile, match, and cleanse data. In addition to the homegrown rules that data stewards can design and implement, there are also connectors to third party providers that are hosted in the Azure Datamarket marketplace.  In this review, I leverage SQL Server 2012 Data Quality Services, and proceed to subscribe to a third party data cleansing product through the Datamarket to showcase this unique capability.

Crucial Questions

For the purposes of the review, I used a database I had in an Excel spreadsheet with name and address information. Upon a cursory inspection, there are miscellaneous problems with these records; some addresses are missing ZIP codes, others missing a city, and some records are slightly misspelled or have unparsed suites. With DQS, I can easily add a knowledge base to help standardize my values, such as for state abbreviations. But how do I know that my address is correct? And if my address is not correct, what should it be corrected to? The answer lies in a third party knowledge base by the acknowledged USPS certified address accuracy experts at Melissa Data.

Reference Data Services

Within DQS there is a handy feature to actually add reference data from many different third-party Reference Data Services (RDS) vendors. DQS simplifies the processes of cleansing, standardizing, and enriching data through custom rules and through service providers from the Azure Datamarket. A quick jump over to the Datamarket site shows me that there are a handful of providers that offer data directly through Data Quality Services. Upon subscribing to these services, one can attach a DQS domain or composite domain (fields in a record) to a reference data service provider, and begin using it to cleanse, standardize, and enrich that data. Besides what I am looking for (address correction and enrichment), it is possible to subscribe to a host of other services including geocoding, IP address reference, phone checking and enrichment, as well as name parsing, standardization, and genderization.  These capabilities extend the data quality that DQS has natively by quite a bit.

For my current address correction review, I needed to first sign up to a reference data provider on the Azure Data Market site. For this example, I used Melissa Data’s Address Check Service. They offer free one-month trials, so if you wish to follow along, or need to add address quality to your own data, I encourage you to sign up with them.

Once I subscribed to the desired Reference Data Provider, I navigated my browser to the Account Keys within My Account to view the generated account key, which I then inserted into the DQS Client – Configuration under the Administration area.

Step by Step to Guide

That was all it took to hook in the subscribed provider -Melissa Data- directly to my DQS Client. The next step was for me to attach and map in my Reference Data from the newly acquired reference data provider, to a domain in my knowledge base.

On the DQS Client home screen, I selected “New Knowledge Base” under Knowledge Base Management on the left-hand side of the home screen.

Under New Knowledge Base, I typed a Name and description of my new knowledge base, then proceeded to the Domain Management screen.

Here I established a series of domains (fields) and then linked them all together as a composite domain (record set). Using the Create Domain button, I created the following domains according to the fields in my incoming data:

  1. Name
  2. Address
  3. Suite
  4. City
  5. State
  6. Zip

I added a Suite column in my domain because Melissa Data has the ability to return missing Suites based on last name or company. And that’s a great benefit of using these third party providers, as they have data that the data steward would not normally have access to. The bottom line is, with these third party data providers, I can actually improve my data.

Next, I created a composite domain (fulladdress) and added the (field) domains into the composite domain. This essentially groups our address fields together in a record to facilitate the full address cleansing they perform.

I then selected my newly created composite domain and under the Reference Data tab, added my third party reference data provider –Melissa Data’s Address Check- and mapped in each domain that I had to the provider’s Schema.

Now that my composite domain has been married to the Reference Data service, I can take the newly published knowledge base and create a project to cleanse and enrich my data.

My next task was to create a new Data Quality project, mapping in my data source and matching it to the appropriate domain column, and then kick off the verification process. It took just a few minutes with some progress indicators indicating that it was working.

When the process concluded, there was a helpful set of tabs that place the response records into categories: suggested; new; invalid; corrected (automatically); and correct. Accepting the suggestions provided by  Melissa Data allowed me to clean up all the records and flag the invalid ones. It is very apparent that DQS makes address data quality simplistic for any IT professional.

Final Note

As I have shown, DQS makes data quality very easy. Within minutes I was able to set up a data cleansing and enrichment routine within my data quality project, and ensure that my address data was clean, verified, and standardized against real reference data. As reviewed here, it’s easy to see how both SQL Server 2012 and DQS work to take what used to require a highly skilled developer, and empower an average business or database person to consume external services and clean data.

Reference: Pinal Dave (http://blog.sqlauthority.com)

About these ads

6 thoughts on “SQL SERVER – Advanced Data Quality Services with Melissa Data – Azure Data Market

  1. So, it is something when you add a reference a Data form a source and thus validate info’r in your project ..Such as …If you have a National ID’s Database connected and you are offering a Service by National Id, i can connect to National ID database and verify info’r right?

    Like

    • Hi msuworld, This is something that is not available right now. As of now [13 dec 2012], we can only use reference data sets from Azure DataMarket. In future, the product group on MSDN forums had hinted that we may be able to provide our own reference data sets like the one that you mentioned about National ID. I hope that helps.

      Like

  2. Pinal Sir,
    I am trying to do this since last couple of hours in DQS but am not able to figure it out.
    I have one CSV master file with two columns code,name. and I have one source csv file which has only name. I want to match name in source csv file with name in CSV master file, for matching name I want to generate new csv file with code,name. I know I can do this by simple join on name column in SQL Server but there are lot more things which I want to add to it so I want to do it by DQS. Below is example of my files.
    Master file
    1,Ahmedabad
    2,Bangalore
    3,Delhi

    Source file
    Bangalore

    My output file should look like
    2,Bangalore

    Like

  3. Pingback: SQL SERVER – Weekly Series – Memory Lane – #050 | Journey to SQL Authority with Pinal Dave

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s