SQL SERVER – Who needs ETL Version Control?

While making some changes (read: mistakes) to my ETL business logic the other day, it occurred to me much too late that those unfortunate changes had replaced the once properly working logic with now very flawed logic.  The good news was that I remembered what the working logic was supposed to be.  The bad news, I had to re-create it.  Had I had the working logic already checked-in under version control, I could have saved myself the two hours of wasted time and effort.  In an ETL team development setting, these types of issues could easily multiply and significantly impede developer progress and productivity.

Version control makes it possible to have multiple people developing together and working on same project.  If I am working on something, a fellow ETL developer will not be able to check it out and make changes to it until I am finished with my changes and check those back in.   This addresses the common situation where one developer’s project artifact and code changes clobber that of another developer by accident.

Another reason to have version control is that it gives the developer confidence to try different modifications to the logic without having to worry about a negative impact to the previously developed logic.   Sure, I could always save off my work and create a copy to try things out on.  However, in my experience, I would wind up with various different copies all over the place and it would then become difficult to keep track of what’s what.  Version control allows me to capture and document relevant information about whatever I am checking in.

Finally, version control maintains a history of changes that have been made to the application logic.  In some environments, this information and traceability can prove extremely valuable if not necessary.   Ever wonder when a bug was introduced into the application logic?

etlversion1 SQL SERVER   Who needs ETL Version Control?

Click to Enlarge

The expressor Studio desktop ETL tool allows developers to seamlessly check-in and -out project artifacts to expressor’s Repository, which supports version control and is available as part of their Standard Edition product offering.

Many data integration solutions do not have built-in version control.   Most offer some way to interface to version control systems, but expressor has it built-in with their Standard Edition product offering.   I requested a trial license to play with it and was pleasantly surprised at expressor’s seamless version control integration.  There was no need to do any separate 3rd party download, install, and configuration.  With expressor, check-in and check-out was almost transparent.  As someone who has developed ETL solutions, I really appreciated the clean and simple approach expressor has taken.   I have no doubt this would help make any ETL project go smoother and faster.

You can request a free 30-day trial of the expressor Standard Edition, or download their free expressor Studio community ETL tool at their site.

Reference:  Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – Introduction to Adaptive ETL Tool – How adaptive is your ETL?

I am often reminded by the fact that BI/data warehousing infrastructure is very brittle and not very adaptive to change. There are lots of basic use cases where data needs to be frequently loaded into SQL Server or another database. What I have found is that as long as the sources and targets stay the same, SSIS or any other ETL tool for that matter does a pretty good job handling these types of scenarios.

But what happens when you are faced with more challenging scenarios, where the data formats and possibly the data types of the source data are changing from customer to customer?  Let’s examine a real life situation where a health management company receives claims data from their customers in various source formats. Even though this company supplied all their customers with the same claims forms, they ended up building one-off ETL applications to process the claims for each customer.

Why, you ask? Well, it turned out that the claims data from various regional hospitals they needed to process had slightly different data formats, e.g. “integer” versus “string” data field definitions.  Moreover the data itself was represented with slight nuances, e.g. “0001124” or “1124” or “0000001124” to represent a particular account number, which forced them, as I eluded above, to build new ETL processes for each customer in order to overcome the inconsistencies in the various claims forms.  As a result, they experienced a lot of redundancy in these ETL processes and recognized quickly that their system would become more difficult to maintain over time.

So imagine for a moment that you could use an ETL tool that helps you abstract the data formats so that your ETL transformation process becomes more reusable. Imagine that one claims form represents a data item as a string – acc_no(varchar) – while a second claims form represents the same data item as an integer – account_no(integer). This would break your traditional ETL process as the data mappings are hard-wired.  But in a world of abstracted definitions, all you need to do is create parallel data mappings to a common data representation used within your ETL application; that is, map both external data fields to a common attribute whose name and type remain unchanged within the application.

acc_no(varchar) is mapped to account_number(integer)

expressor Studio first claim form schema mapping

expressor Studio first claim form schema mapping

account_no(integer) is also mapped to account_number(integer)

expressor Studio second claim form schema mapping

expressor Studio second claim form schema mapping

All the data processing logic that follows manipulates the data as an integer value named account_number.

Well, these are the kind of problems that that the expressor data integration solution automates for you.  I’ve been following them since last year and encourage you to check them out by downloading their free expressor Studio ETL software.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – 4 Tips for ETL Software IDE Developers

In a previous blog, I introduced the notion of Semantic Types. To an end-user, a seamlessly integrated semantic typing engine significantly increases the ease of use of an ETL IDE (integrated development environment, or developer studio). This led me to think about other ease-of-use issues I have encountered while building ETL applications.

When I get stumped while programming, I find myself asking the variations on these questions:

  • “How do I…?”
  • “Now what?”
  • “Why isn’t this working?”
  • “Why do I have to redo the work I just did?”

It seems to me that a good ETL IDE will anticipate these questions and seek to answer them before they are even asked. So here are my tips to help software vendors build developer IDEs that actually make development easier.

How do I…?

While developing an ETL application, have you ever asked yourself: “How do I set up the connection to my SQL Server database?”,“How do I import my table definitions from Access?”, etc. An easy answer might be “read the manual” but sometimes product manuals are not robust or easily accessible. So, integrating robust how-to instructions directly into your ETLstudio would help users get the information they need at the time they need it.

Now what?

IDEs in general know where you last clicked or performed an action using an input device such as a keyboard; so they should be able to reasonably predict the design context you are in and suggest the next steps accordingly. Context-sensitive suggestions based on the state of the user’s work will help users move forward in ETL application development.

Why isn’t this working? Or why do I have to wait till I compile to be told about a critical design issue?

If an ETL IDE is smart enough to signal to users what in their design structures is left to be completed or has been completed incorrectly, then the developer can spend much less time in the designàcompileàerror-correct loop. Just-in-time validation helps users detect and correct programming errors earlier in the ETL development life cycle.

Why do I have to redo the work I just did?

In ETL development, schemas, transformation rules, connectivity objects, etc., can be reused in various situations. Using mouse-clicks to build and manage libraries of reusable design objects implies that the application development effort should decrease over time and as the library acquires more objects.

I met a great company at SQL Pass that is trying to address many of these usability issues. Check them out at Expressor.

What other ease-of-use suggestions do you have for ETL software vendors? Please post your valuable comments.

Reference: Pinal Dave (http://blog.SQLAuthority.com)