[Notes from Pinal]: Everybody knows about ETL Processes. It is a simple process where data is extracted, transformed and loaded into an application or database. However, still today, not many people know about how to do data lineage tracking in ETL processes.
In this 124th episode of the Notes from the Fields series database expert Tim Mitchell (partner at Linchpin People) shares very interesting conversation related to how to get started with the FOR LOOP Container.
Anyone who has worked in data integration for a significant length of time has almost certainly been confronted with the following question:
“Where did this data come from?”
Extract, transform, and load (ETL) operations frequently involve the movement and consolidation of data from multiple sources through numerous transformation steps before being routed to its final resting place. As the ETL touchpoints increase in number and complexity, so does the difficulty in tracking the data back to its origins. The concept of data lineage in ETL is intended to make this process easier. Data lineage tracking involves building ETL elements in such a way that each row of data in the destination tables is unambiguously traceable back to it source, including any transformation process through which it passed during its travels.
Building ETL processes that include data lineage tracking takes extra work and careful planning. Sadly, the majority of ETL processes I’ve found in the wild do not have provisions for capturing data lineage. Because data lineage tracking does not add to the core functions of extraction, transformation, and loading, this design is often skipped during architecture or build. Much like the related ETL best practices of logging and auditing, data lineage is a typically unseen yet still valuable component of a well-designed ETL architecture.
Why Use Data Lineage Tracking?
There are several benefits that are realized with proper ETL data lineage tracking:
- Trustworthiness. When the origin of each row of data and the path it took to arrive is systematically tracked, users and administrators of the data will have more reason to trust the data.
- Easier troubleshooting. When the ETL data path is self-describing, it makes testing and troubleshooting far easier.
- Expose leaky processes. When each row of data is trackable from source to destination, it helps to reveal any holes in the process where data might be lost.
- Visibility. I can’t count the number of times I have been asked by a client to hunt down all of the business rules in their ETL processes. A side benefit of data lineage tracking is that the places where the business rules hide become more evident.
Not all load mechanisms require data lineage tracking. Very simple processes, those that load volatile tables, and some single-use, “throwaway” code does not have the same need for tracking data lineage. However, these are the exception rather than the rule.
Conclusion
Data lineage tracking in ETL processes is a best practice for most loads. Although it takes time to properly design and implement this pattern, the value gained is almost always worth the effort.
Reference: Pinal Dave (https://blog.sqlauthority.com)
1 Comment. Leave new
I agree 100% that one should consider data lineage in design time but you also need an automated way how to track it in a life system. I recommend to check getmanta.com to solve this challenge.