Data lineage is the discipline of understanding how data flows through your organization: where it comes from, where it goes, and what happens to it along the way. Often used in support of regulatory compliance, data governance and technical impact analysis, data lineage answers these questions and more.
Whenever anyone talks about data lineage and how to achieve it, the spotlight tends to shine on automation. This is expected, as automating the process of calculating and establishing lineage is crucial to understanding and maintaining a trustworthy system of data pipelines. After all, the “utopia” of lineage is to automate everything by using various methodologies so that lineage tracking evolves into a hands-off operation without human intervention.
Little is said about descriptive or manually derived lineage (also referred to as custom technical lineage or custom lineage), an equally important tool for delivering a comprehensive lineage framework. Unfortunately, descriptive lineage doesn’t get the attention or recognition it deserves. If you say “manual stitching” among data professionals, everyone cringes and runs.
In her book, “Data Lineage from a Business Perspective,” Dr. Irina Steenbeek introduces the concept of descriptive lineage as “a method to record metadata-based data lineage manually in a repository.”
Descriptive lineage of the past
Lineage solutions in the 1990s were narrowly focused. Typically, they were based on a single technology or use case. Extraction, transformation and loading (ETL) tools dominated the data integration scene at the time, used primarily for data warehousing and business intelligence.
Vendor solutions for lineage and impact analysis only had to operate within the domain of that single solution. This made things simple. Lineage analysis was performed within a closed sandbox, compiling a matrix of connected pathways that implemented a consistent approach to connectivity with a finite set of controls and operators.
Automated lineage is more readily achieved when everything is consistent, from a single vendor and with few unknown patterns. However, lineage confined to a single tool is the equivalent of being blindfolded and locked in a closet: it shows nothing of where the data travels outside that tool.
That approach and viewpoint are now unrealistic and, frankly, useless. The modern data stack dictates that our lineage solutions be far more nimble and able to support a vast number of solutions. Now, lineage must be able to provide tools to connect things by using nuts and bolts when there aren’t any other methods.
Descriptive lineage use cases
When discussing use cases for descriptive lineage, it is important to consider the target user community for each. The first two use cases are primarily aimed at a technical audience, as the lineage definitions apply to actual physical assets.
The last two use cases are more abstract, at a higher level, and have direct appeal to less technical users interested in the big picture. However, even low-level lineage for physical assets has value for everyone because it gets summarized by lineage tools and bubbles up to “big picture” insights beneficial to the entire organization.
Critical and quick bridges
The demand for lineage extends far beyond dedicated systems such as the ETL example. Yet descriptive lineage is often needed even in that single-tool scenario, because there, too, you discover situations that automation cannot cover.
Examples include rarely seen usage patterns understood only by deep experts of a particular tool, strange new syntax that parsers are unable to comprehend, short-lived but inevitable anomalies, missing chunks of source code, and complex wrappers around legacy routines and procedures. Simple scripted or manually copied sequential (flat) files are also covered by this use case.
Descriptive lineage enables you to bind assets together that aren’t otherwise connected automatically. This applies to assets disconnected due to technological limitations, true missing links or lack of permission to access the actual source code.
In this use case, descriptive lineage extends the lineage we already have, making it more complete, filling gaps and crossing bridges. This is also known as hybrid lineage, which takes maximum advantage of automation while complementing it with more assets and connection points.
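As a rough illustration of the hybrid idea, the following Python sketch merges edges found by an automated scanner with edges recorded by hand, then traces everything upstream of a report. The `Edge` class, the `merge_lineage` and `upstream` functions, and all asset names are hypothetical and invented for this example. They do not reflect any particular lineage product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    origin: str  # "automated" or "descriptive" (labels assumed for this sketch)


def merge_lineage(automated, descriptive):
    """Combine both edge sets; an automated edge wins if both describe the same link."""
    merged = {}
    for edge in descriptive:
        merged[(edge.source, edge.target)] = edge
    for edge in automated:
        merged[(edge.source, edge.target)] = edge
    return list(merged.values())


def upstream(edges, asset):
    """Return every asset that feeds `asset`, directly or indirectly."""
    parents = {}
    for edge in edges:
        parents.setdefault(edge.target, set()).add(edge.source)
    seen = set()
    stack = [asset]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Edges a scanner could parse (hypothetical asset names).
automated = [
    Edge("crm.orders", "staging.orders", "automated"),
    Edge("warehouse.orders", "report.sales", "automated"),
]
# A manually copied flat file the scanner cannot see, recorded by hand.
descriptive = [
    Edge("staging.orders", "warehouse.orders", "descriptive"),
]

hybrid = merge_lineage(automated, descriptive)
print(sorted(upstream(automated, "report.sales")))  # ['warehouse.orders']
print(sorted(upstream(hybrid, "report.sales")))
# ['crm.orders', 'staging.orders', 'warehouse.orders']
```

With automated edges alone, the trace stops at the warehouse table. The single descriptive edge is what carries the trace back to the original CRM source.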
Support for new tools
Ever-expanding technology portfolios present the next major use case for descriptive lineage. As our industry explores new domains and solutions to maximize the value of our data, we witness the proliferation of environments where everything interacts with our data.
It is rare for a site to have just one dedicated toolset. Data is touched and manipulated by a myriad of solutions, including on-premises and cloud transformation tools, databases and data lakehouses. Resources from legacy systems, both defunct and active, along with new reporting tools, also play a role.
The sheer array of technologies in use today is mind-boggling and ever-growing. While automated lineage across the spectrum might be the objective, there aren’t enough vendors, practitioners and solution providers to create an ultimate automation “easy button” for such a complex universe.
Therefore, there is a need for descriptive lineage to define new systems, new data assets and new connection points, and connect them to what has already been parsed or tracked by using automation.
Application-level lineage
Descriptive lineage is also used for higher-level or application-level lineage, sometimes called business lineage. This is often difficult to achieve by using automation, precisely because there are no fixed industry definitions for application-level lineage.
The perfect definition of high-level lineage for one user or group of users might not fit the exact design envisioned by your lead data architects. Descriptive lineage enables you to define the lineage you need, at whatever depth is required.
This is a truly fit-for-purpose lineage, typically staying at high levels of abstraction, not even mentioning anything deeper than a particular database cluster or the name of an application area. For certain parts of a financial organization, lineage might be generic, leading to a target area called “risk aggregation.”
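One possible way to picture application-level lineage is as a small set of links between named application areas. Some of those links can be summarized from physical lineage, while others are simply declared by hand. The sketch below assumes a hypothetical mapping from physical assets to application areas. The `application_lineage` function and every name in the sample data are invented for illustration.

```python
def application_lineage(asset_edges, asset_to_app, declared_edges=()):
    """Summarize asset-level edges into application-level edges,
    then add links that were declared manually at the application level."""
    app_edges = set(declared_edges)
    for source, target in asset_edges:
        src_app = asset_to_app.get(source)
        tgt_app = asset_to_app.get(target)
        if src_app is None or tgt_app is None:
            continue  # asset not assigned to any application area
        if src_app != tgt_app:  # hide movement inside one application
            app_edges.add((src_app, tgt_app))
    return sorted(app_edges)


# Physical lineage (hypothetical assets).
asset_edges = [
    ("trades_db.positions", "risk_db.exposures"),
    ("risk_db.exposures", "risk_db.exposures_agg"),
]
# Hypothetical assignment of assets to application areas.
asset_to_app = {
    "trades_db.positions": "Trading",
    "risk_db.exposures": "Risk aggregation",
    "risk_db.exposures_agg": "Risk aggregation",
}
# A high-level link recorded by hand, with no physical detail behind it.
declared_edges = [("Risk aggregation", "Regulatory reporting")]

for source_app, target_app in application_lineage(asset_edges, asset_to_app, declared_edges):
    print(f"{source_app} -> {target_app}")
# Risk aggregation -> Regulatory reporting
# Trading -> Risk aggregation
```

The movement between the two tables inside the risk database disappears from the summary, which is the point: the high-level view names application areas and nothing deeper.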
Future lineage
One more use case for descriptive lineage is “to-be” or future lineage. The ability to model the lineage of future applications (especially when realized in a hybrid form alongside existing lineage definitions) helps the organization assess the work effort, measure the potential impact on existing teams and systems, and track progress along the way.
Descriptive lineage for future applications is not hindered by the fact that the source code has not yet been written or released, isn’t running in production or is only outlined on a chalkboard. Future lineage can exist independently or be combined with existing lineage in the hybrid model described earlier.
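To make the impact assessment concrete, here is a minimal sketch in which each edge carries a status and a helper lists the existing assets that sit at or downstream of a planned connection. The status labels "existing" and "planned", the `impacted_existing_assets` function and the asset names are all assumptions made for this example.

```python
from collections import defaultdict


def impacted_existing_assets(edges):
    """Return existing assets that are at, or downstream of, any planned edge.

    Each edge is a (source, target, status) tuple, where status is
    "existing" or "planned" (labels assumed for this sketch).
    """
    children = defaultdict(set)
    existing_assets = set()
    frontier = []
    for source, target, status in edges:
        children[source].add(target)
        if status == "existing":
            existing_assets.update((source, target))
        elif status == "planned":
            frontier.append(target)  # a planned edge lands on this asset

    reached = set()
    while frontier:
        node = frontier.pop()
        if node in reached:
            continue
        reached.add(node)
        frontier.extend(children[node])
    return reached & existing_assets


edges = [
    ("crm.customers", "warehouse.customers", "existing"),
    ("warehouse.customers", "report.churn", "existing"),
    # To-be lineage, described before any code exists.
    ("loyalty_app.events", "warehouse.customers", "planned"),
    ("warehouse.customers", "ml.churn_model", "planned"),
]

print(sorted(impacted_existing_assets(edges)))
# ['report.churn', 'warehouse.customers']
```

Here the planned loyalty feed touches the existing warehouse table and, through it, the churn report, so the owners of both would be candidates for an early conversation about the change.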
These are just some of the ways that descriptive lineage complements overall objectives for lineage visibility across the enterprise. Descriptive lineage fills in the blanks, supports future designs, bridges gaps and augments your overall lineage solutions, yielding deeper insights into your environment that lead to increased trust and the ability to make better business decisions.
Enhance your applications with descriptive lineage. Gain insights and make better decisions. Contact your IBM representative for more information.