Define and analyze the impact radius of data model changes
Data transformation tools like dbt make building SQL data pipelines easy and systematic. But despite the added structure and clearly defined data models, pipelines can still end up complex, which makes debugging problems and validating changes to data models difficult.
The rising complexity of data transformation logic gives rise to the following issues:
1. Traditional code review processes only look at code changes and ignore the data impact of those changes.
2. Data impact resulting from code changes is hard to trace. In sprawling DAGs with nested dependencies, discovering how and where data impact occurs is extremely time-consuming, or near impossible.
GitLab's dbt DAG (shown in the featured image above) is a perfect example of a data project that has become a house of cards. Imagine trying to follow a simple SQL logic change to a column through this entire lineage DAG. Reviewing a data model update can be a daunting task.
How could you approach this type of review?
What is data validation?
Data validation refers to the process used to determine that the data is accurate in terms of real-world requirements. This means ensuring that the SQL logic in a data model behaves as intended by verifying that the data is correct. Validation is typically performed after changing a data model, such as when accommodating new requirements, or as part of a refactor.
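As a minimal sketch, a real-world requirement can be expressed directly as a validation query. The model and column names below (orders, order_total) are assumptions chosen purely for illustration; any returned rows indicate a validation failure:

-- Illustrative validation check: an order total should never be negative
select
    order_id,
    order_total
from {{ ref('orders') }}
where order_total < 0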
A unique review challenge
Data has state and is directly affected by the transformations used to generate it. This is why reviewing data model changes is a unique challenge: both the code and the data need to be reviewed.
Because of this, data model updates must be reviewed not only for correctness but also for context. In other words, you need to confirm that the data is accurate and that existing data and metrics were not accidentally altered.
Two extremes of data validation
In most data teams, the person making the change relies on institutional knowledge, intuition, or past experience to assess the impact and validate the change.
“I’ve made a change to X, and I think I know what the impact should be. I’ll check it by running Y.”
The validation approach typically falls into one of two extremes, neither of which is ideal:
1. Spot-checking with queries and some high-level checks like row count and schema. It’s fast but risks missing real impact; critical and silent errors can go unnoticed. (A sketch of this kind of check follows this list.)
2. Exhaustive checking of every single downstream model. It’s slow and resource intensive, and becomes increasingly expensive as the pipeline grows.
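As a minimal sketch of the first extreme, a spot check might compare the changed model against its production counterpart. The schema names dev and prod and the model name my_model are assumptions for illustration only:

-- Illustrative spot check 1: compare row counts between the dev and prod builds
select 'dev' as env, count(*) as row_count from dev.my_model
union all
select 'prod' as env, count(*) as row_count from prod.my_model;

-- Illustrative spot check 2: rows present in the dev build but missing from prod
select * from dev.my_model
except
select * from prod.my_model;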
This results in a data review process that is unstructured, hard to repeat, and prone to letting silent errors through. A new approach is needed, one that supports the engineer in carrying out accurate and targeted data validation.
A better approach: understanding data model dependencies
To validate a change to a data project, it’s essential to understand the relationships between models and how data flows through the project. These dependencies between models tell us how data is passed and transformed from one model to another.
Analyze the relationships between models
As we’ve seen, data project DAGs can be big, but a data model change only affects a subset of models. By isolating this subset and then analyzing the relationships between those models, you can peel back the layers of complexity and focus only on the models that actually require validating, given a specific SQL logic change.
The types of dependencies in a data project are:
Model-to-model
A structural dependency in which columns are selected from an upstream model.
-- downstream_model
select
    a,
    b
from {{ ref("upstream_model") }}
Column-to-column
A projection dependency that selects, renames, or transforms an upstream column.
-- downstream_model
select
    a,
    b as b2
from {{ ref("upstream_model") }}
Model-to-column
A filter dependency in which a downstream model uses an upstream column in a where, join, or other conditional clause.
-- downstream_model
select
    a
from {{ ref("upstream_model") }}
where b > 0
Understanding the dependencies between models helps us define the impact radius of a data model logic change.
Identify the impact radius
When making changes to a data model’s SQL, it’s important to know which other models could be impacted (the models you need to check). At a high level, this is determined by model-to-model relationships. This subset of DAG nodes is known as the impact radius.
In the DAG below, the impact radius includes nodes B (the modified model) and D (the downstream model). In dbt, these models can be identified using the modified+ selector.

Photo Credit: https://towardsdatascience.com/
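As a rough sketch of how this looks in practice, assuming a copy of the production manifest.json has been saved to a prod-artifacts directory (a path chosen here purely for illustration), dbt can list the changed models and everything downstream of them using state comparison:

dbt ls --select state:modified+ --state prod-artifacts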
Identifying the modified nodes and their downstream dependents is a great start, and by isolating changes like this you’ll reduce the potential data validation area. However, this can still leave a large number of downstream models to check.
Classifying the types of SQL changes can further help you prioritize which models actually need validation by understanding the severity of each change, eliminating branches with changes that are known to be safe.
Classify the SQL change
Not all SQL changes carry the same level of risk to downstream data, and so they should be classified accordingly. By classifying SQL changes in this way, you can bring a systematic approach to your data review process.
A SQL change to a data model can be categorized as one of the following:
Non-breaking change
Changes that do not impact the data in downstream models, such as adding new columns, changing SQL formatting, or adding comments.
-- Non-breaking change: new column added
select
    id,
    category,
    created_at,
    -- new column
    now() as ingestion_time
from {{ ref('a') }}
Partial-breaking change
Changes that only impact downstream models that reference certain columns, such as removing or renaming a column, or modifying a column definition.
-- Partial-breaking change: `category` column renamed
select
    id,
    created_at,
    category as event_category
from {{ ref('a') }}
Breaking change
Changes that impact all downstream models, such as filtering, sorting, or otherwise changing the structure or meaning of the transformed data.
-- Breaking change: filtered to exclude data
select
    id,
    category,
    created_at
from {{ ref('a') }}
where category != 'internal'
Apply classification to reduce scope
After applying these classifications, the impact radius, and the number of models that need to be validated, can be significantly reduced.

Photo Credit: https://towardsdatascience.com/
In the above DAG, nodes B, C, and F have been changed, resulting in potentially 7 nodes that need to be validated (C to E). However, not every branch contains SQL changes that actually require validation. Let’s take a look at each branch:
Node C: Non-breaking change
C is classified as a non-breaking change. Therefore, both C and H do not need to be checked and can be eliminated.
Node B: Partial-breaking change
B is classified as a partial-breaking change due to the modification of column B.C1. Therefore, D and E only need to be checked if they reference column B.C1.
Node F: Breaking change
The change to model F is classified as a breaking change. Therefore, all downstream nodes (G and E) need to be checked for impact. For instance, model G might aggregate data from the modified upstream column, as sketched below.
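As a minimal sketch of what that targeted check could look like, assuming G is rebuilt into a dev schema and compared against the production build (the schema names, model name g, and columns category and total_amount are illustrative assumptions):

-- Illustrative check: compare G's aggregates between the dev and prod builds
select
    coalesce(d.category, p.category) as category,
    d.total_amount as dev_total,
    p.total_amount as prod_total
from dev.g as d
full outer join prod.g as p
    on d.category = p.category
where d.total_amount is distinct from p.total_amount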
The initial 7 nodes have already been reduced to 5 that need to be checked for data impact (B, D, E, F, G). Now, by examining the SQL changes at the column level, we can reduce that number even further.
Narrowing the scope further with column-level lineage
Breaking and non-breaking changes are easy to classify but, when it comes to inspecting partial-breaking changes, the models need to be analyzed at the column level.
Let’s take a closer look at the partial-breaking change in model B, in which the logic of column c1 has been modified. This change could potentially result in four impacted downstream nodes: D, E, K, and J. By tracking column usage downstream, this subset can be reduced further.

Photo Credit: https://towardsdatascience.com/
Following column B.C1 downstream, we can see that:
B.C1 → D.C1 is a column-to-column (projection) dependency.
D.C1 → E is a model-to-column dependency.
D → K is a model-to-model dependency. However, as D.C1 is not used in K, this model can be removed.
Therefore, the models that need to be validated in this branch are B, D, and E. Together with the breaking change in F and its downstream model G, the total set of models to be validated in this diagram is F, G, B, D, and E, or just five out of a total of nine potentially affected models.
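As a minimal sketch, the column-to-column dependency B.C1 → D.C1 could be validated with a targeted column diff, assuming dev and prod builds of model D exist side by side (the schema names and the key column id are illustrative assumptions):

-- Illustrative column-level check: c1 values that differ between builds of D
select id, c1 from dev.d
except
select id, c1 from prod.d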
Conclusion
Data validation after a model change is hard, especially in large and complex DAGs. It’s easy to miss silent errors, and performing validation becomes a daunting task, with data models often feeling like black boxes when it comes to downstream impact.
A structured and repeatable process
By using this change-aware data validation approach, you can bring structure and precision to the review process, making it systematic and repeatable. This reduces the number of models that need to be checked, simplifies the review process, and lowers cost by only validating the models that truly require it.