Data Governance solution for Databricks, Synapse and ADLS gen2


Problem description

I'm new to data governance, so forgive me if the question lacks some information.

We're building a data lake & enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS gen2, Databricks and Synapse for our ETL processing, data science, ML & QA activities.

We already have about a hundred input tables and 25 TB per year. In the future we're expecting more.

The business has a strong requirement inclining towards cloud-agnostic solutions. Still, they are okay with Databricks since it's available on both AWS and Azure.

What would be the best data governance solution for our stack and requirements?

I haven't used any data governance solutions yet. I like the AWS Data Lake solution, since it provides basic functionality out of the box. AFAIK, Azure Data Catalog is outdated, because it doesn't support ADLS gen2.

After some quick googling I found three options:

  1. Databricks Privacera
  2. Databricks Immuta
  3. Apache Ranger & Apache Atlas

Currently I'm not even sure whether the 3rd option fully supports our Azure stack. Moreover, it would take a much bigger development (infrastructure definition) effort. So are there any reasons I should look in the Ranger/Atlas direction?

What are the reasons to prefer Immuta over Privacera, or vice versa?

Are there any other options I should evaluate?

From a data governance perspective, we have done only the following things:

  1. Defined data zones within ADLS
  2. Applied encryption/obfuscation to sensitive data due to GDPR requirements
  3. Implemented row-level security (RLS) on the Synapse and Power BI layers
  4. Built a custom audit framework that logs what is persisted and when

To do:

  1. Data lineage and a single source of truth. Even only 4 months in, understanding the dependencies between datasets is already a pain. The lineage information is stored inside Confluence, which is hard to maintain and keep continuously updated across multiple places; even now it's outdated in some of them.
  2. Security. In the future, business users may do some data exploration in Databricks notebooks. We need RLS for Databricks.
  3. Data lifecycle management.
  4. Maybe other data-governance-related things like data quality, etc.

Answer

I am currently exploring Immuta and Privacera, so I can't yet comment in detail on the differences between the two. So far, Immuta gave me a better impression with its elegant policy-based setup.

Still, there are ways to solve some of the issues you mentioned above without buying an external component:

1. Security

  • For RLS, consider using Table ACLs and giving access only to certain Hive views (a minimal sketch follows at the end of this section).

For getting access to data inside ADLS, look at enabling credential passthrough on clusters. Unfortunately, that disables Scala.

You still need to set up permissions on Azure Data Lake Gen 2 itself, which is an awful experience when it comes to granting permissions on existing child items.

Please avoid creating dataset copies with column/row subsets, as data duplication is never a good idea.
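
To illustrate the Table ACL approach from the bullet above, here is a minimal sketch as it could look in a Databricks notebook (the table, view and group names are hypothetical, and the cluster must have Table ACLs enabled):

# Minimal sketch: RLS by exposing only a filtered Hive view via Table ACLs.
# Table, view and group names are hypothetical.
spark.sql("""
    CREATE VIEW IF NOT EXISTS silver.subscribers_nl AS
    SELECT * FROM silver.subscribers WHERE country_code = 'NL'
""")
# Grant access to the filtered view only, never to the underlying table.
spark.sql("GRANT SELECT ON VIEW silver.subscribers_nl TO `nl-analysts`")
spark.sql("DENY SELECT ON TABLE silver.subscribers TO `nl-analysts`")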

2. Lineage

  • One option would be to look into Apache Atlas & Spline. Here is one example of how to set this up: https://medium.com/@reenugrewal/data-lineage-tracking-using-spline-on-atlas-via-event-hub-6816be0fd5c7
  • Unfortunately, Spline is still under development; even reproducing the setup mentioned in the article is not straightforward. The good news is that Apache Atlas 3.0 has many definitions available for Azure Data Lake Gen 2 and other sources (a minimal Spline configuration sketch follows this list).
  • In a few projects, I ended up creating custom logging of reads/writes (it seems you went down this path as well). Based on these logs, I created a Power BI report to visualize the lineage.
  • Consider using Azure Data Factory for orchestration. With a proper ADF pipeline structure, you get high-level lineage that helps you see dependencies and rerun failed activities. You can read a bit more here: https://mrpaulandrew.com/2020/07/01/adf-procfwk-v1-8-complete-pipeline-dependency-chains-for-failure-handling/
  • Take a look at Marquez (https://marquezproject.github.io/marquez/), a small open-source library that has some nice features, including data lineage.
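
As a rough illustration of the Spline option, here is a minimal sketch of attaching the Spline Spark agent to a PySpark session; the producer URL is an assumption, and the agent jar must already be on the cluster classpath:

# Minimal sketch: report every Spark read/write to a Spline backend.
# The gateway URL below is hypothetical; check the Spline docs for versions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lineage-demo")
         .config("spark.sql.queryExecution.listeners",
                 "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener")
         .config("spark.spline.lineageDispatcher.http.producer.url",
                 "http://spline-gateway:8080/producer")
         .getOrCreate())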

3. Data quality

  • So far we have only looked into Amazon Deequ: Scala-based, but it comes with some nice predefined data quality functions.
  • In many projects we ended up writing integration tests that check data quality between bronze (raw) and silver (standardized). Nothing fancy, pure PySpark (a minimal sketch follows this list).
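
For instance, a minimal bronze-to-silver check of the kind mentioned above could look like this in PySpark (paths and the key column name are hypothetical):

# Minimal sketch of a bronze -> silver data quality integration test.
# Paths and the call_id key column are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
bronze = spark.read.format("delta").load("/mnt/lake/bronze/cdr")
silver = spark.read.format("delta").load("/mnt/lake/silver/cdr")

# No silent row loss during standardization.
assert bronze.count() == silver.count(), "row count mismatch bronze -> silver"
# The key column must be non-null and unique in silver.
assert silver.filter(F.col("call_id").isNull()).count() == 0, "null call_id found"
assert silver.select("call_id").distinct().count() == silver.count(), "duplicate call_id"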

4. Data lifecycle management

  • One option is to use the data lake storage's native lifecycle management. With Delta/Parquet formats, though, that's not a viable alternative.

If you use the Delta format, you can apply retention or pseudonymization more easily, as in the sketch below.
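
For example, a retention rule on a Delta table can be a short sketch like this (path, column name and retention period are hypothetical; assumes a Databricks notebook where spark is predefined):

# Minimal sketch: retention on a Delta table.
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "/mnt/lake/silver/cdr")
# Drop records older than the (hypothetical) 365-day retention period.
table.delete("event_date < current_date() - INTERVAL 365 DAYS")
# Physically remove the deleted files once the history window has passed.
table.vacuum(168)  # retention in hours (7 days)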

A second option: imagine that you have a table with information about all datasets (dataset_friendly_name, path, retention time, zone, sensitive_columns, owner, etc.). Your Databricks users use a small wrapper to read/write:

DataWrapper.Read("dataset_friendly_name")
DataWrapper.Write("destination_dataset_friendly_name")

It's then up to you to implement the logging and data loading behind the scenes. In addition, you can skip sensitive_columns and act based on retention time (both available in the dataset info table). This requires quite some effort; a minimal sketch follows below.

  • You can always extend this table into a more advanced schema, adding extra information about pipelines, dependencies, etc. (see points 2 and 4 above).
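
A minimal sketch of what such a wrapper could look like; the governance.dataset_info table, its columns and the print-based audit logging are all assumptions, not a reference implementation:

# Minimal sketch of the DataWrapper idea described above.
from pyspark.sql import SparkSession, DataFrame, functions as F

class DataWrapper:
    def __init__(self, spark: SparkSession, info_table: str = "governance.dataset_info"):
        self.spark = spark
        self.info_table = info_table

    def _meta(self, name: str):
        # One row per dataset: dataset_friendly_name, path, zone,
        # retention_days, sensitive_columns (comma-separated), owner, ...
        return (self.spark.table(self.info_table)
                .filter(F.col("dataset_friendly_name") == name)
                .first())

    def read(self, name: str, drop_sensitive: bool = True) -> DataFrame:
        meta = self._meta(name)
        print(f"AUDIT read {name} -> {meta.path}")  # plug real audit logging in here
        df = self.spark.read.format("delta").load(meta.path)
        if drop_sensitive and meta.sensitive_columns:
            df = df.drop(*meta.sensitive_columns.split(","))
        return df

    def write(self, df: DataFrame, name: str) -> None:
        meta = self._meta(name)
        print(f"AUDIT write {name} -> {meta.path}")  # and here
        df.write.format("delta").mode("overwrite").save(meta.path)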

Hopefully you'll find something useful in my answer. It would be interesting to know which path you took.

