Data Governance solution for Databricks, Synapse and ADLS gen2

Question

I'm new to data governance, so forgive me if the question lacks some information.

We're building a data lake & enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS gen2, Databricks and Synapse for our ETL processing, data science, ML & QA activities.

We already have about a hundred input tables and 25 TB per year. In the future we're expecting more.

The business has strong requirements inclining towards cloud-agnostic solutions. Still, they are okay with Databricks, since it's available on both AWS and Azure.

What data governance solution would fit our stack and requirements best?

I haven't used any data governance solutions yet. I like the AWS Data Lake solution, since it provides basic functionality out-of-the-box. AFAIK, Azure Data Catalog is outdated because it doesn't support ADLS gen2.

After some very quick googling, I found three options:

  1. Databricks Privacera
  2. Databricks Immuta
  3. Apache Ranger & Apache Atlas

Currently I'm not even sure whether the 3rd option has full support for our Azure stack. Moreover, it would require a much bigger development (infrastructure definition) effort. So is there any reason I should look in the Ranger/Atlas direction?

What could be the reasons to choose Privacera over Immuta, and vice versa?

Are there any other options worth evaluating?

From a data governance perspective we have done only the following things:

  1. Defined data zones in ADLS
  2. Applied encryption/obfuscation for sensitive data (as per GDPR requirements)
  3. Implemented row-level security (RLS) at the Synapse and Power BI layers
  4. Built a custom audit framework that logs what is persisted and when

To do:

  1. Data lineage and a single source of truth. Even 4 months after the start, understanding the dependencies between datasets has become a pain point. Lineage information is stored inside Confluence, which is hard to maintain and keep updated in multiple places; even now it is outdated in some places.
  2. Security. In the future, business users may do some data exploration in Databricks notebooks, so we need RLS for Databricks.
  3. Data lifecycle management.
  4. Probably other data-governance-related things, such as data quality, etc.

Answer

I am currently exploring Immuta and Privacera, so I can't yet comment in detail on the differences between the two. So far, Immuta has given me the better impression with its elegant policy-based setup.

Still, there are ways to solve some of the issues you mentioned above without buying an external component:

1. Security

  • For RLS, consider using Table ACLs and giving access only to certain Hive views (see the sketch below).
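A minimal sketch of that pattern, assuming a hypothetical sales table, a user_region mapping table, and a data-analysts group:

# Sketch: RLS via a Hive view plus Table ACLs (all names are hypothetical).
# Each user only sees the rows whose region is mapped to their login.
spark.sql("""
    CREATE OR REPLACE VIEW sales_rls AS
    SELECT s.*
    FROM sales AS s
    JOIN user_region AS m            -- maps logins to allowed regions
      ON s.region = m.region
    WHERE m.login = current_user()   -- row filter evaluated per user
""")

# Grant access to the filtered view only, never to the underlying table.
spark.sql("GRANT SELECT ON TABLE sales_rls TO `data-analysts`")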

For getting access to the data inside ADLS, look at enabling credential passthrough on clusters. Unfortunately, that disables Scala.
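For reference, a hedged sketch of a cluster definition with passthrough enabled, sent to the Clusters REST API; the workspace URL, token, runtime version and node sizes are placeholders:

# Sketch: creating a high-concurrency cluster with credential passthrough.
import requests

payload = {
    "cluster_name": "passthrough-cluster",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "spark_conf": {
        "spark.databricks.cluster.profile": "serverless",
        # Passthrough clusters only allow Python and SQL -- no Scala.
        "spark.databricks.repl.allowedLanguages": "python,sql",
        "spark.databricks.passthrough.enabled": "true",
    },
}
requests.post(
    "https://<workspace>.azuredatabricks.net/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <token>"},
    json=payload,
)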

You still need to set up permissions on Azure Data Lake Gen 2, which is an awful experience when granting permissions on existing child items.
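If you have to script it anyway, the Python SDK at least automates the recursive part; a sketch with azure-storage-file-datalake, where the account, container, path and group object ID are placeholders:

# Sketch: recursively applying an ACL to existing child items in ADLS Gen2.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = (service.get_file_system_client("datalake")
                    .get_directory_client("silver/sales"))

# Read+execute for a group on the directory, all existing children,
# and (via the default entry) anything created later.
directory.update_access_control_recursive(
    acl="group:<group-object-id>:r-x,default:group:<group-object-id>:r-x"
)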

Please avoid creating dataset copies with column/row subsets, as data duplication is never a good idea.

2. Lineage

  • One option would be to look into Apache Atlas & Spline. Here is one example of how to set this up: https://medium.com/@reenugrewal/data-lineage-tracking-using-spline-on-atlas-via-event-hub-6816be0fd5c7
  • Unfortunately, Spline is still under development; even reproducing the setup mentioned in the article is not straightforward. The good news is that Apache Atlas 3.0 has many available type definitions for Azure Data Lake Gen 2 and other sources.
  • In a few projects, I ended up creating custom logging of reads/writes (it seems like you went down this path as well). Based on these logs, I created a Power BI report to visualize the lineage (see the sketch after this list).
  • Consider using Azure Data Factory for orchestration. With a proper ADF pipeline structure, you get high-level lineage, can see dependencies, and can rerun failed activities. You can read a bit more here: https://mrpaulandrew.com/2020/07/01/adf-procfwk-v1-8-complete-pipeline-dependency-chains-for-failure-handling/
  • Take a look at Marquez (https://marquezproject.github.io/marquez/), a small open-source project with some nice features, including data lineage.
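A minimal sketch of that custom logging approach; the governance.lineage_log table and the function names are assumptions, not an existing API:

# Sketch: append one lineage edge per read/write; a Power BI report can
# then visualize the resulting graph.
from datetime import datetime

def log_lineage(spark, job, source, destination):
    row = [(job, source, destination, datetime.utcnow())]
    schema = "job string, source string, destination string, logged_at timestamp"
    (spark.createDataFrame(row, schema)
          .write.format("delta").mode("append")
          .saveAsTable("governance.lineage_log"))

def read_tracked(spark, job, path):
    log_lineage(spark, job, source=path, destination=None)
    return spark.read.format("delta").load(path)

def write_tracked(spark, df, job, source, path):
    log_lineage(spark, job, source=source, destination=path)
    df.write.format("delta").mode("append").save(path)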

3. Data quality

  • Currently only looking into Amazon Deequ - it's Scala, but it comes with some nice predefined data quality checks.
  • In many projects we ended up writing integration tests that check the data quality between bronze (raw) and silver (standardized). Nothing fancy, pure PySpark; a minimal sketch follows this list.
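That kind of check can be as simple as the following (table and column names are made up):

# Sketch: pure-PySpark bronze-to-silver consistency checks.
from pyspark.sql import functions as F

def check_bronze_to_silver(spark, bronze_table, silver_table, key_col):
    bronze = spark.table(bronze_table)
    silver = spark.table(silver_table)

    errors = []
    if bronze.count() != silver.count():   # no silently dropped rows
        errors.append("row count mismatch")
    if silver.filter(F.col(key_col).isNull()).count() > 0:
        errors.append(f"null {key_col} in silver")
    if silver.select(key_col).distinct().count() != silver.count():
        errors.append(f"duplicate {key_col} in silver")

    assert not errors, f"{silver_table} failed: {errors}"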

4. Data lifecycle management

  • One option is to use the storage account's native data lake lifecycle management. That's not a viable option for data stored behind Delta/Parquet formats, though.

If you use the Delta format, you can apply retention or pseudonymization more easily.
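For instance, a sketch of both operations on a hypothetical silver.events Delta table (the column names, retention window and salt are illustrative):

# Sketch: retention and pseudonymization on a Delta table.
from delta.tables import DeltaTable

events = DeltaTable.forName(spark, "silver.events")

# Retention: remove records outside the allowed window.
events.delete("event_date < current_date() - INTERVAL 365 DAYS")

# Pseudonymization: replace a direct identifier with a salted hash.
events.update(
    condition="email IS NOT NULL",
    set={"email": "sha2(concat(email, '<per-project-salt>'), 256)"},
)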

As a second option, imagine that you have a table with information about all datasets (dataset_friendly_name, path, retention time, zone, sensitive_columns, owner, etc.). Your Databricks users then use a small wrapper to read/write:

DataWrapper.Read("dataset_friendly_name")
DataWrapper.Write("destination_dataset_friendly_name")

It's then up to you to implement the logging and data loading behind the scenes. In addition, you can skip sensitive_columns or act based on retention time (both available in the dataset info table). This requires quite some effort; a possible shape for such a wrapper is sketched below.
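A hedged sketch of what such a wrapper could look like (method names follow Python conventions here; the dataset-info columns are the ones suggested above, and governance.dataset_info is a placeholder):

# Sketch: metadata-driven read/write wrapper around a dataset-info table.
from pyspark.sql import functions as F

class DataWrapper:
    def __init__(self, spark, info_table="governance.dataset_info"):
        self.spark = spark
        self.info_table = info_table

    def _info(self, name):
        return (self.spark.table(self.info_table)
                    .filter(F.col("dataset_friendly_name") == name)
                    .first())

    def read(self, name):
        info = self._info(name)
        df = self.spark.read.format("delta").load(info["path"])
        for col in (info["sensitive_columns"] or []):
            df = df.drop(col)   # hide sensitive columns by default
        # log the read here for audit/lineage (see section 2)
        return df

    def write(self, df, name):
        info = self._info(name)
        # enforce retention, zone rules, etc. here based on `info`
        df.write.format("delta").mode("append").save(info["path"])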

  • You can always extend this table into a more advanced schema, adding extra information about pipelines, dependencies, etc. (see 2.4).

Hopefully you find something useful in my answer. It would be interesting to know which path you took.
