Incremental load in Azure Data Lake with large file size


Problem Description

I'm designing Data Factory pipelines to load data from Azure SQL DB to Azure Data Lake.

My initial load/POC was a small subset of the data, and I was able to load it from the SQL tables to Azure DL.

Now there is a huge volume of tables (some with a billion+ rows) that I want to load from the SQL DB to Azure DL using DF.

The MS docs mention two options, i.e. watermark columns and change tracking.
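For context, the watermark pattern those docs describe amounts to something like the rough Python sketch below: remember the highest value of a modified-date column seen so far, and on each run pull only rows newer than it and write them to a new file. The use of pyodbc, the LastModifiedTime column, the connection string, and the file names are assumptions for illustration only, not something from the original post.

# Minimal sketch of the watermark pattern, assuming a LastModifiedTime column
# on the source table. All names below are hypothetical.
import csv
import pyodbc

CONN_STR = ("Driver={ODBC Driver 17 for SQL Server};"
            "Server=myserver.database.windows.net;Database=mydb;"
            "Uid=<user>;Pwd=<password>")              # hypothetical
WATERMARK_FILE = "cust_transaction.watermark"         # stores the last loaded LastModifiedTime

def read_watermark():
    try:
        with open(WATERMARK_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return "1900-01-01T00:00:00"                   # first run: load everything

def incremental_load():
    old_wm = read_watermark()
    conn = pyodbc.connect(CONN_STR)
    cur = conn.cursor()

    # Pull only the rows changed since the previous watermark.
    cur.execute(
        "SELECT * FROM dbo.cust_transaction "
        "WHERE LastModifiedTime > ? ORDER BY LastModifiedTime",
        old_wm,
    )
    rows = cur.fetchall()
    if not rows:
        return

    # Write the delta to its own file instead of appending to one huge file.
    new_wm = max(str(r.LastModifiedTime) for r in rows)
    out_name = "cust_transaction_" + new_wm.replace(":", "").replace("-", "") + ".txt"
    with open(out_name, "w", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow([col[0] for col in cur.description])
        writer.writerows(rows)

    # Persist the new watermark only after the delta has landed successfully.
    with open(WATERMARK_FILE, "w") as f:
        f.write(new_wm)

if __name__ == "__main__":
    incremental_load()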

Let's say I have a "cust_transaction" table that has millions of rows; if I load it to DL, it loads as "cust_transaction.txt".

Questions:

1) What would be an optimal design to incrementally load the source data from SQL DB into that file in the data lake?

2) How do I split or partition the files into smaller files?

3) How should I merge and load the deltas from the source data into the files?

Thanks.

M. Chowdhury

Recommended Answer

If you want to partition the current table, and later the incoming data, a tumbling window trigger will be a good fit for your pipeline: the data in your big table will be partitioned automatically into serial, sequential time windows, and each slice will then be copied to a separate file in the data lake. You can take a look at the document below.

https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger
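
To make the mechanics concrete, here is a rough Python sketch of what that partitioning amounts to. In ADF the tumbling window trigger supplies the WindowStart/WindowEnd values to the pipeline's copy activity; this sketch simply loops over the windows itself. The use of pyodbc, the TransactionTime column, and the output file names are assumptions for illustration, not part of the linked document.

# Minimal sketch of tumbling-window style partitioning: each fixed-size,
# non-overlapping window selects only the rows that fall inside it and writes
# them to a separate file. All names below are hypothetical.
from datetime import datetime, timedelta
import csv
import pyodbc

CONN_STR = ("Driver={ODBC Driver 17 for SQL Server};"
            "Server=myserver.database.windows.net;Database=mydb;"
            "Uid=<user>;Pwd=<password>")              # hypothetical

def copy_window(cur, window_start, window_end):
    # One serial, sequential time slice of the big table.
    cur.execute(
        "SELECT * FROM dbo.cust_transaction "
        "WHERE TransactionTime >= ? AND TransactionTime < ?",
        window_start, window_end,
    )
    rows = cur.fetchall()
    if not rows:
        return
    # One output file per window keeps individual files small and replayable.
    out_name = "cust_transaction_{:%Y%m%d}.txt".format(window_start)
    with open(out_name, "w", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow([col[0] for col in cur.description])
        writer.writerows(rows)

def run(start, end, interval=timedelta(days=1)):
    conn = pyodbc.connect(CONN_STR)
    cur = conn.cursor()
    window_start = start
    while window_start < end:
        copy_window(cur, window_start, min(window_start + interval, end))
        window_start += interval

if __name__ == "__main__":
    run(datetime(2019, 1, 1), datetime(2019, 2, 1))

With the trigger in place, a failed window can be rerun on its own without touching the other files, which also answers how the output stays split into smaller, manageable pieces.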


