Incremental load in Azure Data Lake with large file size
Problem description
I'm designing Data Factory pipelines to load data from Azure SQL DB to Azure Data Lake.
My initial load/POC was a small subset of data, and I was able to load it from SQL tables to Azure DL.
Now there is a huge number of tables (some with a billion-plus rows) that I want to load from SQL DB to Azure DL using DF.
The MS docs mention two options, i.e. watermark columns and change tracking.
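As I understand the watermark option, it means remembering the highest value of a change-tracking column (e.g. a last-modified timestamp) copied so far and selecting only rows above that value on the next run; in Data Factory this is usually a Lookup activity that reads the old watermark plus a Copy activity whose source query filters on it. A minimal Python sketch of the pattern, where the ModifiedDate column, the dbo.watermark bookkeeping table, and the connection details are all made up for illustration:

```python
# Sketch only: ModifiedDate, dbo.watermark and the connection string are
# hypothetical; the real implementation would be ADF activities, not Python.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;UID=user;PWD=secret")
cur = conn.cursor()

# 1. Read the watermark left behind by the previous run.
cur.execute("SELECT LastModifiedDate FROM dbo.watermark WHERE TableName = ?",
            "cust_transaction")
last_watermark = cur.fetchone()[0]

# 2. Select only the delta, i.e. rows changed since that watermark.
cur.execute("SELECT * FROM dbo.cust_transaction WHERE ModifiedDate > ?",
            last_watermark)
delta_rows = cur.fetchall()

# ... copy delta_rows to the data lake here ...

# 3. Advance the watermark so the next run starts where this one stopped.
cur.execute("UPDATE dbo.watermark SET LastModifiedDate = "
            "(SELECT MAX(ModifiedDate) FROM dbo.cust_transaction) "
            "WHERE TableName = ?", "cust_transaction")
conn.commit()
```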
Let's say I have a "cust_transaction" table that has millions of rows; if I load it into the DL, it loads as "cust_transaction.txt".
Questions:
1) What would be an optimal design to incrementally load the source data from SQL DB into that file in the data lake?
2) How do I split or partition the files into smaller files?
3) How should I merge and load the deltas from source data into the files?
Thanks.
M. Chowdhury
Recommended answer
If you want to partition the current table's data, and later the incoming data as well, a tumbling window trigger will be a good fit for your pipeline: the data in your big table is partitioned automatically into serial, sequential time windows, and each window's data is then copied to a separate file in the data lake. You can take a look at the document below.
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger
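For concreteness, here is a rough Python sketch of what each tumbling-window run would effectively do. In Data Factory itself this logic lives in a Copy activity parameterized by @trigger().outputs.windowStartTime and windowEndTime rather than in Python, and the lake account, file system, table and ModifiedDate column used here are assumptions for illustration. Because every window writes its own window-named file, the one big "cust_transaction.txt" is also split into many smaller files, which addresses question 2.

```python
# Illustrative sketch of one tumbling-window copy run; names are made up.
import csv
import io
from datetime import datetime, timedelta

import pyodbc
from azure.storage.filedatalake import DataLakeServiceClient

SQL_CONN = ("DRIVER={ODBC Driver 18 for SQL Server};"
            "SERVER=myserver.database.windows.net;DATABASE=mydb;"
            "UID=user;PWD=secret")
LAKE = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="account-key-or-token")
FILESYSTEM = LAKE.get_file_system_client("raw")


def copy_window(window_start: datetime, window_end: datetime) -> None:
    """Copy one time slice of cust_transaction into its own file in the lake.

    window_start/window_end play the role of the trigger's
    @trigger().outputs.windowStartTime / windowEndTime values.
    """
    with pyodbc.connect(SQL_CONN) as conn:
        cur = conn.cursor()
        cur.execute(
            "SELECT * FROM dbo.cust_transaction "
            "WHERE ModifiedDate >= ? AND ModifiedDate < ?",
            window_start, window_end)
        rows = cur.fetchall()
        columns = [c[0] for c in cur.description]

    # Serialize the slice and write it as a separate, window-named file,
    # so the big table ends up split across many smaller files.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)
    writer.writerows(rows)

    path = f"cust_transaction/{window_start:%Y/%m/%d/%H%M}.csv"
    FILESYSTEM.get_file_client(path).upload_data(buf.getvalue(), overwrite=True)


# The trigger would fire this once per window, in order, e.g. hourly:
start = datetime(2019, 1, 1, 0)
copy_window(start, start + timedelta(hours=1))
```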