Incremental load in Azure Data Lake with large file size
Problem description
I'm designing Data Factory pipelines to load data from Azure SQL DB to Azure Data Lake.
My initial load/POC was a small subset of data, and I was able to load it from SQL tables to Azure DL.
Now there is a huge number of tables (some with a billion-plus rows) that I want to load from SQL DB to Azure DL using DF.
The MS docs mention two options, i.e. watermark columns and change tracking.
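As I understand the watermark option, it means remembering the highest value of a change-tracking column (e.g. a last-modified timestamp) copied so far and selecting only rows above that value on the next run; in Data Factory this is usually a Lookup activity that reads the old watermark plus a Copy activity whose source query filters on it. A minimal Python sketch of the pattern, where the ModifiedDate column, the dbo.watermark bookkeeping table, and the connection details are all made up for illustration:

```python
# Sketch only: ModifiedDate, dbo.watermark and the connection string are
# hypothetical; the real implementation would be ADF activities, not Python.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;UID=user;PWD=secret")
cur = conn.cursor()

# 1. Read the watermark left behind by the previous run.
cur.execute("SELECT LastModifiedDate FROM dbo.watermark WHERE TableName = ?",
            "cust_transaction")
last_watermark = cur.fetchone()[0]

# 2. Select only the delta, i.e. rows changed since that watermark.
cur.execute("SELECT * FROM dbo.cust_transaction WHERE ModifiedDate > ?",
            last_watermark)
delta_rows = cur.fetchall()

# ... copy delta_rows to the data lake here ...

# 3. Advance the watermark so the next run starts where this one stopped.
cur.execute("UPDATE dbo.watermark SET LastModifiedDate = "
            "(SELECT MAX(ModifiedDate) FROM dbo.cust_transaction) "
            "WHERE TableName = ?", "cust_transaction")
conn.commit()
```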
Let's say I have a "cust_transaction" table that has millions of rows; if I load it into the DL, it loads as "cust_transaction.txt".
Questions:
1) What would be an optimal design to incrementally load the source data from SQL DB into that file in the data lake?
2) How do I split or partition the files into smaller files?
3) How should I merge and load the deltas from source data into the files?
Thanks.
M. Chowdhury
Recommended answer
If you want to partition the current table's data, and later the incoming data as well, a tumbling window trigger will be a good fit for your pipeline: the data in your big table is partitioned automatically into serial, sequential time windows, and each window's data is then copied to a separate file in the data lake. You can take a look at the document below.
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger
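For concreteness, here is a rough Python sketch of what each tumbling-window run would effectively do. In Data Factory itself this logic lives in a Copy activity parameterized by @trigger().outputs.windowStartTime and windowEndTime rather than in Python, and the lake account, file system, table and ModifiedDate column used here are assumptions for illustration. Because every window writes its own window-named file, the one big "cust_transaction.txt" is also split into many smaller files, which addresses question 2.

```python
# Illustrative sketch of one tumbling-window copy run; names are made up.
import csv
import io
from datetime import datetime, timedelta

import pyodbc
from azure.storage.filedatalake import DataLakeServiceClient

SQL_CONN = ("DRIVER={ODBC Driver 18 for SQL Server};"
            "SERVER=myserver.database.windows.net;DATABASE=mydb;"
            "UID=user;PWD=secret")
LAKE = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="account-key-or-token")
FILESYSTEM = LAKE.get_file_system_client("raw")


def copy_window(window_start: datetime, window_end: datetime) -> None:
    """Copy one time slice of cust_transaction into its own file in the lake.

    window_start/window_end play the role of the trigger's
    @trigger().outputs.windowStartTime / windowEndTime values.
    """
    with pyodbc.connect(SQL_CONN) as conn:
        cur = conn.cursor()
        cur.execute(
            "SELECT * FROM dbo.cust_transaction "
            "WHERE ModifiedDate >= ? AND ModifiedDate < ?",
            window_start, window_end)
        rows = cur.fetchall()
        columns = [c[0] for c in cur.description]

    # Serialize the slice and write it as a separate, window-named file,
    # so the big table ends up split across many smaller files.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)
    writer.writerows(rows)

    path = f"cust_transaction/{window_start:%Y/%m/%d/%H%M}.csv"
    FILESYSTEM.get_file_client(path).upload_data(buf.getvalue(), overwrite=True)


# The trigger would fire this once per window, in order, e.g. hourly:
start = datetime(2019, 1, 1, 0)
copy_window(start, start + timedelta(hours=1))
```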