Azure Data Factory split file by file size


Problem Description

I have just two weeks of Azure experience. I want to split files based on size. For example, there is a table with 200k rows, and I would like to set a parameter to split that table into multiple files with a limit of 100 MB per file (if that makes sense). It would return N files depending on the table size, something like:

my.file_1ofN.csv

I was walking through the documentation, blogs, and videos, and was able to do some POCs with Azure Functions, Azure Batch, and Databricks using a Python script in my personal account. The problem is that the company doesn't let me use any of these approaches.

So I split the file using the number of partitions, but the resulting files have different sizes depending on the table and the partition.

Is there a way to accomplish this? I'm experimenting with Lookup and ForEach activities in the pipeline now, but without good results.

Any idea or clue will be welcome. Thanks!

Answer

I haven't been able to figure this out by size, but if you can get a total row count, you can use a DataFlow to output a rough approximation based on row count.

In the Pipeline:

In this example, I am reading data out of an Azure Synapse SQL Pool, so I'm running a Lookup to calculate the number of "partitions" based on 8,000,000 rows per partition:
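
As a minimal sketch of such a Lookup query, assuming a Synapse table named dbo.MySourceTable (the table name and the column alias are placeholders, not from the original answer):

    -- Divide the total row count by the target rows-per-partition (8,000,000)
    -- and round up to get the number of output partitions.
    SELECT CEILING(COUNT(*) / 8000000.0) AS PartitionCount
    FROM dbo.MySourceTable;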

I then capture the result in a variable:
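
As a sketch, assuming the Lookup activity is named LookupPartitionCount and returns a column named PartitionCount (both names are assumptions), the Set Variable expression would be along these lines:

    @string(activity('LookupPartitionCount').output.firstRow.PartitionCount)

The @string() wrapper is there because the value is held in a string variable in the pipeline, as the note below explains.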

Next, pass the variable to the DataFlow:

NOTE: the @int cast is there because DataFlow supports int parameters but pipeline variables do not, so in the pipeline the value is stored in a string variable.
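
Concretely, assuming the pipeline variable is named partitionCount, the parameter assignment on the Execute Data Flow activity would look something like:

    @int(variables('partitionCount'))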

In the DataFlow:

为"partitionCount"创建一个 int 参数,该参数将从管道中传入:

Create an int parameter for "partitionCount", which is passed in from the pipeline:

SOURCE:

在优化"选项卡中,您可以控制读取时如何对数据源进行分区.为此目的,切换到设置分区".然后根据partitionCount变量选择Round Robin:

In the Optimize tab you can control how the source the data is partitioned on read. For this purpose, switch to "Set Partitioning" and select Round Robin based on the partitionCount variable:

This will split the incoming data into X buckets based on the parameter.
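
Data flow parameters are referenced with a $ prefix, so with Round Robin selected, the "Number of partitions" setting would roughly be set to:

    $partitionCount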

SINK:

在设置"标签下,尝试使用文件名选项"设置来控制输出名称.这些选项有一些限制,因此您可能无法准确获得所需的内容:

Under the Settings tab, experiment with the "File name option" settings to control the output name. The options are a bit limited, so you may have trouble getting exactly what you want:
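
For example, if the "Pattern" option suits you, a pattern along these lines (the base name is borrowed from the question's example) numbers each partition's output with the [n] token:

    my.file_[n].csv

which should produce my.file_1.csv, my.file_2.csv, and so on, one file per partition.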

Since the data has already been partitioned at the Source, just use the default Optimize settings on the Sink:

RESULTS:

This will produce X files with a numbered naming scheme and consistent file sizes:
