How do I partition a large file into files/directories using only U-SQL and certain fields in the file?

Question

I have an extremely large CSV in which each row contains customer and store IDs, along with transaction information. The current test file is around 40 GB (about 2 days' worth), so partitioning is an absolute must for any reasonable return time on select queries.

My question is this: when we receive a file, it contains data for multiple stores. I would like to use the "virtual column" functionality to split this file into the corresponding directory structure, which is "/Data/{CustomerId}/{StoreID}/file.csv".

I haven't yet gotten it to work with the OUTPUT statement. The statement I used was:

// Output to file
OUTPUT @dt
TO @"/Data/{CustomerNumber}/{StoreNumber}/PosData.csv"
USING Outputters.Csv();

It gives the following error:

Bad request. Invalid pathname. Cosmos Path: adl://<obfuscated>.azuredatalakestore.net/Data/{0}/{1}/68cde242-60e3-4034-b3a2-1e14a5f7343d

Has anyone attempted the same kind of thing? I tried to concatenate the output path from the fields, but that was a no-go. I thought about doing it as a function (UDF) that takes the two IDs and filters the whole dataset, but that seems terribly inefficient.

Thanks in advance for reading/responding!

Recommended answer

Currently U-SQL requires that all the file outputs of a script must be understood at compile time. In other words, the output files cannot be created based on the input data.

Dynamic outputs based on data are something we are actively working on, for release sometime later in 2017.

In the meantime, until the dynamic output feature is available, the pattern to accomplish what you want requires two scripts.

The first script will use GROUP BY to identify all the unique combinations of CustomerNumber and StoreNumber and write that to a file.

Then, using a script or a tool written with our SDKs, download that output file and programmatically create a second U-SQL script that has an explicit OUTPUT statement for each pair of CustomerNumber and StoreNumber.
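As a rough illustration of that generation step, here is a minimal Python sketch. It is not the answerer's tooling, and it makes several assumptions: the first script's output is a two-column CSV of CustomerNumber and StoreNumber, both columns are strings in the rowset, and the rowset is named @dt as in the question. The function names `read_pairs` and `build_output_statements` are hypothetical, and the ID check is just a conservative guard so that values are safe inside a path and a string literal.

```python
import csv
import io
import re

# Hypothetical guard: accept only IDs that are safe in a path and a string literal.
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def read_pairs(csv_text):
    """Parse the first script's output: one CustomerNumber,StoreNumber row per line."""
    pairs = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        customer, store = row[0].strip(), row[1].strip()
        for value in (customer, store):
            if not _SAFE_ID.fullmatch(value):
                raise ValueError(f"unsafe id: {value!r}")
        pairs.append((customer, store))
    return pairs


def build_output_statements(pairs, rowset="@dt"):
    """Return U-SQL text with one filtered rowset and one OUTPUT statement per pair.

    Assumes CustomerNumber and StoreNumber are string columns of `rowset`.
    """
    blocks = []
    for i, (customer, store) in enumerate(pairs):
        blocks.append(
            f'@part{i} =\n'
            f'    SELECT *\n'
            f'    FROM {rowset}\n'
            f'    WHERE CustomerNumber == "{customer}" AND StoreNumber == "{store}";\n'
            f'\n'
            f'OUTPUT @part{i}\n'
            f'TO @"/Data/{customer}/{store}/PosData.csv"\n'
            f'USING Outputters.Csv();\n'
        )
    return "\n".join(blocks)
```

The generated text still needs the EXTRACT statement that defines @dt in front of it before it can be submitted as the second script. Every output path is a literal in the generated script, so it is known at compile time, which is what the answer says U-SQL currently requires.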
