U-SQL Output in Azure Data Lake

Problem description

Would it be possible to automatically split a table into several files based on column values if I don't know how many different key values the table contains? Is it possible to put the key value into the filename?

Solution

This is our top ask (and it has been asked on Stack Overflow before, too :). We are currently working on it and hope to have it available by summer.

Until then, you have to write a script generator. I tend to use U-SQL to generate the script, but you could also do it with PowerShell, T4, etc.

Here is an example:

Let's assume you want to write files for the column name in the following table/rowset @x:

name | value1 | value2
-----+--------+-------
A    | 10     | 20
A    | 11     | 21
B    | 10     | 30
B    | 100    | 200

You would write a script that generates the final script, like the following:

@x = SELECT * FROM (VALUES ("A", 10, 20), ("A", 11, 21), ("B", 10, 30), ("B", 100, 200)) AS T(name, value1, value2);

// Generate the script to do partitioned output based on the name column.
// The \" escapes put literal double quotes into the generated statements.

@stmts =
  SELECT "OUTPUT (SELECT value1, value2 FROM @x WHERE name == \"" + name + "\") TO \"/output/" + name + ".csv\" USING Outputters.Csv();" AS output
  FROM (SELECT DISTINCT name FROM @x) AS x;

OUTPUT @stmts TO "/output/genscript.usql"
USING Outputters.Text(delimiter:' ', quoting:false);

Then you take genscript.usql, prepend the calculation of @x, and submit it to get the data partitioned into two files, one per distinct name (/output/A.csv and /output/B.csv).
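Since the generator does not have to be written in U-SQL, here is a minimal sketch of the same idea in Python, standing in for the PowerShell or T4 option mentioned above. It assumes the distinct key values have already been collected somewhere (for example from an earlier query). The helper names usql_string, output_statement and generate_script are hypothetical, and the sketch does not handle key values that are invalid in file names.

```python
def usql_string(value: str) -> str:
    """Render value as a double-quoted, C#-style string literal for U-SQL."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def output_statement(name: str, output_dir: str = "/output") -> str:
    """Build one OUTPUT statement that writes the rows for a single key."""
    return (
        "OUTPUT (SELECT value1, value2 FROM @x WHERE name == "
        + usql_string(name)
        + ") TO "
        + usql_string(f"{output_dir}/{name}.csv")
        + " USING Outputters.Csv();"
    )


def generate_script(preamble: str, names: list[str]) -> str:
    """Prepend the @x calculation, then add one OUTPUT per distinct key."""
    distinct = sorted(set(names))
    lines = [preamble.strip(), ""]
    lines += [output_statement(n) for n in distinct]
    return "\n".join(lines) + "\n"


# The same @x rowset as in the U-SQL example above.
PREAMBLE = (
    '@x = SELECT * FROM (VALUES("A", 10, 20), ("A", 11, 21), '
    '("B", 10, 30), ("B", 100, 200)) AS T(name, value1, value2);'
)

if __name__ == "__main__":
    # Prints the complete script that would then be submitted as a U-SQL job.
    print(generate_script(PREAMBLE, ["A", "A", "B", "B"]))
```

This version also does the "prepend the calculation of @x" step, so the printed text is the whole script to submit rather than only the OUTPUT statements.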
