猪拉丁语 flatten 运算符的模式 [英] schema of flatten operator in pig latin

查看:28
本文介绍了猪拉丁语 flatten 运算符的模式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我最近在工作中遇到了这个问题,是关于猪的扁平化.我用一个简单的例子来表达

i recently meet this problem in my work, it's about pig flatten. i use a simple example to express it

两个文件
===文件1===
1_a
2_b
4_d

two files
===file1===
1_a
2_b
4_d

===file2(制表符分隔)===
1个
2 b
3 c

===file2 (tab seperated)===
1 a
2 b
3 c

猪脚本1:

a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
c = join a1 by num, b by num;
dump c;   -- exception java.lang.String cannot be cast to java.lang.Integer

猪脚本2:

a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
a2 = foreach a1 generate (int)num as num, ch as ch;
c = join a2 by num, b by num;
dump c;   -- exception java.lang.String cannot be cast to java.lang.Integer

猪脚本3:

a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2));
a2 = foreach a1 generate (int)$0 as num, $1 as ch;
c = join a2 by num, b by num;
dump c;   -- right

我不知道为什么脚本 1,2 是错误的,脚本 3 是正确的,我也想知道是否有更简洁的表达式来获得关系 c, thx.

i don't know why script 1,2 are wrong and script 3 right, and i also want to know is there more concise expression to get relation c, thx.

推荐答案

您不使用 PigStorage 有什么特别的原因吗?因为它可以让你的生活变得更轻松:) .

Is there any particular reason you are not using PigStorage? Because it could make life so much easier for you :) .

a = load '/file1' USING PigStorage('_') AS (num:int, char:chararray);
b = load '/file2' USING PigStorage('\t') AS (num:int, char:chararray);
c = join a by num, b by num;
dump c;

另请注意,在 file1 中,您使用下划线作为分隔符,但您将-"作为参数提供给 STRSPLIT.

Also note that, in file1 you used underscore as delimiter, but you give "-" as argument to STRSPLIT.

我在您提供的脚本上花了更多时间;脚本 1 &2 确实不起作用,脚本 3 也是这样工作的(没有额外的 foreach):

edit: I have spent some more time on the scripts you provided; script 1 & 2 indeed does not work and the script 3 also works like this (without the extra foreach):

a = load 'file1' as (str:chararry);
b = load 'file2' as (num:int, ch:chararry);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2));
c = join a1 by (int)($0), b by num;
dump c;

至于问题的根源,我会胡乱猜测并说可能与此有关(如 Pig 文档中所述)结合 pig 的运行周期优化:

As for the source of the problem, i'll take a wild guess and say it might be related to this (as stated in Pig Documentation) combined with pig's run cycle optimizations :

如果你用空的内部模式 FLATTEN 一个包,结果关系的模式为空.

If you FLATTEN a bag with empty inner schema, the schema for the resulting relation is null.

在您的情况下,我相信在运行之前,STRSPLIT 结果的架构是未知的.

In your case, I believe schema of the STRSPLIT result is unknown until runtime.

edit2:好的,这是我的理论解释:

edit2: Ok, here is my theory explained:

这是脚本 2 的完整解释输出这是脚本 3.我会把有趣的部分贴在这里.

This is the complete -explain- output for script 2 and this is for script 3. I'll just paste the interesting parts here.

|---a2: (Name: LOForEach Schema: num#288:int,ch#289:chararray)
|   |   |
|   |   (Name: LOGenerate[false,false] Schema: num#288:int,ch#289:chararray)ColumnPrune:InputUids=[288, 289]ColumnPrune:OutputUids=[288, 289]
|   |   |   |
|   |   |   (Name: Cast Type: int Uid: 288)
|   |   |   |
|   |   |   |---num:(Name: Project Type: int Uid: 288 Input: 0 Column: (*))

以上部分适用于脚本 2;见最后一行.它假设 flatten(STRSPLIT) 的输出将具有 integer 类型的第一个元素(因为您以这种方式提供了架构).但实际上 STRSPLIT 有一个 null 输出模式,它被视为 bytearray 字段;所以 flatten(STRSPLIT) 的输出实际上是 (n:bytearray, c:bytearray).因为你提供了一个模式,pig 尝试将 java 转换(到 a1 的输出)到 num 字段;失败,因为 num 实际上是一个 java String 表示为 bytearray.由于这个 java 转换失败,pig 甚至没有尝试在上面的行中进行显式转换.

Above section is for script 2; see the last line. It assumes output of flatten(STRSPLIT) will have a first element of type integer (because you provided the schema that way). But in fact STRSPLIT has a null output schema which is treated as bytearray fields; so output of flatten(STRSPLIT) is actually (n:bytearray, c:bytearray). Because you provided a schema, pig tries to make a java cast (to the output of a1) to num field; which fails as num is in fact a java String represented as bytearray. Since this java-cast fails, pig does not even try to make the explicit cast in the line above.

让我们看看脚本 3 的情况:

Let's see the situation for script 3:

|---a2: (Name: LOForEach Schema: num#85:int,ch#87:bytearray)
|   |   |
|   |   (Name: LOGenerate[false,false] Schema: num#85:int,ch#87:bytearray)ColumnPrune:InputUids=[]ColumnPrune:OutputUids=[85, 87]
|   |   |   |
|   |   |   (Name: Cast Type: int Uid: 85)
|   |   |   |
|   |   |   |---(Name: Project Type: bytearray Uid: 85 Input: 0 Column: (*))

见最后一行,这里a1的输出被正确地处理为bytearray,这里没有问题.现在看看倒数第二行;pig 尝试(并成功)进行从 bytearrayinteger 的显式转换操作.

See the last line, here output of a1 is properly treated as bytearray, no problems here. And now look at the second to last line; pig tries (and succeeds) to make an explicit cast operation from bytearray to integer.

这篇关于猪拉丁语 flatten 运算符的模式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆