猪拉丁语中flatten运算符的模式 [英] schema of flatten operator in pig latin
问题描述
我最近在工作中遇到了这个问题,这是关于猪的扁平化.我用一个简单的例子来表达它
i recently meet this problem in my work, it's about pig flatten. i use a simple example to express it
两个文件
===文件1 ===
1_a
2_b
4_d
two files
===file1===
1_a
2_b
4_d
=== file2(制表符分隔)===
1个
2 b
3 c
===file2 (tab seperated)===
1 a
2 b
3 c
猪脚本1:
a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
c = join a1 by num, b by num;
dump c; -- exception java.lang.String cannot be cast to java.lang.Integer
猪脚本2:
a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
a2 = foreach a1 generate (int)num as num, ch as ch;
c = join a2 by num, b by num;
dump c; -- exception java.lang.String cannot be cast to java.lang.Integer
猪脚本3:
a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2));
a2 = foreach a1 generate (int)$0 as num, $1 as ch;
c = join a2 by num, b by num;
dump c; -- right
我不知道为什么脚本1,2是错误的,而脚本3是正确的,我也想知道是否存在更简洁的表达式来获得关系c,thx.
i don't know why script 1,2 are wrong and script 3 right, and i also want to know is there more concise expression to get relation c, thx.
推荐答案
您是否有任何特定原因不使用PigStorage?因为它可以使您的生活更加轻松:).
Is there any particular reason you are not using PigStorage? Because it could make life so much easier for you :) .
a = load '/file1' USING PigStorage('_') AS (num:int, char:chararray);
b = load '/file2' USING PigStorage('\t') AS (num:int, char:chararray);
c = join a by num, b by num;
dump c;
还请注意,在file1中,您使用下划线作为分隔符,但将STRSPLIT用作参数.
Also note that, in file1 you used underscore as delimiter, but you give "-" as argument to STRSPLIT.
修改: 我花了更多时间在您提供的脚本上.脚本1和2确实不起作用,脚本3也可以这样工作(没有多余的foreach):
edit: I have spent some more time on the scripts you provided; script 1 & 2 indeed does not work and the script 3 also works like this (without the extra foreach):
a = load 'file1' as (str:chararry);
b = load 'file2' as (num:int, ch:chararry);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2));
c = join a1 by (int)($0), b by num;
dump c;
对于问题的根源,我会做出一个疯狂的猜测,并说这可能与此有关(
As for the source of the problem, i'll take a wild guess and say it might be related to this (as stated in Pig Documentation) combined with pig's run cycle optimizations :
如果您用一个内部模式为空的手提袋装箱,则所得关系的模式为空.
If you FLATTEN a bag with empty inner schema, the schema for the resulting relation is null.
对于您而言,我认为STRSPLIT结果的模式在运行前是未知的.
In your case, I believe schema of the STRSPLIT result is unknown until runtime.
edit2: 好的,这是我的理论解释:
edit2: Ok, here is my theory explained:
This is the complete -explain- output for script 2 and this is for script 3. I'll just paste the interesting parts here.
|---a2: (Name: LOForEach Schema: num#288:int,ch#289:chararray)
| | |
| | (Name: LOGenerate[false,false] Schema: num#288:int,ch#289:chararray)ColumnPrune:InputUids=[288, 289]ColumnPrune:OutputUids=[288, 289]
| | | |
| | | (Name: Cast Type: int Uid: 288)
| | | |
| | | |---num:(Name: Project Type: int Uid: 288 Input: 0 Column: (*))
以上部分适用于脚本2;看到最后一行.假定flatten(STRSPLIT)
的输出将具有类型为integer
的第一个元素(因为您以这种方式提供了架构).但实际上STRSPLIT
具有一个null
输出模式,该模式被视为bytearray
字段;因此flatten(STRSPLIT)
的输出实际上是(n:bytearray, c:bytearray)
.因为您提供了一个架构,所以Pig试图将Java强制转换(到a1
的输出)到num
字段;失败的原因是num
实际上是表示为bytearray的java String
.由于此Java广播失败,因此Pig甚至不会尝试在上面的行中进行显式强制转换.
Above section is for script 2; see the last line. It assumes output of flatten(STRSPLIT)
will have a first element of type integer
(because you provided the schema that way). But in fact STRSPLIT
has a null
output schema which is treated as bytearray
fields; so output of flatten(STRSPLIT)
is actually (n:bytearray, c:bytearray)
. Because you provided a schema, pig tries to make a java cast (to the output of a1
) to num
field; which fails as num
is in fact a java String
represented as bytearray. Since this java-cast fails, pig does not even try to make the explicit cast in the line above.
让我们看一下脚本3的情况:
Let's see the situation for script 3:
|---a2: (Name: LOForEach Schema: num#85:int,ch#87:bytearray)
| | |
| | (Name: LOGenerate[false,false] Schema: num#85:int,ch#87:bytearray)ColumnPrune:InputUids=[]ColumnPrune:OutputUids=[85, 87]
| | | |
| | | (Name: Cast Type: int Uid: 85)
| | | |
| | | |---(Name: Project Type: bytearray Uid: 85 Input: 0 Column: (*))
请参阅最后一行,此处a1
的输出被正确地视为bytearray
,这里没有问题.现在看倒数第二行;猪尝试(并成功)进行从bytearray
到integer
的显式强制转换操作.
See the last line, here output of a1
is properly treated as bytearray
, no problems here. And now look at the second to last line; pig tries (and succeeds) to make an explicit cast operation from bytearray
to integer
.
这篇关于猪拉丁语中flatten运算符的模式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!