Pig - how to iterate on a bag of maps
Let me explain the problem. I have this line of code:
u = FOREACH persons GENERATE FLATTEN($0#'experiences') as j;
dump u;
which produces this output:
([id#1,date_begin#12 2012,description#blabla,date_end#04 2013],[id#2,date_begin#02 2011,description#blabla2,date_end#04 2013])
([id#1,date_begin#12 2011,description#blabla3,date_end#04 2012],[id#2,date_begin#02 2010,description#blabla4,date_end#04 2011])
Then, when I do this:
p = foreach u generate j#'id', j#'description';
dump p;
I get this output:
(1,blabla)
(1,blabla3)
But that's not what I wanted. I would like to have an output like this:
(1,blabla)
(2,blabla2)
(1,blabla3)
(2,blabla4)
How can I get this output?
Thank you very much.
I'm assuming that the $0 you are FLATTENing in u is a tuple.
The overall problem is that j only references the first map in the tuple. To get the output you want, you'll have to convert each tuple into a bag, then FLATTEN it.
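To make the difference concrete, here is a plain-Python analogy (not Pig code) of the two behaviours. The `records` list is a made-up stand-in for the relation, with dicts playing the role of Pig maps; the names are illustrative only.

```python
# Plain-Python analogy, not Pig code: each record is a tuple of maps (dicts).
records = [
    ({"id": "1", "description": "blabla"}, {"id": "2", "description": "blabla2"}),
    ({"id": "1", "description": "blabla3"}, {"id": "2", "description": "blabla4"}),
]

# Flattening a tuple spreads its fields across columns of the same row,
# so a single alias such as j names only the first column.
first_only = [(rec[0]["id"], rec[0]["description"]) for rec in records]

# Treating each tuple as a bag and flattening it yields one row per map.
one_row_per_map = [(m["id"], m["description"]) for rec in records for m in rec]

print(first_only)
print(one_row_per_map)
```

The second list has one entry per map, which is the shape the question asks for.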
If you know that each tuple will have up to two maps, you can do:
-- My B is your u
B = FOREACH A GENERATE (tuple(map[],map[]))$0#'experiences' AS T ;
B2 = FOREACH B GENERATE FLATTEN(TOBAG(T.$0, T.$1)) AS j ;
C = foreach B2 generate j#'id', j#'description' ;
If you don't know how many fields will be in the tuple, then this will be much harder.
NOTE: This works for pig 0.10.
For tuples with an undefined number of maps, the best answer I can think of is using a UDF to parse the bytearray:
myudf.py
@outputSchema('vals: {(val:map[])}')
def foo(the_input):
    # This converts the indeterminate number of maps into a bag.
    foo = [chr(i) for i in the_input]
    foo = ''.join(foo).strip('()')
    out = []
    for f in foo.split('],['):
        f = f.strip('[]')
        out.append(dict((k, v) for k, v in [ i.split('#') for i in f.split(',')]))
    return out
myscript.pig
register 'myudf.py' using jython as myudf ;
B = FOREACH A GENERATE FLATTEN($0#'experiences') ;
T1 = FOREACH B GENERATE FLATTEN(myudf.foo($0)) AS M ;
T2 = FOREACH T1 GENERATE M#'id', M#'description' ;
However, this relies on the fact that '#', ',' and '],[' will not appear in any of the keys or values in the map.
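The delimiter assumption can be checked outside Pig with a small sketch of the same splitting logic applied to an ordinary string. The helper name `parse_maps` and the sample strings are hypothetical, chosen only for illustration; inside Pig the input arrives as a bytearray rather than a str.

```python
def parse_maps(text):
    """Split a string like '([k#v,...],[k#v,...])' into a list of dicts."""
    out = []
    for chunk in text.strip("()").split("],["):
        chunk = chunk.strip("[]")
        # Each 'key#value' pair must split into exactly two parts.
        out.append(dict(pair.split("#") for pair in chunk.split(",")))
    return out


sample = "([id#1,description#blabla],[id#2,description#blabla2])"
parsed = parse_maps(sample)

# A value that contains a comma breaks the naive split.
try:
    parse_maps("([id#1,description#bla,bla])")
    comma_ok = True
except ValueError:
    comma_ok = False
```

The second call fails because the stray `bla` fragment has no `#`, so it cannot be turned into a key/value pair.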
NOTE: This works for pig 0.11.
So it seems that the way pig handles the input to Python UDFs changed in this version. Instead of foo receiving a bytearray, the input is automatically converted to the appropriate type, which makes everything much easier:
myudf.py
@outputSchema('vals: {(val:map[])}')
def foo(the_input):
    # This converts the indeterminate number of maps into a bag.
    out = []
    # Named m rather than map to avoid shadowing the Python builtin.
    for m in the_input:
        out.append(m)
    return out
myscript.pig
register 'myudf.py' using jython as myudf ;
-- This time you should pass in the entire tuple.
B = FOREACH A GENERATE $0#'experiences' ;
T1 = FOREACH B GENERATE FLATTEN(myudf.foo($0)) AS M ;
T2 = FOREACH T1 GENERATE M#'id', M#'description' ;
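If you want to try the 0.11-style UDF without a cluster, one option is to run it in plain Python with a stand-in decorator. This is only a sketch: the `outputSchema` defined here is a hypothetical local substitute for the one Pig makes available when the script is registered, and the `experiences` tuple is made-up sample data.

```python
def outputSchema(schema):
    """Local stand-in for Pig's decorator; it only records the schema string."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema('vals: {(val:map[])}')
def foo(the_input):
    # Turn a tuple holding any number of maps into a list (a bag in Pig).
    return [m for m in the_input]


# Made-up sample standing in for one $0#'experiences' tuple.
experiences = (
    {"id": "1", "description": "blabla"},
    {"id": "2", "description": "blabla2"},
)

rows = [(m["id"], m["description"]) for m in foo(experiences)]
print(rows)
```

This only exercises the Python side; how Pig converts the tuple before calling foo still has to be confirmed on your Pig version.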