Pig - how to iterate on a bag of maps


Problem description


Let me explain the problem. I have this line of code:

u = FOREACH persons GENERATE FLATTEN($0#'experiences') as j;
dump u;

which produces this output:

([id#1,date_begin#12 2012,description#blabla,date_end#04 2013],[id#2,date_begin#02 2011,description#blabla2,date_end#04 2013])
([id#1,date_begin#12 2011,description#blabla3,date_end#04 2012],[id#2,date_begin#02 2010,description#blabla4,date_end#04 2011])

Then, when I do this:

p = foreach u generate j#'id', j#'description';
dump p;

I have this output:

(1,blabla)
(1,blabla3)

But that's not what I wanted. I would like to have an output like this:

(1,blabla)
(2,blabla2)
(1,blabla3)
(2,blabla4)

How can I get this output?

Thank you very much.

Solution

I'm assuming that the $0 you are FLATTENing in u is a tuple.

The overall problem is that j is only referencing the first map in the tuple. In order to get the output you want, you'll have to convert each tuple into a bag, then FLATTEN it.

If you know that each tuple will have up to two maps, you can do:

-- My B is your u
B = FOREACH A GENERATE (tuple(map[],map[]))$0#'experiences' AS T ;
B2 = FOREACH B GENERATE FLATTEN(TOBAG(T.$0, T.$1)) AS j ;

C = foreach B2 generate j#'id', j#'description' ;
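To see why TOBAG followed by FLATTEN yields one row per map, here is a rough model of the data flow in plain Python. It is not Pig code; the `rows` data is taken from the question's sample output, trimmed to the two fields of interest, and the variable names are made up for illustration.

```python
# Plain-Python model of the Pig data flow; this is not Pig code.
# Each row holds one tuple of maps, as in the dump of `u`.
rows = [
    ({"id": "1", "description": "blabla"}, {"id": "2", "description": "blabla2"}),
    ({"id": "1", "description": "blabla3"}, {"id": "2", "description": "blabla4"}),
]

# What the question observed: only the first map of each tuple is used.
first_only = [(t[0]["id"], t[0]["description"]) for t in rows]

# TOBAG(T.$0, T.$1) followed by FLATTEN: one output row per map.
flattened = [(m["id"], m["description"]) for t in rows for m in t]

print(first_only)
print(flattened)
```

The second list has four entries, one per map, which is the output the question asks for.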

If you don't know how many fields will be in the tuple, then this will be much harder.


NOTE: This works for pig 0.10.

For tuples with an undefined number of maps, the best answer I can think of is using a UDF to parse the bytearray:

myudf.py

@outputSchema('vals: {(val:map[])}')
def foo(the_input):
    # This converts the indeterminate number of maps into a bag.
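    # Decode the bytearray into a string and drop the outer parentheses.
    text = ''.join(chr(i) for i in the_input).strip('()')
    out = []
    # Maps are separated by '],[', entries by ',', and keys from values by '#'.
    for f in text.split('],['):
        f = f.strip('[]')
        out.append(dict(i.split('#') for i in f.split(',')))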
    return out

myscript.pig

register 'myudf.py' using jython as myudf ;
B = FOREACH A GENERATE FLATTEN($0#'experiences') ;

T1 = FOREACH B GENERATE FLATTEN(myudf.foo($0)) AS M ;
T2 = FOREACH T1 GENERATE M#'id', M#'description' ;

However, this relies on #, a comma, and ],[ never appearing in any of the keys or values in the maps.
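The delimiter caveat can be checked outside Pig with a standalone sketch of the same splitting logic. The function name `parse_maps` and the sample strings are hypothetical, and the input is assumed to be a plain string rather than the bytearray Pig 0.10 passes in.

```python
def parse_maps(text):
    """Split a dumped tuple of maps such as '([a#1,b#2],[a#3,b#4])' into dicts."""
    body = text.strip('()')
    out = []
    for chunk in body.split('],['):
        chunk = chunk.strip('[]')
        out.append(dict(pair.split('#') for pair in chunk.split(',')))
    return out

good = parse_maps('([id#1,description#blabla],[id#2,description#blabla2])')
print(good)

# A comma inside a value breaks the key/value split.
try:
    parse_maps('([id#1,description#hello, world])')
except ValueError as err:
    print('parse failed:', err)
```

The second call fails because the text after the embedded comma has no # to split on, so real data with free-text descriptions would need a more careful parser.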


NOTE: This works for pig 0.11.

It seems that the way Pig handles the input to Python UDFs changed in this version. Instead of foo receiving a bytearray, the bytearray is automatically converted to the appropriate type. That makes everything much easier:

myudf.py

@outputSchema('vals: {(val:map[])}')
def foo(the_input):
    # This converts the indeterminate number of maps into a bag.
    out = []
    for m in the_input:  # `m` avoids shadowing the built-in map()
        out.append(m)
    return out

myscript.pig

register 'myudf.py' using jython as myudf ;

-- This time you should pass in the entire tuple.
B = FOREACH A GENERATE $0#'experiences' ;

T1 = FOREACH B GENERATE FLATTEN(myudf.foo($0)) AS M ;
T2 = FOREACH T1 GENERATE M#'id', M#'description' ;
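To try the 0.11-style UDF without running Pig, one option is to stub out the decorator. This is a sketch under two assumptions: the `outputSchema` defined here is only a stand-in for the one Pig makes available to Jython UDFs, and the input is assumed to arrive as a tuple of dicts.

```python
# Stand-in for the decorator Pig provides to Jython UDFs; an assumption
# made so the function can be exercised outside Pig.
def outputSchema(schema):
    def wrap(func):
        func.output_schema = schema
        return func
    return wrap

@outputSchema('vals: {(val:map[])}')
def foo(the_input):
    # Copy each map from the incoming tuple into a list (a bag in Pig).
    return [m for m in the_input]

# Hypothetical sample input: one tuple holding two maps.
sample = ({"id": "1", "description": "blabla"}, {"id": "2", "description": "blabla2"})
result = foo(sample)
print(result)
```

The stub must not be shipped in the real myudf.py; it exists only for local experiments.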
