Pig - how to iterate on a bag of maps

Views: 103. Published: 2018/5/31 20:02:03. Tags: hadoop, bigdata, apache-pig

Problem description

Let me explain the problem. I have this line of code:

    u = FOREACH persons GENERATE FLATTEN($0#'experiences') as j;
    dump u;

which produces this output:

    ([id#1,date_begin#12 2012,description#blabla,date_end#04 2013],[id#2,date_begin#02 2011,description#blabla2,date_end#04 2013])
    ([id#1,date_begin#12 2011,description#blabla3,date_end#04 2012],[id#2,date_begin#02 2010,description#blabla4,date_end#04 2011])

Then, when I do this:

    p = foreach u generate j#'id', j#'description';
    dump p;

I get this output:

    (1,blabla)
    (1,blabla3)

But that's not what I wanted. I would like to have an output like this:

    (1,blabla)
    (2,blabla2)
    (1,blabla3)
    (2,blabla4)

How can I get this?

Thank you very much.
Solution

I'm assuming that the $0 you are FLATTENing in u is a tuple.

The overall problem is that j is only referencing the first map in the tuple. In order to get the output you want, you'll have to convert each tuple into a bag, then FLATTEN it.
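To make the difference concrete, here is a minimal plain-Python model of the two behaviours. It is only a sketch and not Pig itself: the names rows and project are made up for illustration, and each Pig map is modelled as a Python dict holding just the two keys of interest.

```python
# Two input rows, each a tuple of maps (modelled here as Python dicts).
rows = [
    ({"id": "1", "description": "blabla"}, {"id": "2", "description": "blabla2"}),
    ({"id": "1", "description": "blabla3"}, {"id": "2", "description": "blabla4"}),
]


def project(m):
    """Pick the two keys of interest out of one map."""
    return (m["id"], m["description"])


# Flattening a tuple spreads its fields across columns of the same row,
# so an alias on the result only sees the first map.
first_only = [project(row[0]) for row in rows]

# Turning the tuple into a bag and flattening it gives one row per map.
per_map = [project(m) for row in rows for m in row]

print(first_only)
print(per_map)
```

The first list corresponds to the output in the question, the second to the output the question asks for.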

If you know that each tuple will have up to two maps, you can do:

    -- My B is your u
    B = FOREACH A GENERATE (tuple(map[],map[]))$0#'experiences' AS T ;
    B2 = FOREACH B GENERATE FLATTEN(TOBAG(T.$0, T.$1)) AS j ;
    C = foreach B2 generate j#'id', j#'description' ;

If you don't know how many fields will be in the tuple, then this will be much harder.

NOTE: This works for Pig 0.10.

For tuples with an undefined number of maps, the best answer I can think of is using a UDF to parse the bytearray:

myudf.py

    @outputSchema('vals: {(val:map[])}')
    def foo(the_input):
        # This converts the indeterminate number of maps into a bag.
        # The input arrives as a bytearray, so decode it into a string first.
        text = [chr(i) for i in the_input]
        text = ''.join(text).strip('()')
        out = []
        for f in text.split('],['):
            f = f.strip('[]')
            # Entries are separated by ',' and each entry has the form key#value.
            out.append(dict((k, v) for k, v in [i.split('#') for i in f.split(',')]))
        return out

myscript.pig

    register 'myudf.py' using jython as myudf ;
    B = FOREACH A GENERATE FLATTEN($0#'experiences') ;
    T1 = FOREACH B GENERATE FLATTEN(myudf.foo($0)) AS M ;
    T2 = FOREACH T1 GENERATE M#'id', M#'description' ;

However, this relies on "#", "," and "],[" never appearing in any of the keys or values in the map.
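The delimiter caveat is easy to reproduce outside Pig. The following is a rough standalone sketch that applies the same splitting logic to an ordinary string instead of a bytearray; parse_maps and the sample strings are illustrative names and data, not part of the original UDF.

```python
def parse_maps(text):
    """Split '([k#v,...],[k#v,...])' into a list of dicts, using the same
    strip/split steps as the UDF."""
    body = text.strip('()')
    out = []
    for chunk in body.split('],['):
        chunk = chunk.strip('[]')
        out.append(dict(pair.split('#') for pair in chunk.split(',')))
    return out


good = parse_maps('([id#1,description#blabla],[id#2,description#blabla2])')
print(good)

# A comma inside a value breaks the split: 'bla,bla' becomes two entries,
# and the second one has no '#' separating a key from a value.
try:
    parse_maps('([id#1,description#bla,bla])')
    comma_ok = True
except ValueError:
    comma_ok = False
print(comma_ok)
```

In this sketch the bad input fails loudly with a ValueError, so a description containing a comma would be rejected rather than silently parsed into the wrong keys.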

NOTE: This works for pig 0.11.

It seems that the way Pig handles the input to Python UDFs changed in this version. Instead of foo receiving a bytearray, the input is automatically converted to the appropriate type, which makes everything much easier:

myudf.py

    @outputSchema('vals: {(val:map[])}')
    def foo(the_input):
        # This converts the indeterminate number of maps into a bag.
        out = []
        for m in the_input:
            out.append(m)
        return out

myscript.pig

    register 'myudf.py' using jython as myudf ;
    -- This time you should pass in the entire tuple.
    B = FOREACH A GENERATE $0#'experiences' ;
    T1 = FOREACH B GENERATE FLATTEN(myudf.foo($0)) AS M ;
    T2 = FOREACH T1 GENERATE M#'id', M#'description' ;
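One way to check the simpler UDF without a cluster is to run it as ordinary Python. The sketch below assumes the input arrives as a tuple of dict-like maps, as described above. The outputSchema function defined here is a hypothetical no-op stand-in written only so the file runs outside Pig; it is not Pig's own decorator.

```python
def outputSchema(schema):
    # Hypothetical stand-in for local testing only: returns the function unchanged.
    def decorator(func):
        return func
    return decorator


@outputSchema('vals: {(val:map[])}')
def foo(the_input):
    # Collect however many maps the tuple holds into a list (the bag).
    return [m for m in the_input]


# Tuples of different lengths are handled the same way.
one = foo(({"id": "1", "description": "blabla"},))
three = foo((
    {"id": "1", "description": "a"},
    {"id": "2", "description": "b"},
    {"id": "3", "description": "c"},
))
print(len(one), len(three))
```

Unlike the TOBAG(T.$0, T.$1) version, nothing here depends on how many maps a tuple contains.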