Pig: How to join data on a key in a nested bag


Question

I'm simply trying to merge the values from data2 into data1 on the 'value1'/'value2' keys seen in both data1 and data2 (note the nested structure of data1).

Easy, right? In object-oriented code it's a nested for loop. But in Pig it feels like solving a Rubik's Cube.

data1 = 'item1'     111     { ('thing1', 222, {('value1'),('value2')}) }
data2 = 'value1'    'result1'
        'value2'    'result2'

A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );

expected: 'item1', 111, {('thing1', 222, {('value1','result1'), ('value2','result2')})}
                                           ^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^

For the curious: data1 comes from an object-oriented datastore, which explains the double nesting (a simple object-oriented format).
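
For comparison, the nested for loop mentioned above might look roughly like this in Python. This is only a sketch: the in-memory representation of data1 and data2 (tuples, lists, and a dict) and the function name merge_results are assumptions made for illustration. Values with no match in data2 are dropped, which is just one possible choice, since the question doesn't say what should happen to them.

```python
# Hypothetical in-memory versions of data1 and data2 from the question.
data1 = [
    ("item1", 111, [("thing1", 222, [("value1",), ("value2",)])]),
]
data2 = {"value1": "result1", "value2": "result2"}


def merge_results(rows, lookup):
    """Attach the matching result to every value in the nested bags."""
    merged = []
    for item, d, things in rows:
        new_things = []
        for thing, d1, values in things:
            # Keep only values that have a match (inner-join behaviour).
            new_values = [(v, lookup[v]) for (v,) in values if v in lookup]
            new_things.append((thing, d1, new_values))
        merged.append((item, d, new_things))
    return merged


print(merge_results(data1, data2))
```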

Recommended answer

It sounds like you basically just want to do a join (it's unclear from the question whether this should be INNER, LEFT, RIGHT, or FULL). I think @SNeumann basically has the right answer, but I'll add some code to make it clearer.

Assuming data that looks like this:

data1 = 'item1'     111     { ('thing1', 222, {('value1'),('value2')}) }
        ...
data2 = 'value1'    'result1'
        'value2'    'result2'
        ...

I'd do something like this (untested):

A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
-- Note: these projections assume things is a single tuple, not a bag (see the EDIT below).
A_flattened = FOREACH A GENERATE item, d, things.thing AS thing, things.d1 AS d1, FLATTEN(things.values) AS value;
--This looks like:
--'item1', 111, 'thing1', 222, 'value1'
--'item1', 111, 'thing1', 222, 'value2'
A_B_joined = JOIN A_flattened BY value, B BY v;
--This looks like:
--'item1', 111, 'thing1', 222, 'value1', 'value1', 'result1'
--'item1', 111, 'thing1', 222, 'value2', 'value2', 'result2'
A_B_joined1 = FOREACH A_B_joined GENERATE item, d, thing, d1, A_flattened::value AS value, r AS result;
A_B_grouped = GROUP A_B_joined1 BY (value, result);

From there, rebagging however you like should be trivial.
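
To make the flatten, join, and rebag steps concrete, here is a rough Python analogue of the pipeline. It is not Pig and only a sketch: the variable names mirror the Pig aliases, the sample rows are hypothetical, things is assumed to be a single tuple rather than a bag, and the final regrouping by the outer fields (item, d, thing, d1) is one possible way to rebag, not necessarily the one intended above.

```python
from collections import defaultdict

# Hypothetical rows; things is a tuple here, as the flattening step assumes.
A = [("item1", 111, ("thing1", 222, [("value1",), ("value2",)]))]
B = [("value1", "result1"), ("value2", "result2")]

# FLATTEN: one output row per value in the inner bag.
A_flattened = [
    (item, d, thing, d1, v)
    for item, d, (thing, d1, values) in A
    for (v,) in values
]

# Inner JOIN on the value, keeping only the result column from B.
A_B_joined = [
    (item, d, thing, d1, v, r)
    for item, d, thing, d1, v in A_flattened
    for bv, r in B
    if v == bv
]

# Rebag: group the (value, result) pairs under their outer fields.
groups = defaultdict(list)
for item, d, thing, d1, v, r in A_B_joined:
    groups[(item, d, thing, d1)].append((v, r))

rebagged = [
    (item, d, (thing, d1, pairs))
    for (item, d, thing, d1), pairs in groups.items()
]
print(rebagged)
```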

EDIT: The above should have used '.' as the projection operator on tuples. I've switched that in. It also assumed things was a big tuple, which it isn't; it's a bag containing one item. If the OP never plans to have more than one item in that bag, I'd highly recommend using a tuple instead and loading as:

A = load 'data1' as (item:chararray, d:int, things:(thing:chararray, d1:int, values:bag{(v:chararray)})); 

and then using the rest of the code essentially as is (note: still untested).

If a bag is absolutely required, then the entire problem changes, and it becomes unclear what the OP wants to happen when there are multiple things objects in the bag. Bag projection is also quite a bit more complicated, as noted here
