PIG: Get all tuples out of a grouped bag

Question

I am using PIG to generate groups from tuples as follows:

a1, b1
a1, b2
a1, b3
...

->

a1, [b1, b2, b3]
...

This is easy and it works. My problem is the next step: from each group obtained, I would like to generate all pairs of elements in the group's bag:

a1, [b1, b2, b3]

->

b1,b2
b1,b3
b2,b3

This would be easy if I could nest "foreach": first iterate over each group, and then over its bag.

I suppose I am misunderstanding the concept, and I would appreciate an explanation.

Thanks.

Recommended answer

It looks like you need the Cartesian product of the bag with itself. To get it, use FLATTEN(bag) twice in the same generate.

Code:

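-- load comma-separated (id, val) rows
inpt = load '.../group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
-- keep only the val column of each group's bag
id_grp = foreach grp generate group as id, inpt.val as value_bag;
-- two FLATTENs in one generate yield the cross product of the two bags
result = foreach id_grp generate id, FLATTEN(value_bag) as v1, FLATTEN(value_bag) as v2;
dump result;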
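
To make the effect of the double FLATTEN concrete, here is a small plain-Python model of it (not Pig code). It assumes the three sample rows from the question; `group_by_id` and `self_cross` are hypothetical helper names introduced only for this sketch.

```python
from collections import defaultdict
from itertools import product

# Sample rows matching the question's (id, val) input.
rows = [("a1", "b1"), ("a1", "b2"), ("a1", "b3")]


def group_by_id(rows):
    """Rough model of `group inpt by id`: maps each id to its bag of vals."""
    bags = defaultdict(list)
    for key, val in rows:
        bags[key].append(val)
    return dict(bags)


def self_cross(bags):
    """Rough model of flattening the same bag twice: every (v1, v2) combination."""
    return [
        (key, v1, v2)
        for key, bag in bags.items()
        for v1, v2 in product(bag, repeat=2)
    ]


result = self_cross(group_by_id(rows))
print(len(result))  # 9 rows for a bag of 3 values
```

Note that the full product also contains self-pairs such as (b1, b1) and both orderings of every pair, so it is a superset of the output asked for in the question.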

Be aware that large bags will produce a lot of rows: a bag of n values yields n × n pairs. To limit this, you could apply TOP(...) before FLATTEN:

inpt = load '....group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    limited_bag = TOP(50, 0, values); -- all sorts of filtering could be done here
    generate id, FLATTEN(limited_bag) as v1, FLATTEN(limited_bag) as v2;
};
dump result;
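
As a rough illustration of why the cap matters, the following Python sketch counts the rows a self cross product would emit with and without a limit. The bag of 1,000 integers is a made-up example, `rows_after_cross` is a hypothetical helper, and `heapq.nlargest` only approximates what TOP does in Pig.

```python
import heapq


def rows_after_cross(bag, limit=None):
    """Count the rows of bag x bag, optionally after keeping only the `limit` largest values."""
    if limit is not None:
        bag = heapq.nlargest(limit, bag)
    return len(bag) ** 2


bag = list(range(1000))  # hypothetical bag of 1,000 values

print(rows_after_cross(bag))      # 1000000
print(rows_after_cross(bag, 50))  # 2500
```

The row count grows quadratically with the bag size, so capping the bag at 50 values bounds each group at 2,500 rows regardless of how large the group was.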

For your specific output you could do some filtering before FLATTEN:

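inpt = load '..../group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    -- restrict which values may appear on the left-hand side of a pair
    l = filter values by val == 'b1' or val == 'b2';
    generate id, FLATTEN(l) as v1, FLATTEN(values) as v2;
};
-- drops (b1, b1) and (b2, b2); note that (b2, b1) is still emitted
result = filter result by v1 != v2;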
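
The difference between filtering on inequality and filtering on order can be checked with a short Python model. This is only a sketch on the question's sample bag, not Pig code; the idea that a `v1 < v2` condition would behave the same way in a Pig filter is an assumption that depends on the field types being comparable.

```python
from itertools import combinations, product

bag = ["b1", "b2", "b3"]

# v1 != v2 removes self-pairs but keeps both orderings of each pair.
ordered = [(v1, v2) for v1, v2 in product(bag, repeat=2) if v1 != v2]

# v1 < v2 keeps each unordered pair exactly once.
unique = [(v1, v2) for v1, v2 in product(bag, repeat=2) if v1 < v2]

print(len(ordered))  # 6
print(unique)        # [('b1', 'b2'), ('b1', 'b3'), ('b2', 'b3')]
```

The second list matches the output requested in the question and is the same as `combinations(sorted(bag), 2)`.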

Hope it helps.

Cheers
