Pig 10.0 - group the tuples and merge bags in a foreach
Question
I'm using Pig 10.0. I want to merge bags in a foreach. Let's say I have the following visitors alias:
(a, b, {1, 2, 3, 4}),
(a, d, {1, 3, 6}),
(a, e, {7}),
(z, b, {1, 2, 3})
I want to group the tuples on the first field and merge the bags with set semantics to get the following tuples:
({1, 2, 3, 4, 6, 7}, a, 6)
({1, 2, 3}, z, 3)
The first field is the union of the bags with set semantics. The second field of the tuple is the group field. The third field is the number of items in the bag.
I tried several variations of the following code (replacing SetUnion with Group/Distinct, etc.) but never achieved the desired behavior:
DEFINE SetUnion datafu.pig.bags.sets.SetUnion();
grouped = GROUP visitors by (FirstField);
merged = FOREACH grouped {
    VU = SetUnion(visitors.ThirdField);
    GENERATE
        VU as Vu,
        group as FirstField,
        COUNT(VU) as Cnt;
}
dump merged;
Can you explain where I went wrong and how to implement the desired behavior?
Recommended answer
I finally managed to achieve the desired behavior. A self-contained example of my solution follows:
Data file:
a b 1
a b 2
a b 3
a b 4
a d 1
a b 3
a b 6
a e 7
z b 1
z b 2
z b 3
Code:
-- Prepare data
in = LOAD 'data' USING PigStorage()
    AS (One:chararray, Two:chararray, Id:long);
grp = GROUP in by (One, Two);
cnt = FOREACH grp {
    ids = DISTINCT in.Id;
    GENERATE
        ids as Ids,
        group.One as One,
        group.Two as Two,
        COUNT(ids) as Count;
}
-- Interesting code follows
grp2 = GROUP cnt by One;
cnt2 = FOREACH grp2 {
    ids = FOREACH cnt.Ids generate FLATTEN($0);
    GENERATE
        ids as Ids,
        group as One,
        COUNT(ids) as Count;
}
describe cnt;
dump grp2;
dump cnt2;
Describe:
Cnt: {Ids: {(Id: long)},One: chararray,Two: chararray,Count: long}
grp2:
(a,{({(1),(2),(3),(4),(6)},a,b,5),({(1)},a,d,1),({(7)},a,e,1)})
(z,{({(1),(2),(3)},z,b,3)})
cnt2:
({(1),(2),(3),(4),(6),(1),(7)},a,7)
({(1),(2),(3)},z,3)
Since the code uses a FOREACH nested in a FOREACH, it requires Pig 10.0 (that is, release 0.10.0) or later.
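Note that the cnt2 output above still lists (1) twice for group a and reports a count of 7, whereas the question asked for 6 distinct ids. A minimal sketch of one way to get set semantics, continuing from the cnt alias in the script above: flatten the bags at the top level, regroup on the first field, and apply a nested DISTINCT. The aliases flat, grp3 and cnt3 are illustrative names that do not appear in the original script.

```pig
-- Sketch only: flat, grp3 and cnt3 are hypothetical aliases added for illustration.
-- Turn each (Ids, One, Two, Count) row of cnt into one (One, Id) row per id.
flat = FOREACH cnt GENERATE One, FLATTEN(Ids) AS Id;
-- Regroup on the first field only.
grp3 = GROUP flat BY One;
cnt3 = FOREACH grp3 {
    -- Nested DISTINCT drops duplicates such as the second (1) in group a.
    ids = DISTINCT flat.Id;
    GENERATE
        ids AS Ids,
        group AS One,
        COUNT(ids) AS Count;
}
dump cnt3;
```

With the data file above, this should report 6 ids for group a and 3 for group z. This variant relies only on a nested DISTINCT and does not need a FOREACH nested in a FOREACH.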
I will leave the question unresolved for a few days, since a cleaner solution probably exists.