Pig 10.0 - group the tuples and merge bags in a foreach

Question

I'm using Pig 10.0. I want to merge bags in a foreach. Let's say I have the following visitors alias:

(a, b, {1, 2, 3, 4}),
(a, d, {1, 3, 6}),
(a, e, {7}),
(z, b, {1, 2, 3})

I want to group the tuples on the first field and merge the bags with set semantics to get the following tuples:

({1, 2, 3, 4, 6, 7}, a, 6) 
({1, 2, 3}, z, 3) 

The first field is the union of the bags with set semantics. The second field of the tuple is the group field. The third field is the number of items in the bag.

I tried several variations of the following code (replacing SetUnion with Group/Distinct, etc.) but always failed to get the desired behavior:

DEFINE SetUnion        datafu.pig.bags.sets.SetUnion();

grouped = GROUP visitors by (FirstField);
merged = FOREACH grouped {
    VU = SetUnion(visitors.ThirdField);
    GENERATE 
        VU        as Vu,
        group     as FirstField,
        COUNT(VU) as Cnt;
    }
dump merged;

Can you explain where I'm going wrong and how to implement the desired behavior?

Recommended answer

I finally managed to achieve the desired behavior. A self-contained example of my solution follows:

Data file:

a       b       1
a       b       2
a       b       3
a       b       4
a       d       1
a       b       3
a       b       6
a       e       7
z       b       1
z       b       2
z       b       3

Code:

-- Prepare data
in = LOAD 'data' USING PigStorage() 
        AS (One:chararray, Two:chararray, Id:long);

grp = GROUP in by (One, Two);
cnt = FOREACH grp {
        ids = DISTINCT in.Id;
        GENERATE
                ids        as Ids,
                group.One  as One,
                group.Two  as Two,
                COUNT(ids) as Count;
}       

-- Interesting code follows
grp2 = GROUP cnt by One;
cnt2 = FOREACH grp2 {
        ids = FOREACH cnt.Ids generate FLATTEN($0);
        GENERATE
                ids  as Ids,
                group      as One,
                COUNT(ids) as Count;
}               

describe cnt2;
dump grp2;
dump cnt2;

Describe output:

Cnt: {Ids: {(Id: long)},One: chararray,Two: chararray,Count: long}

grp2:

(a,{({(1),(2),(3),(4),(6)},a,b,5),({(1)},a,d,1),({(7)},a,e,1)})
(z,{({(1),(2),(3)},z,b,3)})

cnt2:

({(1),(2),(3),(4),(6),(1),(7)},a,7)
({(1),(2),(3)},z,3)
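
Note that the cnt2 output above still contains (1) twice for group a and reports a count of 7, not the 6 asked for in the question. The following is a minimal sketch of one way to get set semantics, assuming the same tab-separated data file. The alias names visits, by_one and merged are illustrative and not part of the original answer. It groups directly on the first field and applies DISTINCT inside the nested block:

```pig
-- Sketch only: alias names are illustrative, not from the original answer.
-- Assumes the same tab-separated 'data' file shown above.
visits = LOAD 'data' USING PigStorage()
        AS (One:chararray, Two:chararray, Id:long);

-- Group on the first field only, skipping the intermediate (One, Two) grouping
by_one = GROUP visits BY One;

merged = FOREACH by_one {
        -- DISTINCT in the nested block drops duplicate Ids within each group
        ids = DISTINCT visits.Id;
        GENERATE
                ids        AS Ids,
                group      AS One,
                COUNT(ids) AS Count;
};

dump merged;
```

On the sample data this should leave six distinct Ids for a and three for z. If the input already holds bags, as in the question's visitors alias, flattening the bag field with FLATTEN in a top-level FOREACH before grouping should give the same flat shape to start from.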

Since the code uses a FOREACH nested in a FOREACH, it requires Pig 10.0 or later.

I will leave the question unresolved for a few days, since a cleaner solution probably exists.
