Pig 10.0 - group the tuples and merge bags in a foreach


Problem description

I'm using Pig 10.0. I want to Merge bags in a foreach. Let's say I have the following visitors alias:

(a, b, {1, 2, 3, 4}),
(a, d, {1, 3, 6}),
(a, e, {7}),
(z, b, {1, 2, 3})

I want to group the tuples on the first field and merge the bags with set semantics to get the following tuples:

({1, 2, 3, 4, 6, 7}, a, 6) 
({1, 2, 3}, z, 3) 

The first field is the union of the bags with set semantics. The second field of the tuple is the group field. The third field is the number of items in the bag.

I tried several variations of the following code (replacing SetUnion with Group/Distinct, etc.) but never achieved the desired behavior:

DEFINE SetUnion        datafu.pig.bags.sets.SetUnion();

grouped = GROUP visitors by (FirstField);
merged = FOREACH grouped {
    VU = SetUnion(visitors.ThirdField);
    GENERATE 
        VU        as Vu,
        group     as FirstField,
        COUNT(VU) as Cnt;
    }
dump merged;

Can you explain where I'm wrong and how to implement the desired behavior?

Solution

I finally managed to achieve the desired behavior. A self-contained example of my solution follows:

Data file:

a       b       1
a       b       2
a       b       3
a       b       4
a       d       1
a       b       3
a       b       6
a       e       7
z       b       1
z       b       2
z       b       3

Code:

-- Prepare data
in = LOAD 'data' USING PigStorage() 
        AS (One:chararray, Two:chararray, Id:long);

grp = GROUP in by (One, Two);
cnt = FOREACH grp {
        ids = DISTINCT in.Id;
        GENERATE
                ids        as Ids,
                group.One  as One,
                group.Two  as Two,
                COUNT(ids) as Count;
}       

-- Interesting code follows
grp2 = GROUP cnt by One;
cnt2 = FOREACH grp2 {
        ids = FOREACH cnt.Ids generate FLATTEN($0);
        GENERATE
                ids  as Ids,
                group      as One,
                COUNT(ids) as Count;
}               

describe cnt2;
dump grp2;
dump cnt2;

Describe:

Cnt: {Ids: {(Id: long)},One: chararray,Two: chararray,Count: long}

grp2:

(a,{({(1),(2),(3),(4),(6)},a,b,5),({(1)},a,d,1),({(7)},a,e,1)})
(z,{({(1),(2),(3)},z,b,3)})

cnt2:

({(1),(2),(3),(4),(6),(1),(7)},a,7)
({(1),(2),(3)},z,3)

Since the code uses a FOREACH nested in a FOREACH, it requires Pig 10.0 or later.
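
Note that the cnt2 output above still contains (1) twice for group a, so the count is 7 rather than the 6 asked for in the question. A minimal sketch of one possible fix follows, assuming a nested DISTINCT can be applied to the result of the nested FOREACH within the same block. The alias name flat is introduced here for illustration and is not part of the original answer:

```pig
-- Sketch only: same grp2 as above, with a DISTINCT added after the flatten
cnt2 = FOREACH grp2 {
        -- unpack every inner bag of ids into one bag of (Id) tuples
        flat = FOREACH cnt.Ids GENERATE FLATTEN($0);
        -- drop ids that appear under more than one (One, Two) pair
        ids  = DISTINCT flat;
        GENERATE
                ids        AS Ids,
                group      AS One,
                COUNT(ids) AS Count;
}
```

With the data file above, group a should then contain six distinct ids with a Count of 6. The order of the tuples inside the bag is not guaranteed.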

I will leave the question unresolved for a few days, since a cleaner solution probably exists.
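
One candidate for such a cleaner solution is to skip the intermediate GROUP on (One, Two) and group the flat rows on One directly, applying DISTINCT once inside the FOREACH. The sketch below assumes the same data file as above; the alias names rows, by_one and merged are hypothetical and chosen only for this illustration:

```pig
-- Sketch only: one GROUP plus a nested DISTINCT, no nested FOREACH needed
rows = LOAD 'data' USING PigStorage()
        AS (One:chararray, Two:chararray, Id:long);

-- If the input already holds bags, as in the visitors alias of the question,
-- it could be flattened into the same shape first, for example:
--   rows = FOREACH visitors GENERATE FirstField AS One, FLATTEN(ThirdField) AS Id;

by_one = GROUP rows BY One;
merged = FOREACH by_one {
        -- set semantics: keep each id once per group
        ids = DISTINCT rows.Id;
        GENERATE
                ids        AS Ids,
                group      AS One,
                COUNT(ids) AS Count;
}

dump merged;
```

Because duplicates are removed after all ids of a group have been collected, an id that occurs under several values of Two is counted only once.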
