Count on GROUP BY over multiple columns and get the original dataset

Problem description

2, cornflakes, Regular, General Mills, 12
3, cornflakes, Mixed Nuts, Post, 14
4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

Script

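-- Group the rows by the (item, type) pair.
data_grp = GROUP data BY (item, type);
-- Count the rows in each group; Pig built-in names are case-sensitive, so COUNT must be uppercase.
data_cnt = FOREACH data_grp GENERATE FLATTEN(group) AS (item, type), COUNT(data) AS total;
-- Keep only the groups with fewer than two rows.
filter_data = FILTER data_cnt BY total < 2;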

I now need the rows of the original data that belong to the filtered groups. My desired output is:

4, chocolate syrup, Regular, Hersheys, 5
6, chocolate syrup, Regular, Ghirardeli, 6

Solution

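filter_data contains only the (item, type) keys that pass the count filter, for example (chocolate syrup, Regular) for the desired output. Join filter_data back to the original dataset on (item, type) to recover the full rows. Note that in the sample rows shown, (chocolate syrup, Regular) occurs twice, so total < 2 would not select it; adjust the comparison to the count you actually want.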

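-- Count the rows in each (item, type) group.
data_grp = GROUP data BY (item, type);
data_cnt = FOREACH data_grp GENERATE FLATTEN(group) AS (item, type), COUNT(data) AS total;
-- Keep the groups that pass the count filter.
filter_data = FILTER data_cnt BY total < 2;
-- Join the surviving keys back to the original rows.
o_data = JOIN data BY (item, type), filter_data BY ($0, $1);
-- The first five fields of the join are the original columns.
final_data = FOREACH o_data GENERATE $0..$4;
DUMP final_data;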
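
To check which rows a given count condition should return before running the Pig job, the same group, count, filter, and join-back steps can be mirrored in a few lines of Python. This is only a rough sketch: the function name rows_by_group_count and the column order (id, item, type, brand, price) are assumptions for illustration, and the rows are the six sample rows shown in the question.

```python
from collections import Counter

# Sample rows copied from the question: (id, item, type, brand, price).
# Column names other than item and type are assumptions.
ROWS = [
    (2, "cornflakes", "Regular", "General Mills", 12),
    (3, "cornflakes", "Mixed Nuts", "Post", 14),
    (4, "chocolate syrup", "Regular", "Hersheys", 5),
    (5, "chocolate syrup", "No High Fructose", "Hersheys", 8),
    (6, "chocolate syrup", "Regular", "Ghirardeli", 6),
    (7, "chocolate syrup", "Strawberry Flavor", "Ghirardeli", 7),
]


def rows_by_group_count(rows, keep):
    """Return the original rows whose (item, type) group size satisfies keep()."""
    # GROUP ... BY (item, type) followed by COUNT: one counter entry per key.
    counts = Counter((row[1], row[2]) for row in rows)
    # FILTER: keep only the keys whose count passes the predicate.
    wanted = {key for key, total in counts.items() if keep(total)}
    # JOIN back to the original rows on the same key, preserving input order.
    return [row for row in rows if (row[1], row[2]) in wanted]


if __name__ == "__main__":
    for row in rows_by_group_count(ROWS, lambda total: total < 2):
        print(row)
```

On these six rows alone, passing lambda total: total == 2 instead returns rows 4 and 6, since (chocolate syrup, Regular) is the only pair that appears twice.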
