Pig programming: using SPLIT on GROUP BY with HAVING COUNT(*)


Problem description

Input file is:

2, cornflakes, Regular, General Mills, 12
3, cornflakes, Mixed Nuts, Post, 14
4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

filter3 = LOAD 'location_of_file' using PigStorage('\t') as (item_sl : int, item : chararray, type: chararray, manufacturer: chararray, price : int);

SPLIT filter3 INTO filter4 IF (FOREACH (filter3 GROUP BY item) GENERATE group, COUNT(item < 3)), filter6_pass OTHERWISE;

For filter4, this is like a SQL query with GROUP BY item HAVING COUNT(*) < 3; all remaining rows should go to filter6_pass.

The desired output is:

4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

Solution

Group by item, compute the count for each group, filter on that count, and then join the surviving items back to the original relation to recover the full rows.

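A = LOAD 'location_of_file' USING PigStorage('\t') AS (item_sl:int, item:chararray, type:chararray, manufacturer:chararray, price:int);
-- One tuple per item: (group, bag of the A rows for that item)
B = GROUP A BY item;
-- Count the rows in each item's bag
C = FOREACH B GENERATE group, COUNT(A.item) AS Total;
-- Keep only the items that occur more than 3 times
D = FILTER C BY Total > 3;
-- Join back to A to recover the full rows for those items
E = JOIN A BY item, D BY $0;
-- Project the five original columns, dropping the joined count columns
F = FOREACH E GENERATE $0..$4;
DUMP F;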
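
Since the question asks specifically about SPLIT, here is a minimal sketch of a variant that avoids the JOIN. SPLIT conditions are evaluated once per tuple, so the split is applied to the grouped relation, where each tuple carries the bag of rows for one item. The aliases small_groups and large_groups are hypothetical names chosen for illustration, and the sketch assumes your Pig version supports the OTHERWISE branch of SPLIT. Unlike Total > 3 above, OTHERWISE also keeps items that occur exactly 3 times, which matches the complement of COUNT(*) < 3 in the question.

```pig
A = LOAD 'location_of_file' USING PigStorage('\t') AS (item_sl:int, item:chararray, type:chararray, manufacturer:chararray, price:int);

-- Each tuple of B is (group, A), where A is the bag of rows sharing that item
B = GROUP A BY item;

-- Route whole groups by size; small_groups and large_groups are illustrative aliases
SPLIT B INTO small_groups IF COUNT(A) < 3, large_groups OTHERWISE;

-- FLATTEN unnests each bag back into the original five-column rows
filter4 = FOREACH small_groups GENERATE FLATTEN(A);
filter6_pass = FOREACH large_groups GENERATE FLATTEN(A);

DUMP filter6_pass;
```

With the sample input, filter6_pass should hold the four chocolate syrup rows and filter4 the two cornflakes rows, provided the file really is tab-delimited as the LOAD statement assumes.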
