使用PigLatin删除重复项 [英] Removing duplicates using PigLatin

查看:74
本文介绍了使用PigLatin删除重复项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用PigLatin过滤一些记录.

I'm using PigLatin to filter some records.

User1  8 NYC 
User1  9 NYC 
User1  7 LA 
User2  4 NYC
User2  3 DC 

该脚本应为用户删除重复项,并保留其中一条记录.类似于linux中的唯一命令.

The script should remove the duplicate for users, and keep one of these records. Something like the unique command in linux.

输出应为:

User1 8 NYC 
User2 4 NYC

有什么建议吗?

推荐答案

对于您的特定示例,distinct效果不佳,因为您的输出包含所有输入列($0, $1, $2),因此您只能在具有列的投影上进行distinct ($0, $2)($0)并丢失$1.

For your particular example distinct will not work well as your output contains all of the input columns ($0, $1, $2), you can do distinct only on a projection that has columns ($0, $2) or ($0) and lose $1.

为了为每个用户选择一条记录(任何记录),可以使用GROUP BY和嵌套的FOREACHLIMIT.例如:

In order to select one record per user (any record) you could use a GROUP BY and a nested FOREACH with LIMIT. Ex:

inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
      top_rec = LIMIT inpt 1;
      GENERATE FLATTEN(top_rec);
};

这种方法将帮助您获取在字段子集中唯一的记录,并限制每个用户可以控制的输出记录的数量.

This approach will help you get records that are unique on a subset of fields and also limit number of output records per each user, which you can control.

这篇关于使用PigLatin删除重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆