Removing duplicates using PigLatin

Views: 74  Published: 2020/9/3 20:01:47  Tag: apache-pig


Problem description

I'm using PigLatin to filter some records:

User1  8 NYC 
User1  9 NYC 
User1  7 LA 
User2  4 NYC
User2  3 DC

The script should remove the duplicates for each user and keep one of that user's records, something like the uniq command in Linux.

The output should be:

User1 8 NYC 
User2 4 NYC

Any suggestions?

Recommended answer

For your particular example, DISTINCT will not work well. Your output contains all of the input columns ($0, $1, $2), and DISTINCT compares whole records, so rows that differ in $1 are not treated as duplicates. You could apply DISTINCT only to a projection such as ($0, $2) or ($0), but then you lose $1.
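To make the limitation concrete, here is a minimal sketch of DISTINCT on a projection. The file name users.txt, the tab delimiter, and the field names user, num, and city are assumptions for illustration; none of them appear in the original question.

```pig
-- Hypothetical path, delimiter, and schema (not given in the question).
inpt = LOAD 'users.txt' USING PigStorage('\t')
       AS (user:chararray, num:int, city:chararray);

-- DISTINCT compares whole tuples, so project down to the key first.
users_only = FOREACH inpt GENERATE user;
uniq_users = DISTINCT users_only;

-- One tuple per user remains, but num and city are gone.
DUMP uniq_users;
```

This yields a unique list of users but cannot return a full record for each of them, which is why a grouping approach fits the question better.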

To select one record per user (any record), you can use GROUP BY with a nested FOREACH and LIMIT. For example:

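-- Load the input; the path and load options are elided in the original answer.
inpt = load '......' ......;
-- Group all records by the first field (the user).
user_grp = GROUP inpt BY $0;
-- For each user, keep a single arbitrary record from the group.
filtered = FOREACH user_grp {
      top_rec = LIMIT inpt 1;
      GENERATE FLATTEN(top_rec);
};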

This approach gives you records that are unique on a subset of fields. It also lets you control how many output records are kept per user, by changing the LIMIT value.
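Because LIMIT alone keeps an arbitrary record, it may help to sort inside the nested block when a specific record per user is wanted. The following is a sketch only, under assumed names: the path users.txt, the tab delimiter, and the fields user, num, and city are hypothetical, and ordering by num is just one possible choice.

```pig
-- Hypothetical path, delimiter, and schema (not given in the question).
inpt = LOAD 'users.txt' USING PigStorage('\t')
       AS (user:chararray, num:int, city:chararray);

user_grp = GROUP inpt BY user;

-- Sort each user's records, then keep the first one.
top_per_user = FOREACH user_grp {
    sorted  = ORDER inpt BY num DESC;  -- largest num first
    top_rec = LIMIT sorted 1;          -- raise 1 to keep more per user
    GENERATE FLATTEN(top_rec);
};

DUMP top_per_user;
```

With this variant the record kept for each user is the one with the largest num rather than whichever record LIMIT happens to return, so it would not necessarily match the exact rows shown in the question's expected output.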
