Removing duplicates using PigLatin
Question
I'm using PigLatin to filter some records.
User1 8 NYC
User1 9 NYC
User1 7 LA
User2 4 NYC
User2 3 DC
The script should remove the duplicates for each user and keep one of that user's records, something like the uniq command in Linux.
The output should be:
User1 8 NYC
User2 4 NYC
Any suggestions?
Recommended answer
For your particular example, DISTINCT will not work well: it compares whole tuples, and your output contains all of the input columns ($0, $1, $2), so no two rows are identical. You can apply DISTINCT only to a projection that has columns ($0, $2) or ($0), which loses $1.
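To make the projection point concrete, here is a minimal sketch. The file path, the space delimiter, and the field names (user_id, score, city) are assumptions for illustration; the question does not state any of them.

```pig
-- Hypothetical path and schema; neither is given in the question.
inpt = LOAD 'users.txt' USING PigStorage(' ')
       AS (user_id:chararray, score:int, city:chararray);

-- DISTINCT compares whole tuples, so on the full relation nothing is removed
-- for the sample data: every row differs from the others in at least one field.
all_cols = DISTINCT inpt;

-- Projecting first makes rows comparable on fewer fields, at the cost of
-- dropping score ($1) from the output.
user_city = FOREACH inpt GENERATE user_id, city;
uniq_user_city = DISTINCT user_city;   -- one row per (user_id, city) pair

users = FOREACH inpt GENERATE user_id;
uniq_users = DISTINCT users;           -- one row per user_id
```

With the five sample records, uniq_user_city would hold four rows and uniq_users two, but neither relation carries the score column that the desired output needs.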
To select one record per user (any record), you could use a GROUP BY and a nested FOREACH with LIMIT. For example:
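inpt = load '......' ......;
-- group all records that share the same user ($0)
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
    -- keep one arbitrary record from each user's bag
    top_rec = LIMIT inpt 1;
    GENERATE FLATTEN(top_rec);
};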
This approach gives you records that are unique on a subset of fields, and it also lets you control how many output records are kept per user.
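If it matters which records survive, one possible variation is to sort each user's bag inside the nested FOREACH before applying LIMIT. The sketch below is an illustration and not part of the original answer; the path, delimiter, field names, and the choice of keeping the two highest scores are all assumptions.

```pig
-- Hypothetical path and schema; neither is given in the question.
inpt = LOAD 'users.txt' USING PigStorage(' ')
       AS (user_id:chararray, score:int, city:chararray);

user_grp = GROUP inpt BY user_id;

top_n = FOREACH user_grp {
    sorted = ORDER inpt BY score DESC;  -- highest score first within each user
    kept = LIMIT sorted 2;              -- keep at most 2 records per user
    GENERATE FLATTEN(kept);
};

DUMP top_n;
```

Changing the LIMIT value adjusts how many records each user keeps, and changing the ORDER clause decides which ones. Without the ORDER step, as in the example above it, the record that LIMIT 1 returns is arbitrary.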