Removing duplicates using PigLatin
Problem description
I'm using PigLatin to filter some records.
User1 8 NYC
User1 9 NYC
User1 7 LA
User2 4 NYC
User2 3 DC
The script should remove the duplicate records for each user and keep just one of them, something like the uniq command in Linux.
The output should be:
User1 8 NYC
User2 4 NYC
Any suggestions?
Recommended answer
For your particular example, DISTINCT will not work well, because your output contains all of the input columns ($0, $1, $2) and the rows differ in $1. You can apply DISTINCT only to a projection that has columns ($0, $2) or ($0), which means losing $1.
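To make the limitation concrete, here is a minimal sketch of DISTINCT on a projection. The file name users.txt, the space delimiter, and the field names user, score, and city are assumptions made for illustration; they do not come from the question.

```pig
-- Hypothetical path, delimiter, and schema; adjust to the real input.
inpt = LOAD 'users.txt' USING PigStorage(' ')
       AS (user:chararray, score:int, city:chararray);

-- Project only the columns to deduplicate on; score ($1) is dropped here.
proj = FOREACH inpt GENERATE user, city;

-- DISTINCT compares whole tuples, so it only removes rows identical in (user, city).
deduped = DISTINCT proj;

DUMP deduped;
```

With the sample data, User1 would still appear twice (once for NYC and once for LA), and the score column is gone, which is why this does not match the desired output.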
To select one record per user (any record), you could use GROUP BY and a nested FOREACH with LIMIT. For example:
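-- Load the input (path and load options are elided here)
inpt = LOAD '......' ......;
-- Group all records by the user field ($0)
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
    -- Keep a single, arbitrary record from each user's bag
    top_rec = LIMIT inpt 1;
    GENERATE FLATTEN(top_rec);
};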
This approach gives you records that are unique on a subset of fields, and it also caps the number of output records per user at a value you control through LIMIT.
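If a specific record per user is wanted rather than an arbitrary one, a nested ORDER can be placed before the LIMIT. The sketch below is one possible variation, not part of the original answer; the file name users.txt, the space delimiter, and the field names are assumed for illustration.

```pig
-- Hypothetical path, delimiter, and schema; adjust to the real input.
inpt = LOAD 'users.txt' USING PigStorage(' ')
       AS (user:chararray, score:int, city:chararray);

user_grp = GROUP inpt BY user;

filtered = FOREACH user_grp {
    -- Sort each user's bag so the highest score comes first.
    sorted = ORDER inpt BY score DESC;
    -- Keep one record per user; raise the number to keep more.
    top_rec = LIMIT sorted 1;
    GENERATE FLATTEN(top_rec);
};

DUMP filtered;
```

Under these assumptions the script keeps the highest-scoring row for each user, so for the sample data User1 would be represented by the row with score 9 rather than 8. Changing DESC to ASC, or ordering by a different field, changes which record survives.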