Pig: how to filter distinct couples (pairs)

Question

I am new to Pig. I have a Pig script that generates tab-separated pairs of two elements, one pair per line, for example:

John   Paul
Tom    Nik
Mark   Bill
Tom    Nik
Paul   John

I need to filter out duplicate combinations. If I use DISTINCT, it removes the duplicate "Tom Nik" entry. The result is:

John   Paul
Tom    Nik
Mark   Bill
Paul   John

The problem with this approach is that I am left with both "John Paul" and "Paul John", which for my purposes should be treated as the same combination. Is there a way to remove permuted combinations?

Recommended answer

I'm not sure how string comparison is implemented in Pig, but it may be worthwhile to try something like:

-- A is your input
B = FOREACH A GENERATE FLATTEN(($0 < $1 ? ($0, $1) : ($1, $0))) ; 
C = DISTINCT B ;

By ordering the names within each pair so that the 'smaller' one always appears first, both "John Paul" and "Paul John" should end up in the same order, so DISTINCT eliminates one of them.

However, this approach depends entirely on how the string comparison is implemented. For example, if it compared only length, the "John Paul" case would not be filtered correctly, because both names have four characters and neither would count as the smaller one.
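To see why the choice of comparison matters, here is a minimal sketch of the same idea written in Python rather than Pig. It is only an illustration of the logic, not a statement about how Pig behaves: the function name `distinct_unordered_pairs` and the `less_than` parameter are hypothetical, and the sample data is the input from the question.

```python
def distinct_unordered_pairs(pairs, less_than=lambda a, b: a < b):
    """Drop duplicate pairs, treating (x, y) and (y, x) as the same pair.

    `less_than` decides which element goes first; by default it is
    ordinary lexicographic string comparison.
    """
    seen = set()
    result = []
    for left, right in pairs:
        # Put the 'smaller' element first so both permutations look alike.
        ordered = (left, right) if less_than(left, right) else (right, left)
        if ordered not in seen:
            seen.add(ordered)
            result.append(ordered)
    return result


pairs = [
    ("John", "Paul"),
    ("Tom", "Nik"),
    ("Mark", "Bill"),
    ("Tom", "Nik"),
    ("Paul", "John"),
]

# Lexicographic comparison: "John Paul" and "Paul John" collapse into one.
lexicographic = distinct_unordered_pairs(pairs)

# Length-only comparison (the failure case described above): John and Paul
# tie at four characters, so the two permutations stay different.
by_length = distinct_unordered_pairs(
    pairs, less_than=lambda a, b: len(a) < len(b)
)

print(len(lexicographic), len(by_length))
```

With a lexicographic comparison three pairs remain, while the length-only comparison leaves four, since the "John Paul" permutations are never brought into the same order. Note that the names inside a surviving pair may be swapped relative to the input (for example "Nik Tom" instead of "Tom Nik").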
