Pig: how to filter distinct couples (pairs)
Question
I am new to Pig. I have a Pig script that generates tab-separated pairs of two elements, one pair per line. For example:
John Paul
Tom Nik
Mark Bill
Tom Nik
Paul John
I need to filter out duplicate combinations. If I use DISTINCT, it removes the duplicate "Tom Nik" entry. The result is:
John Paul
Tom Nik
Mark Bill
Paul John
The problem with this approach is that I am left with both "John Paul" and "Paul John", which for my purposes should be treated as the same combination. Is there a way to remove permuted combinations as well?
Recommended answer
I'm not sure how string comparison is implemented in Pig, but it may be worthwhile to try something like:
-- A is your input
B = FOREACH A GENERATE FLATTEN(($0 < $1 ? ($0, $1) : ($1, $0))) ;
C = DISTINCT B ;
By ordering the names so that the 'smaller' one always appears first, both John Paul and Paul John should end up in the same order, so that DISTINCT eliminates one of them.
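To make the idea concrete outside of Pig, here is a minimal Python sketch of the same "order each pair, then deduplicate" logic. It is only an illustration: the sample data mirrors the question, the helper names `canonical` and `distinct_unordered` are made up for this example, and Python's `<` on strings is lexicographic, which may or may not match what Pig does.

```python
# Sample rows mirroring the pairs in the question.
pairs = [
    ("John", "Paul"),
    ("Tom", "Nik"),
    ("Mark", "Bill"),
    ("Tom", "Nik"),
    ("Paul", "John"),
]


def canonical(pair):
    """Order the two names so the lexicographically smaller one comes first."""
    a, b = pair
    return (a, b) if a < b else (b, a)


def distinct_unordered(rows):
    """Keep one canonical copy of each unordered pair, in first-seen order."""
    seen = set()
    result = []
    for row in rows:
        key = canonical(row)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


unique_pairs = distinct_unordered(pairs)
print(unique_pairs)
# [('John', 'Paul'), ('Nik', 'Tom'), ('Bill', 'Mark')]
```

Note that the surviving rows come out in canonical order (for example "Nik Tom" rather than "Tom Nik"), which is also what the FLATTEN plus DISTINCT version would produce.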
However, this approach depends entirely on how string comparison is implemented. For example, if the comparison were based on length, the John Paul case would not be filtered correctly, since both names have the same length and neither would count as the smaller one.
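The length caveat can be seen with a small Python sketch. The function `canonical_by_length` is hypothetical and deliberately uses a length-only comparison; it is not a claim about how Pig actually compares strings.

```python
def canonical_by_length(pair):
    """Hypothetical ordering that compares only the lengths of the names."""
    a, b = pair
    return (a, b) if len(a) < len(b) else (b, a)


first = canonical_by_length(("John", "Paul"))
second = canonical_by_length(("Paul", "John"))

# Both names have four characters, so the condition is false either way and
# each pair is simply swapped: the two rows still differ after normalization.
print(first, second)
# ('Paul', 'John') ('John', 'Paul')
```

With such a comparison a later DISTINCT step would keep both rows, so it is worth checking on a few sample rows that the comparison gives a strict ordering between different names.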