Pig: how to filter distinct couples (pairs)
I am new to Pig. I have a Pig script that generates tab-separated pairs of elements, one pair per line. For example:
John Paul
Tom Nik
Mark Bill
Tom Nik
Paul John
I need to filter out duplicate combinations. If I use DISTINCT, only the repeated "Tom Nik" entry is removed. The result is:
John Paul
Tom Nik
Mark Bill
Paul John
The problem with this approach is that I am left with both "John Paul" and "Paul John", which for my purposes should be treated as the same combination. Is there a way to remove pairs that are just permutations of each other?
I'm not sure how string comparison is implemented in Pig, but it may be worthwhile to try something like:
-- A is your input
B = FOREACH A GENERATE FLATTEN(($0 < $1 ? ($0, $1) : ($1, $0))) ;
C = DISTINCT B ;
By ordering the names within each pair so that the 'smaller' one always appears first, both John Paul and Paul John should end up in the same order, so the DISTINCT eliminates one of them.
However, this approach depends entirely on how the string comparison is implemented. For example, if it compares by length, then the John Paul case will not be filtered correctly, because both names have the same length.
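A variant of the same idea, as a minimal sketch: it names the fields at load time and uses one conditional per output column instead of building tuples. The file name pairs.tsv and the aliases a, b, first and second are assumptions for illustration and do not come from the question. As far as I know, Pig compares chararray values lexicographically rather than by length, in which case "John" sorts before "Paul" and both orderings collapse to the same row, but it is worth confirming this on your Pig version.

```pig
-- Load the tab-separated pairs (file name is hypothetical)
A = LOAD 'pairs.tsv' USING PigStorage('\t') AS (a:chararray, b:chararray);

-- Put the smaller name first and the larger name second in every row
B = FOREACH A GENERATE
        (a < b ? a : b) AS first,
        (a < b ? b : a) AS second;

-- With a canonical order, permutations are now exact duplicates
C = DISTINCT B;

DUMP C;
```

Note that the surviving rows are in canonical order, so a pair that only ever appeared as "Paul John" would be emitted as "John Paul".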