Pig: how to filter distinct couples (pairs)

Problem description

I am new to Pig. I have a Pig script which generates tab-separated pairs of elements, one pair per line. For example:

John   Paul
Tom    Nik
Mark   Bill
Tom    Nik
Paul   John

I need to filter out duplicate combinations. If I use DISTINCT, only the duplicate "Tom Nik" entry is removed. The result is:

John   Paul
Tom    Nik
Mark   Bill
Paul   John

The problem with this approach is that I am left with both "John Paul" and "Paul John", which for my purposes should be treated as the same combination. Is there a way to remove permuted combinations?

Solution

I'm not sure how string comparisons are implemented in Pig, but it may be worthwhile to try something like:

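-- A is your input
-- Put the smaller field first so both orderings of a pair become identical
B = FOREACH A GENERATE FLATTEN(($0 < $1 ? ($0, $1) : ($1, $0)));
-- Identical tuples now collapse into one
C = DISTINCT B;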

By ordering the names within each pair so that the 'smaller' one always appears first, both "John Paul" and "Paul John" should end up in the same order, so DISTINCT eliminates one of them.

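However, this approach depends entirely on how the string comparison is implemented. For example, if it compared by length, the "John Paul" case would not be filtered correctly: both names have four characters, so neither is 'smaller', the condition is false for both orderings, and both get swapped rather than normalized.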
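
To make that caveat concrete, here is a small illustration of the same swap-then-DISTINCT logic. It is written in Python rather than Pig, purely as a stand-in for the reasoning; the helper names (canonical, distinct, lexicographic, by_length) are hypothetical and not part of any Pig API, and it makes no claim about how Pig itself compares fields.

```python
# Sample pairs from the question.
pairs = [
    ("John", "Paul"),
    ("Tom", "Nik"),
    ("Mark", "Bill"),
    ("Tom", "Nik"),
    ("Paul", "John"),
]


def canonical(pair, less_than):
    """Mirror the bincond: keep the order if a < b, otherwise swap."""
    a, b = pair
    return (a, b) if less_than(a, b) else (b, a)


def distinct(rows):
    """Drop duplicate rows, keeping first-seen order."""
    return list(dict.fromkeys(rows))


def lexicographic(a, b):
    """Character-by-character string comparison."""
    return a < b


def by_length(a, b):
    """Hypothetical comparison that only looks at string length."""
    return len(a) < len(b)


lex_result = distinct(canonical(p, lexicographic) for p in pairs)
len_result = distinct(canonical(p, by_length) for p in pairs)

print(lex_result)
print(len_result)
```

With the lexicographic comparison, the two "John"/"Paul" rows normalize to the same tuple and three pairs remain. With the length-only comparison, equal-length names are always swapped, so both orderings survive and four pairs remain. In Pig it is therefore worth checking the declared field types and how they compare before relying on this trick.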
