Is it possible to detect and handle string collisions among grouped values when grouping in Hadoop Pig?

Question

Assuming I have lines of data like the following that show user names and their favorite fruits:

Alice\tApple
Bob\tApple
Charlie\tGuava
Alice\tOrange

I'd like to create a pig query that shows the favorite fruit of each user. If a user appears multiple times, then I'd like to show "Multiple". For example, the result with the data above should be:

Alice\tMultiple
Bob\tApple
Charlie\tGuava

In SQL, this could be done with something like the following (although it wouldn't necessarily perform very well):

select user, case when count(fruit) > 1 then 'Multiple' else max(fruit) end
from FruitPreferences
group by user

But I can't figure out the equivalent Pig Latin. Any ideas?

Recommended answer

Write an "Aggregate Function" Pig UDF (see the "Aggregate Functions" section of the Pig UDF documentation). This is a user-defined function that takes a bag and outputs a scalar. So basically, your UDF would take in the bag, determine whether there is more than one item in it, and return either that single item or 'Multiple' accordingly.

I can think of a way of doing this without a UDF, but it is definitely awkward. After your GROUP, use SPLIT to split your data set into two: one in which the count is 1 and one in which the count is more than one:

SPLIT grouped INTO one IF COUNT(fruit) == 1, more IF COUNT(fruit) > 1;

Then, separately use FOREACH ... GENERATE on each to transform it:

one = FOREACH one GENERATE name, MAX(fruit); -- hack using MAX to get the item
more = FOREACH more GENERATE name, 'Multiple';

Finally, union them back:

out = UNION one, more;
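
Put together, the split/recombine steps above could be sketched as a complete script. This is only a sketch under assumptions: the input path 'fruit_preferences.tsv' and the relation names (prefs, counted, single, multi) are hypothetical and not from the original post, and the count is computed in a FOREACH before the SPLIT so that the conditions test a plain column rather than calling COUNT inside SPLIT. Like the snippets above, it has not been run against a cluster.

```pig
-- Hypothetical input file: tab-separated user name and fruit, one pair per line.
prefs = LOAD 'fruit_preferences.tsv' USING PigStorage('\t')
        AS (name:chararray, fruit:chararray);

-- One row per user; the bag is named after the grouped relation (prefs).
grouped = GROUP prefs BY name;

-- Compute the count and a candidate fruit once, so SPLIT can test a simple column.
counted = FOREACH grouped GENERATE
              group AS name,
              COUNT(prefs) AS n,
              MAX(prefs.fruit) AS fruit;

-- Users with exactly one row vs. users with several rows.
SPLIT counted INTO single IF n == 1L, multi IF n > 1L;

single_out = FOREACH single GENERATE name, fruit;
multi_out  = FOREACH multi  GENERATE name, 'Multiple' AS fruit;

result = UNION single_out, multi_out;
DUMP result;
```

Note that after a GROUP the key is exposed as group and the bag takes the name of the grouped relation, which is why the sketch refers to group and prefs.fruit rather than to name and fruit directly.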

I haven't really found a better way of handling the same data set in two different ways based on some condition, which is what you want here. I typically do some sort of split/recombine like I did above. I believe Pig will be smart and make a plan that doesn't use more than one M/R job.

Disclaimer: I can't actually test this code at the moment, so it may have some mistakes.

Update:

After looking harder, I was reminded of the bincond operator (? :), and I think that will work here:

b = FOREACH grouped GENERATE name, (COUNT(fruit) == 1 ? MAX(fruit) : 'Multiple');
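
For reference, here is a minimal end-to-end sketch of the bincond approach. The input path 'fruit_preferences.tsv' and the relation names prefs and result are assumptions made for illustration, not part of the original answer, and the script is untested here. It relies on MAX accepting a bag of chararray values, so that both branches of the bincond have the same type.

```pig
-- Hypothetical input file: tab-separated user name and fruit, one pair per line.
prefs = LOAD 'fruit_preferences.tsv' USING PigStorage('\t')
        AS (name:chararray, fruit:chararray);

grouped = GROUP prefs BY name;

-- For each user: a single row keeps its fruit, anything else becomes 'Multiple'.
-- MAX is only a trick to pull the lone chararray out of a one-element bag.
result = FOREACH grouped GENERATE
             group AS name,
             (COUNT(prefs) == 1L ? MAX(prefs.fruit) : 'Multiple') AS fruit;

DUMP result;
```

As with the SQL version in the question, a user who lists the same fruit twice is also reported as 'Multiple', because the test counts rows rather than distinct fruits.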
