Is it possible to cross-join a row in a relation with a tuple in that row in Pig?

Question

I have a set of data that shows users, collections of fruit they like, and home city:

Alice\tApple:Orange\tSacramento
Bob\tApple\tSan Diego
Charlie\tApple:Pineapple\tSacramento

I would like to write a Pig query that counts the users who like each type of fruit in each city, so that the results for the data above would look like this:

Apple\tSacramento\t2
Apple\tSan Diego\t1
Orange\tSacramento\t1
Pineapple\tSacramento\t1

The part I can't figure out is how to cross join the split fruit rows with the rest of the data from the same row, so:

Alice\tApple:Orange\tSacramento

becomes:

Alice\tApple\tSacramento 
Alice\tOrange\tSacramento

I know I can use TOKENIZE to split the string 'Apple:Orange' into the tuple ('Apple', 'Orange'), but I don't know how to get the cross product of that tuple with the rest of the row ('Alice').

One brute-force solution I came up with is to use streaming to run the input through an external program and handle the "cross join" there, producing multiple output rows per input row.

This seems like it should be unnecessary though. Are there better ideas?

Recommended answer

You should use FLATTEN, which works great with TOKENIZE to do stuff like this.

-- ':' is passed explicitly because TOKENIZE's default delimiters do not include it
b = FOREACH a GENERATE name, FLATTEN(TOKENIZE(fruits, ':')) AS fruit, city;

FLATTEN takes a bag and "flattens" it out across different rows. TOKENIZE breaks your fruits out into a bag (not a tuple like you said), and then FLATTEN does the cross-like behavior like you are looking for. I point out that it is a bag and not a tuple, because FLATTEN is overloaded and behaves differently with tuples.
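
To make the bag-versus-tuple distinction concrete, here is a minimal end-to-end sketch of the whole query. It rests on several assumptions that are not in the original post: the input is a tab-separated file at the hypothetical path 'users.tsv', the field names name, fruits and city are made up for illustration, and the Pig version in use supports the optional delimiter argument of TOKENIZE.

```pig
-- Hypothetical input path and schema; adjust to your data.
a = LOAD 'users.tsv' USING PigStorage('\t')
    AS (name:chararray, fruits:chararray, city:chararray);

-- TOKENIZE returns a bag, so FLATTEN emits one output row per fruit,
-- repeating name and city on each row.
b = FOREACH a GENERATE name, FLATTEN(TOKENIZE(fruits, ':')) AS fruit, city;

-- Count users per (fruit, city) pair.
g = GROUP b BY (fruit, city);
counts = FOREACH g GENERATE FLATTEN(group) AS (fruit, city), COUNT(b) AS num_users;
sorted = ORDER counts BY fruit, city;
DUMP sorted;

-- Contrast: STRSPLIT returns a tuple, so FLATTEN widens the same row into
-- extra columns instead of producing extra rows.
t = FOREACH a GENERATE name, FLATTEN(STRSPLIT(fruits, ':')), city;
```

With the three sample rows from the question, sorted should contain the four (fruit, city, count) rows the question asks for, while t would still have three rows, with Alice's row holding both Apple and Orange as separate columns.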

I first learned of the FLATTEN/TOKENIZE technique in the canonical word count example, which tokenizes each line of text into words, then flattens the words out into rows.
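
For reference, the word count pattern looks roughly like the sketch below. The input path 'input.txt' and the alias names are assumptions chosen for illustration, not taken from the original answer.

```pig
-- Hypothetical input: a plain text file, one line per record.
lines = LOAD 'input.txt' AS (line:chararray);

-- TOKENIZE turns each line into a bag of words (default delimiters include
-- spaces and commas); FLATTEN then emits one row per word.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words and count the rows in each group.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```

The fruit query follows the same shape; the only differences are the extra columns carried alongside the flattened bag and the explicit ':' delimiter.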
