Is it possible to cross-join a row in a relation with a tuple in that row in Pig?
Question
I have a set of data that shows users, collections of fruit they like, and home city:
Alice\tApple:Orange\tSacramento
Bob\tApple\tSan Diego
Charlie\tApple:Pineapple\tSacramento
I would like to create a Pig query that counts the number of users who enjoy each type of fruit in each city. For the data above, the query results would look like this:
Apple\tSacramento\t2
Apple\tSan Diego\t1
Orange\tSacramento\t1
Pineapple\tSacramento\t1
The part I can't figure out is how to cross join the split fruit rows with the rest of the data from the same row, so:
Alice\tApple:Orange\tSacramento
becomes:
Alice\tApple\tSacramento
Alice\tOrange\tSacramento
I know I can use TOKENIZE to split the string 'Apple:Orange' into the tuple ('Apple', 'Orange'), but I don't know how to get the cross product of that tuple with the rest of the row ('Alice').
One brute-force solution I came up with is to use streaming to run the input collection through an external program and handle the "cross join" there, producing multiple output rows per input row.
This seems like it should be unnecessary though. Are there better ideas?
Recommended answer
You should use FLATTEN, which works great with TOKENIZE to do stuff like this.
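-- ':' is passed explicitly because TOKENIZE's default delimiters (space, double quote, comma, parentheses, star) do not include it
b = FOREACH a GENERATE name, FLATTEN(TOKENIZE(fruits, ':')) AS fruit, city;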
FLATTEN takes a bag and "flattens" it out across different rows. TOKENIZE breaks your fruits out into a bag (not a tuple like you said), and then FLATTEN does the cross-like behavior you are looking for. I point out that it is a bag and not a tuple because FLATTEN is overloaded and behaves differently with tuples.
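To illustrate the bag-versus-tuple difference, here is a minimal sketch that is not part of the original answer. It assumes a hypothetical input file 'users.tsv' with the three tab-separated columns from the question, and a Pig version in which TOKENIZE accepts an optional delimiter argument. STRSPLIT returns a tuple, so flattening its result widens the row instead of multiplying it.

```pig
-- Hypothetical path and schema; adjust to your data.
a = LOAD 'users.tsv' USING PigStorage('\t')
    AS (name:chararray, fruits:chararray, city:chararray);

-- Bag: TOKENIZE returns a bag, so FLATTEN emits one output row per fruit.
-- Alice's row should become (Alice,Apple,Sacramento) and (Alice,Orange,Sacramento).
by_bag = FOREACH a GENERATE name, FLATTEN(TOKENIZE(fruits, ':')) AS fruit, city;

-- Tuple: STRSPLIT returns a tuple, so FLATTEN splices its fields into the same row.
-- Alice's row should become the single row (Alice,Apple,Orange,Sacramento).
by_tuple = FOREACH a GENERATE name, FLATTEN(STRSPLIT(fruits, ':')), city;
```

Note that the tuple form yields rows of varying width (Bob's row has only one fruit), which is why the bag form is the one that fits this problem.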
I first learned of the FLATTEN/TOKENIZE technique in the canonical word count example, which tokenizes each line of text and then flattens the words out into rows.
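Putting the pieces together, the whole query from the question might be sketched as follows. This goes beyond what the answer spells out: the file name 'users.tsv', the relation names, and the num_users alias are assumptions for illustration, and the ':' delimiter is passed explicitly on the assumption that your Pig version supports TOKENIZE's optional second argument.

```pig
-- Hypothetical path and schema; adjust to your data.
a = LOAD 'users.tsv' USING PigStorage('\t')
    AS (name:chararray, fruits:chararray, city:chararray);

-- One row per (user, fruit, city).
b = FOREACH a GENERATE name, FLATTEN(TOKENIZE(fruits, ':')) AS fruit, city;

-- Group by the (fruit, city) pair and count the users in each group.
c = GROUP b BY (fruit, city);
d = FOREACH c GENERATE FLATTEN(group) AS (fruit, city), COUNT(b) AS num_users;

-- Sort for readable output.
e = ORDER d BY fruit, city;
DUMP e;
```

Here FLATTEN is used a second time on `group`, which is a tuple, so it simply unpacks the grouping key into the fruit and city columns. For the three sample rows in the question, the counts should correspond to the desired output listed there.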