How to get array/bag of elements from Hive group by operator?
I want to group by a given field and get, for each group, the values of the other field collected together. Below is an example of what I am trying to achieve:
Imagine a table named 'sample_table' with two columns as below:
F1 F2
001 111
001 222
001 123
002 222
002 333
003 555
I want to write a Hive query that gives the following output:
001 [111, 222, 123]
002 [222, 333]
003 [555]
In Pig, this can be achieved very easily with something like this:
grouped_relation = GROUP sample_table BY F1;
Can somebody suggest a simple way to do this in Hive? All I can think of is writing a User Defined Function (UDF), but that may be a very time-consuming option.
The built-in aggregate function collect_set (documented here) gets you almost what you want. It would actually work on your example input:
SELECT F1, collect_set(F2)
FROM sample_table
GROUP BY F1
Unfortunately, it also removes duplicate elements, and I imagine this isn't your desired behavior. I find it odd that collect_set exists but there is no version that keeps duplicates. Someone else apparently thought the same thing. It looks like the top and second answers there will give you the UDAF you need.
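To make the difference concrete, here is a minimal Python sketch (not Hive code) that mimics the two behaviors on the sample data. The extra duplicate row ("001", "111") and the helper names group_distinct and group_all are assumptions added for illustration. Hive does not guarantee the order of elements in the collected array, so the first-seen order used here is only for readability. Newer Hive releases (0.13 and later, to my knowledge) also ship a built-in collect_list that keeps duplicates, which would make a custom UDAF unnecessary there.

```python
from collections import defaultdict

# Rows of sample_table as (F1, F2) pairs. The fourth row is a duplicate
# added for illustration only; it is not part of the original example.
rows = [
    ("001", "111"), ("001", "222"), ("001", "123"), ("001", "111"),
    ("002", "222"), ("002", "333"),
    ("003", "555"),
]


def group_distinct(rows):
    """Mimic GROUP BY F1 with a collect_set-style aggregate (duplicates dropped)."""
    groups = defaultdict(list)
    for key, value in rows:
        if value not in groups[key]:
            groups[key].append(value)
    return dict(groups)


def group_all(rows):
    """Mimic GROUP BY F1 with an aggregate that keeps every value."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)


distinct = group_distinct(rows)
everything = group_all(rows)

print(distinct["001"])    # ['111', '222', '123']
print(everything["001"])  # ['111', '222', '123', '111']
```

On the original six rows both helpers return the same result, which is why collect_set happens to work for the example input in the question.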