How to get an array/bag of elements from the Hive GROUP BY operator?

Question

I want to group by a given field and collect the values of another field for each group. Below is an example of what I am trying to achieve:

Imagine a table named 'sample_table' with two columns, as below:

F1  F2
001 111
001 222
001 123
002 222
002 333
003 555

I want to write a Hive query that gives the following output:

001 [111, 222, 123]
002 [222, 333]
003 [555]

In Pig, this can be achieved very easily with something like this:

grouped_relation = GROUP sample_table BY F1;

Can somebody please suggest whether there is a simple way to do this in Hive? All I can think of is writing a User Defined Function (UDF) for it, but that may be a very time-consuming option.

Recommended answer

The built-in aggregate function collect_set (documented here) gets you almost what you want. It would actually work on your example input:

SELECT F1, collect_set(F2)
FROM sample_table
GROUP BY F1;

Unfortunately, it also removes duplicate elements, and I imagine this isn't your desired behavior. I find it odd that collect_set exists but there is no version that keeps duplicates. Someone else apparently thought the same thing. It looks like the top and second answers there will give you the UDAF you need.
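As a side note on the duplicate problem: newer Hive releases (0.13 and later) ship a built-in collect_list aggregate that keeps duplicates, so a custom UDAF may not be needed there. The query below is a minimal sketch that assumes such a Hive version and the sample_table from the question. The column aliases are made up for illustration.

```sql
-- Assumption: Hive 0.13 or later, where collect_list is a built-in aggregate.
-- collect_list keeps every value in the group, including duplicates;
-- collect_set returns each distinct value once.
SELECT F1,
       collect_list(F2) AS all_values,       -- hypothetical alias
       collect_set(F2)  AS distinct_values   -- hypothetical alias
FROM sample_table
GROUP BY F1;
```

Neither function guarantees the order of elements within the returned array, so the values may not come back in the order they appear in the table.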
