Collect to a Map in Hive


Question

I have a Hive table such as:

id  |  value
-------------
A      1
A      2
B      3
A      4
B      5

Essentially, I want to mimic Python's defaultdict(list) and create a map with each id as a key and the list of its value entries as the corresponding value.

Query:

select COLLECT_TO_A_MAP(id, value)
from table

Output:

{A:[1,2,4], B:[3,5]}
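
For reference, the target semantics can be sketched in Python. This is only an illustration of the desired grouping, not a Hive implementation; the function name collect_to_map and the rows list are made up here to mirror the example table.

```python
from collections import defaultdict

# Rows of the example table as (id, value) pairs (illustrative data).
rows = [("A", 1), ("A", 2), ("B", 3), ("A", 4), ("B", 5)]


def collect_to_map(pairs):
    """Group values by key, appending each value to that key's list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


result = collect_to_map(rows)
print(result)  # {'A': [1, 2, 4], 'B': [3, 5]}
```

The lists here follow the input row order; a distributed aggregation in Hive may not preserve that order.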

I tried using Klout's CollectUDAF(), but it does not appear to append the values to an array; it just overwrites them. Any ideas?

Here is a more detailed description, so that I can avoid answers that simply tell me to try functions from the Hive documentation. Suppose I have a table

num    |id    |value
____________________
1       A      1
1       A      2
1       B      3
2       A      4
2       B      5
2       B      6

What I am looking for is a UDAF that produces this output

num     |new_map
________________________
1       {A:[1,2], B:[3]}
2       {A:[4], B:[5,6]}

for this query:

select num
      ,COLLECT_TO_A_MAP(id, value) as new_map
from table
group by num

There is a workaround to achieve this. It can be mimicked by using Klout's CollectUDAF() (the UDAF referenced above) in a query such as

add jar '~/brickhouse/target/brickhouse-0.6.0.jar';
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select num
       ,collect(id_array, value_array) as new_map
from (
      select collect_list(id) as id_array
            ,collect_list(value) as value_array
            ,num
      from table
      group by num
     ) A
group by num

However, I would rather not write a nested query.

Edit #2

As mentioned in my original question, I have already tried Klout's CollectUDAF(), including the form where you pass it two parameters and it creates a map. Applied to the dataset in my first edit, its output is

1    {A:2, B:3}
2    {A:4, B:6}

As stated in my original question, it does not collect the values into an array; it keeps only the last value for each key (that is, it overwrites instead of appending).
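
The difference between the observed and the desired behaviour can be modelled in Python. This is a sketch of the two semantics only, assuming a later row simply replaces an earlier one for the same key; it is not Brickhouse's actual implementation, and the names collect_overwrite and collect_append are hypothetical.

```python
from collections import defaultdict

# Rows of the second example table as (num, id, value) triples.
rows = [(1, "A", 1), (1, "A", 2), (1, "B", 3),
        (2, "A", 4), (2, "B", 5), (2, "B", 6)]


def collect_overwrite(triples):
    """Per num, keep one value per id: a later row replaces an earlier one."""
    out = defaultdict(dict)
    for num, key, value in triples:
        out[num][key] = value
    return {num: dict(m) for num, m in out.items()}


def collect_append(triples):
    """Per num, append every value to the list stored under its id."""
    out = defaultdict(lambda: defaultdict(list))
    for num, key, value in triples:
        out[num][key].append(value)
    return {num: dict(m) for num, m in out.items()}


print(collect_overwrite(rows))  # {1: {'A': 2, 'B': 3}, 2: {'A': 4, 'B': 6}}
print(collect_append(rows))     # {1: {'A': [1, 2], 'B': [3]}, 2: {'A': [4], 'B': [5, 6]}}
```

The first function reproduces the output shown above; the second reproduces the output being asked for.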

Recommended answer

Use the collect UDF in Brickhouse (http://github.com/klout/brickhouse )

It is exactly what you need. Brickhouse's 'collect' returns a list if one parameter is used, and a map if two parameters are used.
