Collect to a Map in Hive


Problem description


I have a Hive table such as

id  |  value
-------------
A      1
A      2
B      3
A      4
B      5

Essentially, I want to mimic Python's defaultdict(list) and create a map with each id as a key and the list of its value entries as the corresponding value.

Query:

select COLLECT_TO_A_MAP(id, value)
from table

Output:

{A:[1,2,4], B:[3,5]}
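
For reference, the behaviour I am after can be sketched in plain Python. This is only an illustration of the semantics, not Hive code: the `rows` list is a hypothetical in-memory stand-in for the table above, and `collect_to_a_map` is an illustrative name, not an existing Hive or Brickhouse function.

```python
from collections import defaultdict

# Hypothetical stand-in for the Hive table above: (id, value) pairs.
rows = [("A", 1), ("A", 2), ("B", 3), ("A", 4), ("B", 5)]


def collect_to_a_map(pairs):
    """Group values by key, appending to a list instead of overwriting."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Convert to a plain dict so the result prints like an ordinary map.
    return dict(grouped)


result = collect_to_a_map(rows)
print(result)
```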

I tried using Klout's CollectUDAF(), but it does not appear to append the values to an array; it just overwrites them. Any ideas?

EDIT: Here is a more detailed description, so that I can avoid answers that simply point me to functions in the Hive documentation. Suppose I have a table

num    |id    |value
____________________
1       A      1
1       A      2
1       B      3
2       A      4
2       B      5
2       B      6

What I am looking for is a UDAF that provides this output

num     |new_map
________________________
1       {A:[1,2], B:[3]}
2       {A:[4], B:[5,6]}

for this query

select num
      ,COLLECT_TO_A_MAP(id, value) as new_map
from table
group by num
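
In Python terms, the grouped version would look roughly like the sketch below. Again, `rows` is a hypothetical in-memory copy of the table and `collect_to_a_map_by_num` is an illustrative name; the point is only to show the two levels of grouping (first by num, then by id) that the UDAF would need to perform.

```python
from collections import defaultdict

# Hypothetical stand-in for the table above: (num, id, value) rows.
rows = [
    (1, "A", 1),
    (1, "A", 2),
    (1, "B", 3),
    (2, "A", 4),
    (2, "B", 5),
    (2, "B", 6),
]


def collect_to_a_map_by_num(records):
    """For each num, build a map from id to the list of its values."""
    per_num = defaultdict(lambda: defaultdict(list))
    for num, key, value in records:
        per_num[num][key].append(value)
    # Convert the nested defaultdicts to plain dicts for readability.
    return {num: dict(id_map) for num, id_map in per_num.items()}


new_map = collect_to_a_map_by_num(rows)
for num, id_map in new_map.items():
    print(num, id_map)
```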

There is a workaround to achieve this. It can be mimicked by using Klout's CollectUDAF() (the UDAF referenced above) in a query such as

add jar '~/brickhouse/target/brickhouse-0.6.0.jar';
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select num
       ,collect(id_array, value_array) as new_map
from (
      select collect_list(id) as id_array
            ,collect_list(value) as value_array
            ,num
      from table
      group by num
     ) A
group by num

However, I would rather not write a nested query.

EDIT #2

(As referenced in my original question) I have already tried using Klout's CollectUDAF(), including the case where you pass it two parameters and it creates a map. The output from that (if applied to the dataset in my first edit) is

1    {A:2, B:3}
2    {A:4, B:6}

As stated in my original question, it doesn't collect the values into an array; it just keeps the last one (or updates the array).
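
The difference can be pictured with a plain Python dict. The sketch below is an analogy for the output shown above, not a description of how Brickhouse is implemented: assigning to a key keeps only the last value seen, whereas what I want is an append. The `rows` list is a hypothetical copy of the table from my first edit.

```python
# Hypothetical stand-in for the table in the first edit: (num, id, value) rows.
rows = [
    (1, "A", 1),
    (1, "A", 2),
    (1, "B", 3),
    (2, "A", 4),
    (2, "B", 5),
    (2, "B", 6),
]

# Overwrite semantics: each (num, id) pair keeps only the last value seen.
overwritten = {}
for num, key, value in rows:
    overwritten.setdefault(num, {})[key] = value

# Append semantics: each (num, id) pair accumulates every value in a list.
appended = {}
for num, key, value in rows:
    appended.setdefault(num, {}).setdefault(key, []).append(value)

print(overwritten)
print(appended)
```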

Solution

Use the collect UDF in Brickhouse (http://github.com/klout/brickhouse)

It is exactly what you need. Brickhouse's 'collect' returns a list if one parameter is used, and a map if two parameters are used.
