Collect to a Map in Hive
Question
I have a Hive table such as
id | value
-------------
A 1
A 2
B 3
A 4
B 5
Essentially, I want to mimic Python's defaultdict(list) and create a map with id as the keys and value as the values.
Query:
select COLLECT_TO_A_MAP(id, value)
from table
Output:
{A:[1,2,4], B:[3,5]}
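As a point of reference, the defaultdict(list) behavior mentioned above can be sketched in plain Python. This is only a model of the desired semantics, not Hive code; the names rows and collect_to_map are illustrative and do not come from the original question.

```python
from collections import defaultdict

# Illustrative copy of the (id, value) rows from the table above.
rows = [("A", 1), ("A", 2), ("B", 3), ("A", 4), ("B", 5)]


def collect_to_map(pairs):
    """Append each value to the list stored under its key."""
    result = defaultdict(list)
    for key, value in pairs:
        result[key].append(value)
    # Convert to a plain dict so the result prints like an ordinary map.
    return dict(result)


print(collect_to_map(rows))
```

Values are appended in the order the rows are read. In Hive, row order within an aggregation is generally not guaranteed, so the order inside each array may differ.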
I tried using Klout's CollectUDAF(), but it appears that it will not append the values to an array; it will just update them. Any ideas?
EDIT: Here is a more detailed description, so I can avoid answers that simply tell me to try functions from the Hive documentation. Suppose I have a table
num | id | value
____________________
1 A 1
1 A 2
1 B 3
2 A 4
2 B 5
2 B 6
What I am looking for is a UDAF that provides this output
num | new_map
________________________
1 {A:[1,2], B:[3]}
2 {A:[4], B:[5,6]}
for this query
select num
,COLLECT_TO_A_MAP(id, value) as new_map
from table
group by num
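The grouped behavior this query asks for can be modeled in Python as a map of maps of lists. This is a sketch of the intended result only, under the assumption that rows arrive in table order; rows and collect_to_map_by_group are hypothetical names used for illustration.

```python
from collections import defaultdict

# Illustrative copy of the (num, id, value) rows from the table above.
rows = [
    (1, "A", 1), (1, "A", 2), (1, "B", 3),
    (2, "A", 4), (2, "B", 5), (2, "B", 6),
]


def collect_to_map_by_group(triples):
    """For each num, build a map from id to the list of its values."""
    grouped = defaultdict(lambda: defaultdict(list))
    for num, key, value in triples:
        grouped[num][key].append(value)
    # Convert the nested defaultdicts to plain dicts for readable output.
    return {num: dict(inner) for num, inner in grouped.items()}


for num, new_map in collect_to_map_by_group(rows).items():
    print(num, new_map)
```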
There is a workaround to achieve this. It can be mimicked by using Klout's CollectUDAF() (the UDAF referenced above) in a query such as
add jar '~/brickhouse/target/brickhouse-0.6.0.jar';
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
select num
,collect(id, value_array) as new_map
from (
-- inner query: build one list of values per (num, id) pair
select num
,id
,collect_list(value) as value_array
from table
group by num, id
) A
group by num;
However, I would rather not write a nested query.
EDIT #2
(As referenced in my original question) I have already tried using Klout's CollectUDAF(), including the case where you pass it two parameters so that it creates a map. The output from that (if applied to the dataset in my first edit) is
1 {A:2, B:3}
2 {A:4, B:6}
As stated in my original question, it doesn't collect the values into an array; it just keeps the last one (in other words, it updates the value instead of appending to it).
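The difference between the observed and the desired behavior can be illustrated with a small Python model. It reproduces the output shown above by overwriting the value for each key; it is not Brickhouse's actual implementation, and collect_last_value is a hypothetical name.

```python
from collections import defaultdict

# Illustrative copy of the (num, id, value) rows from the first edit.
rows = [
    (1, "A", 1), (1, "A", 2), (1, "B", 3),
    (2, "A", 4), (2, "B", 5), (2, "B", 6),
]


def collect_last_value(triples):
    """Model of the observed behavior: a later value replaces an earlier one."""
    result = defaultdict(dict)
    for num, key, value in triples:
        result[num][key] = value  # overwrite instead of append
    return dict(result)


print(collect_last_value(rows))
```

Which value survives depends on the order in which rows are processed, so on a real cluster the kept value is not necessarily the last one in the table.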
Answer
Use the collect UDF in Brickhouse (http://github.com/klout/brickhouse).
It is exactly what you need. Brickhouse's 'collect' returns a list if one parameter is used, and a map if two parameters are used.