如何通过蜂巢中的collect_set()操作使用订单 [英] how to use order by with collect_set() operation in hive
问题描述
在表1中,我有customer_id,item_id和item_rank(根据某些销售情况的项目等级).我想为每个customer_id收集项目列表,并根据item_rank进行排列.
In Table 1, I have customer_id, item_id and item_rank (rank of item according to some sales). I want to collect a list of items for each customer_id and arrange them according to item_rank.
Customer_id item_id rank_item
23 2 3
23 2 3
23 4 2
25 5 1
25 4 2
我期望的输出是
Customer_id item_list
23 4,2
25 5,4
我使用的代码是
SELECT
customer_id,
concat_ws(',',collect_list (string(item_id))) AS item_list
FROM
table1
GROUP BY
customer_id
ORDER BY
item_rank
推荐答案
您可以使用子查询来获取结果集("customer_id","item_id","item_rank"),并按item_rank进行排序,然后使用 collect_set
在外部查询中.
You can use a sub-query to get a result set of (customer_id, item_id, item_rank), sorted by item_rank, and then use collect_set
in the outer query.
查询
WITH table1 AS (
SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
SELECT 23 AS customer_id, 4 AS item_id, 2 AS item_rank UNION ALL
SELECT 25 AS customer_id, 5 AS item_id, 1 AS item_rank UNION ALL
SELECT 25 AS customer_id, 4 AS item_id, 2 AS item_rank
)
SELECT
subquery.customer_id,
collect_set(subquery.item_id) AS item_id_set
FROM (
SELECT
table1.customer_id,
table1.item_id,
table1.item_rank
FROM table1
DISTRIBUTE BY
table1.customer_id
SORT BY
table1.customer_id,
table1.item_rank
) subquery
GROUP BY
subquery.customer_id
;
结果
customer_id item_id_set
0 23 [4,2]
1 25 [5,4]
子查询使用 customer_id
的所有行都路由到相同的reducer.然后,它使用 SORT BY
在每个化简器中按 customer_id
和 item_rank
进行排序.我希望这足以满足要求,因为我没有注意到对最终结果集进行整体排序的要求.(如果需要按 customer_id
进行总排序,那么我认为查询将不得不使用 ORDER BY
,这将导致执行速度变慢.)
The sub-query uses DISTRIBUTE BY
to guarantee that all rows for a particular customer_id
route to the same reducer. It then uses SORT BY
to sort by customer_id
and item_rank
within each reducer. I expect this is sufficient for the requirements, because I didn't notice a requirement for total ordering of the final result set. (If total ordering by customer_id
is a requirement, then I think the query would have to use ORDER BY
, which would cause slower execution.)
内部, collect_set
UDAF使用Java LinkedHashSet
,这是一个保留顺序的集合,因此子查询中使用的排序顺序将保留在外部查询的集合中.这在Hive代码库中可见:
Internally, the collect_set
UDAF uses a Java LinkedHashSet
, which is an order-preserving collection, so the same sort order used in the sub-query will be maintained in the outer query's set. This is visible in the Hive codebase here: