蜂巢-通过汇总组中的值来创建地图列类型 [英] Hive - Create map columns type by aggregating values across groups
问题描述
我有一个看起来像这样的表:
I have a table that looks like this:
|customer|category|room|date|
-----------------------------
|1 | A | aa | d1 |
|1 | A | bb | d2 |
|1 | B | cc | d3 |
|1 | C | aa | d1 |
|1 | C | bb | d2 |
|2 | A | aa | d3 |
|2 | A | bb | d4 |
|2 | C | bb | d4 |
|2 | C | ee | d5 |
|3 | D | ee | d6 |
我想在表格外创建两个地图:
I want to create two maps out of the table:
第一. map_customer_room_date :将对客户进行分组,并收集所有不同的房间(键)并带有日期(值)).
1st. map_customer_room_date: will group by customer and collect all different rooms (key) and with date (value).
我正在使用 collect()
UDF Brickhouse > 功能.
I'm using the collect()
UDF Brickhouse function.
可以使用类似于以下内容的文件对其进行存档:
select customer, collect(room,date) as map_customer_room_date
from table
group by customer
第二名. map_category_room_date 稍微复杂一点,还包含相同的地图类型 collect(room,date)
,它将包含键,所有类别中的所有房间,其中客户X是类别.这意味着对于 customer1 ,即使它属于 customer2 ,也会占用 ee
空间.这是因为customer1的类别为 C
,并且此类别也存在于客户2中.
2nd. map_category_room_date A bit more complicated, consists also of the same map type collect(room, date)
and it will contain as keys, all the rooms across ALL categories where for customer X is categories.
This means that for customer1 it will take room ee
even though it belongs to customer2. This is because customer1 has category C
and this category is also present in customer 2.
最终表是按客户分组的,外观如下:
The final table is grouped by customer and will look like:
|customer| map_customer_room_date | map_category_room_date |
-------------------------------------------------------------------|
| 1 |{aa: d1, bb: d2, cc: d3} |{aa: d1, bb: d2, cc: d3,ee: d6}|
| 2 |{aa: d3, bb: d4, ee: d6} |{aa: d3, bb: d4, ee: d6} |
| 3 |{ee: d6} |{ee: d6} |
我在构建第二张地图并展示所描述的决赛桌时遇到了问题.知道如何实现吗?
I am having issues building the second map and presenting the final table as described. Any idea how this can be accomplished?
推荐答案
在将结果合并为2个地图之前,可以使用一系列自连接来查找同一类别的其他房间,以实现此目的.
This can be accomplished using a series of self-joins to find other rooms in the same category before combining the results into 2 maps.
CREATE TABLE `table` AS
SELECT 1 AS customer, 'A' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'A' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 1 AS customer, 'B' AS category, 'cc' AS room, 'd3' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'aa' AS room, 'd3' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'ee' AS room, 'd5' AS `date` UNION ALL
SELECT 3 AS customer, 'D' AS category, 'ee' AS room, 'd6' AS `date`
;
SELECT
customer_rooms.customer,
collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
collect(
COALESCE(customer_category_rooms.room, category_rooms.room),
COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
FROM `table` AS customer_rooms
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room
WHERE (
customer_rooms.customer = customer_category_rooms.customer AND
customer_rooms.category = customer_category_rooms.category AND
customer_rooms.room = customer_category_rooms.room AND
customer_rooms.date = customer_category_rooms.date
)
OR (
customer_category_rooms.customer IS NULL AND
customer_category_rooms.category IS NULL AND
customer_category_rooms.room IS NULL AND
customer_category_rooms.date IS NULL
)
GROUP BY
customer_rooms.customer
;
结果集
1 {"aa":"d1","bb":"d2","cc":"d3"} {"aa":"d1","bb":"d2","cc":"d3","ee":"d5"}
2 {"aa":"d3","bb":"d4","ee":"d5"} {"aa":"d3","bb":"d4","ee":"d5"}
3 {"ee":"d6"} {"ee":"d6"}
说明
FROM `table` AS customer_rooms
首先,从最初的表
中提取结果.我们将此关系命名为 customer_rooms
.正如您在问题中已经提到的那样,足够构建 map_customer_room_date
.
First, results are drawn from the initial table
. We name this relation customer_rooms
. As you already noted in the question, this much is sufficient to build map_customer_room_date
.
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
第一个自我联接标识与 customer_rooms
行中明确提及的房间具有相同类别的所有房间.我们将此关系命名为 category_rooms
.
The first self-join identifies all rooms that have the same categories as the rooms explicitly mentioned in the customer_rooms
rows. We name this relation category_rooms
.
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room
第二个自联接将获取我们在 category_rooms
中标识的房间,并尝试查找此房间是否已由 customer_rooms
中标识的客户持有.我们将此关系命名为 customer_category_rooms
.这是一个 LEFT OUTER JOIN
,因为我们要保留先前联接中的所有行.结果将是1)来自客户已经拥有此房间的 customer_rooms
和 customer_category_rooms
的值相同,或2)来自 customer_category_rooms 的值code>全部为
NULL
,因为客户没有持有此房间,但它是同一类别之一的房间.这种区别将变得很重要,这样,如果客户已经拥有房间,我们就可以保留他们的 date
.
The second self-join takes the rooms that we identified in category_rooms
and attempts to find if this room is already held by the customer identified in customer_rooms
. We name this relation customer_category_rooms
. This is a LEFT OUTER JOIN
, because we want to preserve all rows from the prior joins. The outcome will either be 1) the values from customer_rooms
and customer_category_rooms
are identical, because the customer already holds this room, or 2) the values from customer_category_rooms
will be all NULL
, because the customer doesn't hold this room, but it's a room in one of the same categories. This distinction will become important so that we can preserve the date
of the customer if they already hold the room.
下一步,我们需要过滤.
Next, we need to filter.
WHERE (
customer_rooms.customer = customer_category_rooms.customer AND
customer_rooms.category = customer_category_rooms.category AND
customer_rooms.room = customer_category_rooms.room AND
customer_rooms.date = customer_category_rooms.date
)
这包括客户在原始表
中明确拥有的房间.
This includes the rooms that are explicitly held by the customer in the original table
.
OR (
customer_category_rooms.customer IS NULL AND
customer_category_rooms.category IS NULL AND
customer_category_rooms.room IS NULL AND
customer_category_rooms.date IS NULL
)
这包括不是由客户拥有的房间,而是与客户拥有的房间属于同一类别的房间.
This includes the rooms that are not held by the customer but belong in the same categories as rooms held by the customer.
collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
可以通过从表中收集原始数据来构建
map_customer_room_date
,我们将其别名为 customer_rooms
.
map_customer_room_date
can be built by collecting the original data from the table, which we had aliased as customer_rooms
.
collect(
COALESCE(customer_category_rooms.room, category_rooms.room),
COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
构建 map_category_room_date
更为复杂.如果客户明确持有房间,那么我们要保留该日期
.但是,如果客户没有明确持有房间,那么我们希望能够使用另一行中具有重叠类别的 room
和 date
.为此,我们使用Hive COALESCE 函数选择不是 NULL
的第一个值.如果客户已经拥有房间(在 customer_category_rooms
中的非 NULL
值中存在),那么我们将使用该房间.如果没有,那么我们将改用 category_rooms
中的值.
Building map_category_room_date
is more complex. If the customer holds the room explicitly, then we want to preserve that date
. However, if the customer does not hold the room explicitly, then we want to be able to use the room
and date
from another row that has an overlapping category. To accomplish that, we use the Hive COALESCE function to choose the first value that is not NULL
. If the customer already holds the room (as present in non-NULL
values in customer_category_rooms
), then we'll use that. If not, then we'll use values from category_rooms
instead.
请注意,如果相同的类别/房间组合可以映射到多个 date
值,仍然可能存在歧义.如果那很重要,那么您可能需要做更多的工作以根据某些业务规则(例如,使用最快的 date
)选择正确的 date
或映射到多个 date
值,而不是单个值.如果有类似的其他要求,这应该为您提供一个良好的起点.
Note that there can still be some ambiguity if the same category/room combination can map to multiple date
values. If that is significant, then you might need to put more work into this to choose the right date
based on some business rule (e.g. use the soonest date
) or map to multiple date
values instead of a single value. This should give you a good starting point though if there are additional requirements like that.
这篇关于蜂巢-通过汇总组中的值来创建地图列类型的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!