蜂巢-通过汇总组中的值来创建地图列类型 [英] Hive - Create map columns type by aggregating values across groups

查看:113
本文介绍了蜂巢-通过汇总组中的值来创建地图列类型的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个看起来像这样的表:

I have a table that looks like this:

|customer|category|room|date|
-----------------------------
|1       |   A    | aa | d1 |
|1       |   A    | bb | d2 |
|1       |   B    | cc | d3 |
|1       |   C    | aa | d1 |
|1       |   C    | bb | d2 |
|2       |   A    | aa | d3 |
|2       |   A    | bb | d4 |
|2       |   C    | bb | d4 |
|2       |   C    | ee | d5 |
|3       |   D    | ee | d6 |

我想在表格外创建两个地图:

I want to create two maps out of the table:

第一. map_customer_room_date :将对客户进行分组,并收集所有不同的房间()并带有日期(值)).

1st. map_customer_room_date: will group by customer and collect all different rooms (key) and with date (value).

我正在使用 collect() UDF Brickhouse > 功能.

I'm using the collect() UDF Brickhouse function.

可以使用类似于以下内容的文件对其进行存档:

select customer, collect(room,date) as map_customer_room_date
from table
group by customer

第二名. map_category_room_date 稍微复杂一点,还包含相同的地图类型 collect(room,date),它将包含键,所有类别中的所有房间,其中客户X是类别.这意味着对于 customer1 ,即使它属于 customer2 ,也会占用 ee 空间.这是因为customer1的类别为 C ,并且此类别也存在于客户2中.

2nd. map_category_room_date A bit more complicated, consists also of the same map type collect(room, date) and it will contain as keys, all the rooms across ALL categories where for customer X is categories. This means that for customer1 it will take room ee even though it belongs to customer2. This is because customer1 has category C and this category is also present in customer 2.

最终表是按客户分组的,外观如下:

The final table is grouped by customer and will look like:

|customer| map_customer_room_date  |     map_category_room_date    |
-------------------------------------------------------------------|
|   1    |{aa: d1, bb: d2, cc: d3} |{aa: d1, bb: d2, cc: d3,ee: d6}|
|   2    |{aa: d3, bb: d4, ee: d6} |{aa: d3, bb: d4, ee: d6}       |
|   3    |{ee: d6}                 |{ee: d6}                       |  

我在构建第二张地图并展示所描述的决赛桌时遇到了问题.知道如何实现吗?

I am having issues building the second map and presenting the final table as described. Any idea how this can be accomplished?

推荐答案

在将结果合并为2个地图之前,可以使用一系列自连接来查找同一类别的其他房间,以实现此目的.

This can be accomplished using a series of self-joins to find other rooms in the same category before combining the results into 2 maps.

CREATE TABLE `table` AS
SELECT 1 AS customer, 'A' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'A' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 1 AS customer, 'B' AS category, 'cc' AS room, 'd3' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'aa' AS room, 'd3' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'ee' AS room, 'd5' AS `date` UNION ALL
SELECT 3 AS customer, 'D' AS category, 'ee' AS room, 'd6' AS `date`
;


SELECT
    customer_rooms.customer,
    collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
    collect(
        COALESCE(customer_category_rooms.room, category_rooms.room),
        COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
FROM `table` AS customer_rooms
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room
WHERE (
    customer_rooms.customer = customer_category_rooms.customer AND
    customer_rooms.category = customer_category_rooms.category AND
    customer_rooms.room = customer_category_rooms.room AND
    customer_rooms.date = customer_category_rooms.date
)
OR (
    customer_category_rooms.customer IS NULL AND
    customer_category_rooms.category IS NULL AND
    customer_category_rooms.room IS NULL AND
    customer_category_rooms.date IS NULL
)
GROUP BY
    customer_rooms.customer
;

结果集

1   {"aa":"d1","bb":"d2","cc":"d3"} {"aa":"d1","bb":"d2","cc":"d3","ee":"d5"}
2   {"aa":"d3","bb":"d4","ee":"d5"} {"aa":"d3","bb":"d4","ee":"d5"}
3   {"ee":"d6"} {"ee":"d6"}

说明

FROM `table` AS customer_rooms

首先,从最初的中提取结果.我们将此关系命名为 customer_rooms .正如您在问题中已经提到的那样,足够构建 map_customer_room_date .

First, results are drawn from the initial table. We name this relation customer_rooms. As you already noted in the question, this much is sufficient to build map_customer_room_date.

JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category

第一个自我联接标识与 customer_rooms 行中明确提及的房间具有相同类别的所有房间.我们将此关系命名为 category_rooms .

The first self-join identifies all rooms that have the same categories as the rooms explicitly mentioned in the customer_rooms rows. We name this relation category_rooms.

LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room

第二个自联接将获取我们在 category_rooms 中标识的房间,并尝试查找此房间是否已由 customer_rooms 中标识的客户持有.我们将此关系命名为 customer_category_rooms .这是一个 LEFT OUTER JOIN ,因为我们要保留先前联接中的所有行.结果将是1)来自客户已经拥有此房间的 customer_rooms customer_category_rooms 的值相同,或2)来自 customer_category_rooms 的值code>全部为 NULL ,因为客户没有持有此房间,但它是同一类别之一的房间.这种区别将变得很重要,这样,如果客户已经拥有房间,我们就可以保留他们的 date .

The second self-join takes the rooms that we identified in category_rooms and attempts to find if this room is already held by the customer identified in customer_rooms. We name this relation customer_category_rooms. This is a LEFT OUTER JOIN, because we want to preserve all rows from the prior joins. The outcome will either be 1) the values from customer_rooms and customer_category_rooms are identical, because the customer already holds this room, or 2) the values from customer_category_rooms will be all NULL, because the customer doesn't hold this room, but it's a room in one of the same categories. This distinction will become important so that we can preserve the date of the customer if they already hold the room.

下一步,我们需要过滤.

Next, we need to filter.

WHERE (
    customer_rooms.customer = customer_category_rooms.customer AND
    customer_rooms.category = customer_category_rooms.category AND
    customer_rooms.room = customer_category_rooms.room AND
    customer_rooms.date = customer_category_rooms.date
)

这包括客户在原始中明确拥有的房间.

This includes the rooms that are explicitly held by the customer in the original table.

OR (
    customer_category_rooms.customer IS NULL AND
    customer_category_rooms.category IS NULL AND
    customer_category_rooms.room IS NULL AND
    customer_category_rooms.date IS NULL
)

这包括不是由客户拥有的房间,而是与客户拥有的房间属于同一类别的房间.

This includes the rooms that are not held by the customer but belong in the same categories as rooms held by the customer.

    collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,

可以通过从表中收集原始数据来构建

map_customer_room_date ,我们将其别名为 customer_rooms .

map_customer_room_date can be built by collecting the original data from the table, which we had aliased as customer_rooms.

    collect(
        COALESCE(customer_category_rooms.room, category_rooms.room),
        COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date

构建 map_category_room_date 更为复杂.如果客户明确持有房间,那么我们要保留该日期.但是,如果客户没有明确持有房间,那么我们希望能够使用另一行中具有重叠类别的 room date .为此,我们使用Hive COALESCE 函数选择不是 NULL 的第一个值.如果客户已经拥有房间(在 customer_category_rooms 中的非 NULL 值中存在),那么我们将使用该房间.如果没有,那么我们将改用 category_rooms 中的值.

Building map_category_room_date is more complex. If the customer holds the room explicitly, then we want to preserve that date. However, if the customer does not hold the room explicitly, then we want to be able to use the room and date from another row that has an overlapping category. To accomplish that, we use the Hive COALESCE function to choose the first value that is not NULL. If the customer already holds the room (as present in non-NULL values in customer_category_rooms), then we'll use that. If not, then we'll use values from category_rooms instead.

请注意,如果相同的类别/房间组合可以映射到多个 date 值,仍然可能存在歧义.如果那很重要,那么您可能需要做更多的工作以根据某些业务规则(例如,使用最快的 date )选择正确的 date 或映射到多个 date 值,而不是单个值.如果有类似的其他要求,这应该为您提供一个良好的起点.

Note that there can still be some ambiguity if the same category/room combination can map to multiple date values. If that is significant, then you might need to put more work into this to choose the right date based on some business rule (e.g. use the soonest date) or map to multiple date values instead of a single value. This should give you a good starting point though if there are additional requirements like that.

这篇关于蜂巢-通过汇总组中的值来创建地图列类型的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆