Merging multiple arrays into a map
Question
I have some data (sample from full table) that looks like this:
| prov_id | hotel_id | m_id | apis_xml | company_id | yyyy_mm_dd |
|---------|----------|------|----------|------------|------------|
| 945 | 78888 | 3910 | [5] | 998 | 2020-05-20 |
| 1475 | 78888 | 6676 | [1,2,4] | 37 | 2020-05-20 |
| 1475 | 78888 | 6670 | [1,2,4] | 37 | 2020-05-20 |
| 945 | 78888 | 2617 | [5] | 998 | 2020-05-20 |
I want to find the lowest apis_xml value per hotel and have the associated prov_id set as the "primary_prov". I can do this by exploding the arrays, sorting the values in ascending order, and taking the first row. I've got a query to do this:
SELECT
yyyy_mm_dd,
hotel_id,
prov_id as primary_prov
FROM(
SELECT
yyyy_mm_dd,
hotel_id,
prov_id,
apis_xml,
ROW_NUMBER() OVER(PARTITION BY yyyy_mm_dd, hotel_id ORDER BY apis_xml) rn
FROM(
SELECT
t.yyyy_mm_dd,
t.hotel_id,
t.prov_id,
t.apis_xml,
CAST(e.apis_xml AS INT) AS api
FROM
my_table t
LATERAL VIEW EXPLODE(apis_xml) e AS apis_xml
)s
)s
WHERE rn=1
The above query works for getting the primary_prov which maps to the lowest apis_xml value. However, I want to have an extra column which contains the additional xml values for reference. Maybe a map is not the best use case here but it's what comes to mind. The key would be the prov_id while the value would be the apis_xml values. Output would look like this as I need to keep it to one row per hotel_id:
| hotel_id | primary_prov | detailed_prov             | yyyy_mm_dd |
|----------|--------------|---------------------------|------------|
| 78888    | 1475         | {1475: [1,2,4], 945: [5]} | 2020-05-20 |
Recommended answer
You can build the map using the Brickhouse CollectUDAF. Follow the Brickhouse instructions to build the JAR, then add the JAR, create the function, and use it:
add jar /path/to/jar/brickhouse-0.7.1.jar;
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
SELECT yyyy_mm_dd,
hotel_id,
primary_prov,
collect(prov_id, prov_id_api ) as detailed_prov
FROM
(
SELECT
yyyy_mm_dd,
hotel_id,
max(case when rn=1 then prov_id else null end) over(partition by yyyy_mm_dd, hotel_id) as primary_prov,
collect_set(api) over (partition by yyyy_mm_dd, hotel_id, prov_id) prov_id_api, --re-assemble the array so it includes all elements from multiple initial arrays when a prov_id has different arrays
prov_id
FROM(
SELECT
yyyy_mm_dd,
hotel_id,
prov_id,
apis_xml,
api,
ROW_NUMBER() OVER(PARTITION BY yyyy_mm_dd, hotel_id ORDER BY api) rn
FROM(
SELECT
t.yyyy_mm_dd,
t.hotel_id,
t.prov_id,
t.apis_xml,
CAST(e.apis_xml_element AS INT) AS api
FROM
my_table t
LATERAL VIEW EXPLODE(apis_xml) e AS apis_xml_element
)s
)s
)s
group by yyyy_mm_dd,
hotel_id,
primary_prov
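
To sanity-check what this query should return, here is a minimal plain-Python sketch of the same logic, run on the four sample rows from the question. It is only an illustration, not Hive code: the function name `summarize` and the output dictionary layout are assumptions made for this example. Ties on the lowest value are broken by encounter order here, whereas ROW_NUMBER() in Hive may pick either row, and the values are sorted for readability even though collect_set does not guarantee an order.

```python
from collections import defaultdict

# Sample rows from the question:
# (prov_id, hotel_id, m_id, apis_xml, company_id, yyyy_mm_dd)
rows = [
    (945, 78888, 3910, [5], 998, "2020-05-20"),
    (1475, 78888, 6676, [1, 2, 4], 37, "2020-05-20"),
    (1475, 78888, 6670, [1, 2, 4], 37, "2020-05-20"),
    (945, 78888, 2617, [5], 998, "2020-05-20"),
]


def summarize(rows):
    """Hypothetical helper: one output row per (yyyy_mm_dd, hotel_id)."""
    # (yyyy_mm_dd, hotel_id) -> {prov_id: set of api values}
    per_hotel = defaultdict(lambda: defaultdict(set))
    for prov_id, hotel_id, _m_id, apis_xml, _company_id, day in rows:
        if not apis_xml:
            # LATERAL VIEW EXPLODE drops rows whose array is empty
            continue
        # "explode" the array and de-duplicate per prov_id (like collect_set)
        per_hotel[(day, hotel_id)][prov_id].update(apis_xml)

    result = []
    for (day, hotel_id), provs in sorted(per_hotel.items()):
        # primary_prov: the provider that owns the smallest api value (rn = 1)
        primary_prov = min(provs, key=lambda p: min(provs[p]))
        # detailed_prov: map of prov_id -> its api values
        detailed_prov = {p: sorted(values) for p, values in provs.items()}
        result.append({
            "yyyy_mm_dd": day,
            "hotel_id": hotel_id,
            "primary_prov": primary_prov,
            "detailed_prov": detailed_prov,
        })
    return result


for row in summarize(rows):
    print(row)
```

For the sample data this produces a single row for hotel 78888 with primary_prov 1475 and a detailed_prov map holding [1, 2, 4] for 1475 and [5] for 945, which matches the desired output in the question.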