如何在 Presto 中并行执行多个数组 unnest() [英] How to perform multiple array unnest() in parallel in Presto

查看:278
本文介绍了如何在 Presto 中并行执行多个数组 unnest()的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下格式的表格

create table raw_data (

userid BIGINT,
city  VARCHAR,
campaign ARRAY <
       STRUCT <campaignid BIGINT,
               campaign_start_at TIMESTAMP,
               campaign_ends_at TIMESTAMP,
               parameters ARRAY<
                           STRUCT < goal VARCHAR,
                                    reward VARCHAR
                                  >
               campaignstatus ARRAY
                          STRUCT < seen BOOLEAN ,
                                   seen_at TIMESTAMP
                                   action VARCHAR,
                                   action_at TIMESTAMP
                                  >
                                >
                 >)

我希望最终的结果是这样的:

I want the final result to be like this:

userid|city|campaignid|campaign_start_at|campaign_ends_at|goal|reward|seen|seen_at|action|action_at

1 | Athens | 234   | 2019-03-19 12:00 |2019-03-19 14:00| 10| 2.7 | yes |2019-03-19 10:23|null|null
1 | Athens | 234   | 2019-03-19 12:00 |2019-03-19 14:00| 10| 2.7 | yes |2019-03-17 10:23|participate|2019-03-19 11:20
2 | Athens | 234   | 2019-03-19 12:00 |2019-03-19 14:00| 10| 2.7 | yes |2019-03-19 10:23|ignore|2019-03-19 10:10
3 | Athens | 234   | 2019-03-19 12:00 |2019-03-19 14:00| 10| 2.7 | null|null|null|null
3 | Athens | 234   | 2019-03-19 12:00 |2019-03-19 14:00| 10| 2.7 | yes |2019-03-19 12:23|blocked|2019-03-19 12:24

换句话说,我想取消嵌套数据并查找用户 ID 级别的信息.我尝试使用以下脚本取消嵌套表

In other words, I want to unnest the data and find info on userid level. I have tried to unnest the table using the following script

select * 
FROM raw_data 
LEFT JOIN UNNEST(campaign) as t(campaigns)

但它返回错误:表 hive.default.campaign 不存在

我的问题是:

是否可以在 presto 中并行取消嵌套多个数组?

Is it possible to unnest multiple arrays in parallel in presto?

  • 如果是,我该怎么做?
  • 如果不是,我应该遵循什么顺序来取消更高级别(用户 ID)中的列的嵌套,例如:由内而外,反之亦然?一个例子将不胜感激.

推荐答案

所以基本上我找到了一个解决方案,虽然很简单但很管用.

So basically I found a solution, rather simple but it works.

为了取消嵌套所有嵌套数组,您需要从外部数组到内部数组工作.对于这个例子

In order to unnest all the nested arrays you need to work from the outter array towards to the inner array. For this example

  • 首先根据用户 ID 取消嵌套 campaign 数组
  • 其次根据用户 ID 和活动 ID 取消嵌套 campaignstatus 数组
  • 第三次取消嵌套 parameters 数组.重要提示:parameters 数组可以作为对象(而不是数组)进行操作,因为所有数据都是字符串并且可以使用 json 函数访问.
  • first unnest the campaign array based on userid
  • secondly unnest the campaignstatus array base on userid and campaignid
  • thirdly unnest the parameters array. Important note: parameters array may be manipulated as an object (not array) as all the data are strings and can be accessed with json functions.

更具体地说,查询将是这样的:

More specifically, the query will be like this:

select 
   a.userid
   ,a.city
   ,a.campaignid 
   ,a.campaign_start_at 
   ,a.campaign_ends_at TIMESTAMP
   ,cs.sseen
   ,cs.seen_at
   ,cs.action
   ,cs.action_at
   ,json_array_get(cast(parameters as json),0) goal
   ,json_array_get(cast(parameters as json),1) reward

from (
       select
       userid
      ,city
      ,campaignid 
      ,campaign_start_at 
      ,campaign_ends_at TIMESTAMP

      from raw_data
      cross join unnest(campaign) as c
   ) a

cross join unnest(campaignstatus) as cs

但是,我很想阅读更复杂的解决方案.

However, I would love to read more sophisticated solutions.

这篇关于如何在 Presto 中并行执行多个数组 unnest()的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆