Better HiveQL syntax to explode a column of structs into a table with one column per struct member?


Problem description

I was looking for an argmax()-type function in HiveQL and found an almost undocumented feature in the Hive bug tracker (https://issues.apache.org/jira/browse/HIVE-1128) that does what I want: taking max() of a struct finds the maximum based on the first element and returns the whole struct. (Actually, maybe max() breaks ties by looking at subsequent elements? I don't know.)

Anyway, if I essentially want to select the whole row that contains the max value of some column, I can pack up the row into a struct with the comparison value first, and then extract the maximal struct back to reconstruct the best row. But the syntax is repetitive and ugly. Is there a better way to do it? (I guess a self-join is another option, but seems less elegant and I'd guess less efficient?)

Example table:

id,val,key
1,1,A
1,2,B
1,3,C
1,2,D
2,1,E
2,1,U
2,2,V
2,3,W
2,2,X
2,1,Y

HiveQL:

select 
  max(struct(val, key, id)).col3 as max_id,  -- for illustration, grouping on id anyway
  max(struct(val, key, id)).col1 as max_val,
  max(struct(val, key, id)).col2 as max_key
from test_argmax
group by id

Result:

max_id,max_val,max_key
1,3,C
2,3,W

Solution

One possibility is a nested query:

select
  best.col3 as id,   -- struct() names its fields col1, col2, ... by position
  best.col1 as val,
  best.col2 as key
from (
  select 
    max(struct(val, key, id)) as best 
  from test_argmax
  group by id
) t  -- Hive expects a name for a subquery in FROM

However, you don't seem to be able to select best.* (Hive thinks best is a table alias), so all the struct members still have to be listed explicitly. The inline() function, which explodes an array of structs into a table, does a lot of what is wanted, but not quite: I want to explode a column of structs into a table.
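
One way to bridge that gap, offered as an untested sketch: wrap each struct in a one-element array so that inline() can expand it through a lateral view. The aliases s and t are hypothetical names chosen for illustration, and the query assumes a Hive version that provides inline() and LATERAL VIEW.

```sql
-- Sketch only: assumes the test_argmax(id, val, key) table from the question.
-- s and t are illustrative aliases, not names from the original post.
select
  t.id,
  t.val,
  t.key
from (
  select
    max(struct(val, key, id)) as best   -- fields are col1, col2, col3 here
  from test_argmax
  group by id
) s
-- array(best) turns the single struct into a one-element array, and
-- inline() expands that array into one row with one column per member.
-- The AS clause names the columns in struct order: val, key, id.
lateral view inline(array(s.best)) t as val, key, id
```

The member names are written once, in the AS clause, instead of once per max(struct(...)).colN expression. Whether this is any more efficient than the nested query has not been measured here.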
