配置单元sql找到最新的记录 [英] hive sql find the latest record

查看:210
本文介绍了配置单元sql找到最新的记录的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述



  create table test(
id string,
name string,
年龄字符串,
修改过的字符串)

这样的数据:

  id名称年龄修改
1 a 10 2011-11-11 11:11:11
1 a 11 2012- 11-11 12:00:00
2 b 20 2012-12-10 10:11:12
2 b 20 2012-12-10 10:11:12
2 b 20 2012 -12-12 10:11:12
2 b 20 2012-12-15 10:11:12

我想通过id获得最新的记录(包括每个colums id,name,age,modifed),如上面的数据,正确的结果是:

  1 a 11 2012-11-11 12:00:00 
2 b 20 2012-12-15 10:11:12



我喜欢这样:

 插入覆盖表t 
选择b.id,b.name,b.age,b.modified
from(
select id,max(modified)as modified
from test
group by id $ b $ a)
离开外部连接测试b(a.id = b.id和a.modified = b.modified);

这个sql可以得到正确的结果,但是当海量数据时,它运行缓慢。



**有没有办法做到这一点,没有左外部连接? **

解决方案

Hive SQL几乎没有文档记录(我发现它在其中一个Jira bug报告中),它可以让你使用struct()来做类似argmax()的事情。例如,如果您有一个如下表格:

  test_argmax 
id,val,key
1,1 ,A
1,2,B
1,3,C
1,2,D
2,1,E
2,1,U
2,2,V
2,3,W
2,2,X
2,1,Y

您可以这样做:

  select 
max(struct( val,key,id))。col1 as max_val,
max(struct(val,key,id))。col2 as max_key,
max(struct(val,key,id))。col3 as max_id
from test_argmax
by id

并得到结果:

  max_val,max_key,max_id 
3,C,1
3,W,2

我认为在绑定val(第一个struct元素)的情况下,它会回退到第二列的比较。我还没有弄清楚是否有一个更简洁的语法来获取单个列返回结果结构,可能使用named_struct?$ /

the table is:

create table test (
id string,
name string,
age string,
modified string)

data like this:

id    name   age  modifed
1     a      10   2011-11-11 11:11:11
1     a      11   2012-11-11 12:00:00
2     b      20   2012-12-10 10:11:12
2     b      20   2012-12-10 10:11:12
2     b      20   2012-12-12 10:11:12
2     b      20   2012-12-15 10:11:12

I want to get the latest record(include every colums id,name,age,modifed) group by id,as the data above,the correct result is:

1     a      11   2012-11-11 12:00:00
2     b      20   2012-12-15 10:11:12

I do like this:

insert overwrite table t 
select b.id, b.name, b.age, b.modified 
from (
        select id,max(modified) as modified 
        from test 
        group by id
) a 
left outer join test b on (a.id=b.id  and a.modified=b.modified);

This sql can get the right result,but when mass data,it runs slow.

**Is there any way to do this without left outer join? **

解决方案

There's a nearly undocumented feature of Hive SQL (I found it in one of their Jira bug reports) that lets you do something like argmax() using struct()s. For example if you have a table like:

test_argmax
id,val,key
1,1,A
1,2,B
1,3,C
1,2,D
2,1,E
2,1,U
2,2,V
2,3,W
2,2,X
2,1,Y

You can do this:

select 
  max(struct(val, key, id)).col1 as max_val,
  max(struct(val, key, id)).col2 as max_key,
  max(struct(val, key, id)).col3 as max_id
from test_argmax
group by id

and get the result:

max_val,max_key,max_id
3,C,1
3,W,2

I think in case of ties on val (the first struct element) it will fall back to comparison on the second column. I also haven't figured out whether there's a neater syntax for getting the individual columns back out of the resulting struct, maybe using named_struct somehow?

这篇关于配置单元sql找到最新的记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆