Hive Data根据时间戳选择最新值 [英] Hive Data selecting latest value based on timestamp

查看:700
本文介绍了Hive Data根据时间戳选择最新值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

  C1,C2,Process TimeStamp,InsertDateTimeStamp 
p1, v1,2014-01-30 12:15:23,2013-10-01 05:34:23
p1,v2,2014-01-31 05:11:34,2013-12-01 06:12 :31
p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20
p2,v4,2014-02-01 09:22:52, 2013-12-01 06:12:31
p2,v5,2014-02-01 09:22:52,2012-09-01 07:45:20

现在,我想根据最新的 Process TimeStamp 获取每个主键的唯一行。



如果 Process TimeStamp 相同,那么最后一行 InsertDateTimeStamp 应该选择。



所以,我的结果应该是。

  p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20 
p2,v4,2014-02-01 09:22:52,2013-12-01 06 :12:31

如何通过HiveQL实现此目的?



我目前使用的是hive 0.10。我不能在IN或EXISTS中使用子查询。



谢谢。

解决方案

  select C1,s.C2,s.ProcessTimeStamp,s.InsertDateTimeStamp from(
select C1,max(named_struct('unixtime',unix_timestamp(ProcessTimeStamp,'yyyy-MM -dd HH:mm:ss'),'C2',C2,'ProcessTimeStamp',ProcessTimeStamp,'InsertDateTimeStamp',InsertDateTimeStamp))作为s
from my_table group by C1
)t;

做一个结构的最大值比较第一个字段,然后第二个字段等。你把所有的东西结构在一起,首先分析时间戳值,你会得到一个表示该行的结构。然后,通过选择单个字段来解除结构化。


I have a table having the following columns.

C1,C2,Process TimeStamp,InsertDateTimeStamp
p1,v1,2014-01-30 12:15:23,2013-10-01 05:34:23 
p1,v2,2014-01-31 05:11:34,2013-12-01 06:12:31 
p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20 
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31 
p2,v5,2014-02-01 09:22:52,2012-09-01 07:45:20

Now, I want to fetch unique row for each primary key based on latest Process TimeStamp.

If Process TimeStamp is same then row having latest InsertDateTimeStamp should be chosen.

So, my result should be.

p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20 
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31

How to achieve this via HiveQL ?

I am currently using hive 0.10. I can not use subquery with IN or EXISTS.

Thanks.

解决方案

select C1, s.C2, s.ProcessTimeStamp, s.InsertDateTimeStamp from (
  select C1, max(named_struct('unixtime', unix_timestamp(ProcessTimeStamp, 'yyyy-MM-dd HH:mm:ss'), 'C2', C2, 'ProcessTimeStamp', ProcessTimeStamp, 'InsertDateTimeStamp', InsertDateTimeStamp)) as s
  from my_table group by C1
) t;

Doing the max of a struct compares by the first field, then the second field, etc. So if you struct everything together, with the parsed timestamp value first, you get a struct representing that row. Then just un-struct it by selecting out the individual fields.

这篇关于Hive Data根据时间戳选择最新值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆