Hive Data根据时间戳选择最新值 [英] Hive Data selecting latest value based on timestamp
问题描述
C1,C2,Process TimeStamp,InsertDateTimeStamp
p1, v1,2014-01-30 12:15:23,2013-10-01 05:34:23
p1,v2,2014-01-31 05:11:34,2013-12-01 06:12 :31
p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20
p2,v4,2014-02-01 09:22:52, 2013-12-01 06:12:31
p2,v5,2014-02-01 09:22:52,2012-09-01 07:45:20
现在,我想根据最新的 Process TimeStamp
获取每个主键的唯一行。
如果 Process TimeStamp
相同,那么最后一行 InsertDateTimeStamp
应该选择。
所以,我的结果应该是。
p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20
p2,v4,2014-02-01 09:22:52,2013-12-01 06 :12:31
如何通过HiveQL实现此目的?
我目前使用的是hive 0.10。我不能在IN或EXISTS中使用子查询。
谢谢。
select C1,s.C2,s.ProcessTimeStamp,s.InsertDateTimeStamp from(
select C1,max(named_struct('unixtime',unix_timestamp(ProcessTimeStamp,'yyyy-MM -dd HH:mm:ss'),'C2',C2,'ProcessTimeStamp',ProcessTimeStamp,'InsertDateTimeStamp',InsertDateTimeStamp))作为s
from my_table group by C1
)t;
做一个结构的最大值比较第一个字段,然后第二个字段等。你把所有的东西结构在一起,首先分析时间戳值,你会得到一个表示该行的结构。然后,通过选择单个字段来解除结构化。
I have a table having the following columns.
C1,C2,Process TimeStamp,InsertDateTimeStamp
p1,v1,2014-01-30 12:15:23,2013-10-01 05:34:23
p1,v2,2014-01-31 05:11:34,2013-12-01 06:12:31
p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31
p2,v5,2014-02-01 09:22:52,2012-09-01 07:45:20
Now, I want to fetch unique row for each primary key based on latest Process TimeStamp
.
If Process TimeStamp
is same then row having latest InsertDateTimeStamp
should be chosen.
So, my result should be.
p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31
How to achieve this via HiveQL ?
I am currently using hive 0.10. I can not use subquery with IN or EXISTS.
Thanks.
select C1, s.C2, s.ProcessTimeStamp, s.InsertDateTimeStamp from (
select C1, max(named_struct('unixtime', unix_timestamp(ProcessTimeStamp, 'yyyy-MM-dd HH:mm:ss'), 'C2', C2, 'ProcessTimeStamp', ProcessTimeStamp, 'InsertDateTimeStamp', InsertDateTimeStamp)) as s
from my_table group by C1
) t;
Doing the max of a struct compares by the first field, then the second field, etc. So if you struct everything together, with the parsed timestamp value first, you get a struct representing that row. Then just un-struct it by selecting out the individual fields.
这篇关于Hive Data根据时间戳选择最新值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!