Hive查询生成与条件匹配的行序列的标识符 [英] Hive query generating identifiers for a sequence of row matching a condition
本文介绍了Hive查询生成与条件匹配的行序列的标识符的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
假设我有下面的hive表作为输入,我们称它为 connections :
Let's say I have the following hive table as input, let's call it connections
:
userid | timestamp
--------|-------------
1 | 1433258019
1 | 1433258020
2 | 1433258080
2 | 1433258083
2 | 1433258088
2 | 1433258170
[...] | [...]
使用以下查询:
SELECT
userid,
timestamp,
timestamp - LAG(timestamp, 1, 0) OVER w AS timediff
CASE
WHEN timediff > 60
THEN 'new_session'
ELSE 'same_session'
END AS session_state
FROM connections
WINDOW w PARTITION BY userid ORDER BY timestamp ASC;
我生成以下输出:
userid | timestamp | timediff | session_state
--------|-------------|------------|---------------
1 | 1433258019 | 1433258019 | new_session
1 | 1433258020 | 1 | same_session
2 | 1433258080 | 1433258080 | new_session
2 | 1433258083 | 3 | same_session
2 | 1433258088 | 5 | same_session
2 | 1433258170 | 82 | new_session
[...] | [...] | [...] | [...]
我会如何做到这一点:
userid | timestamp | timediff | sessionid
--------|-------------|------------------------------
1 | 1433258019 | 1433258019 | user1-session-1
1 | 1433258020 | 1 | user1-session-1
2 | 1433258080 | 1433258080 | user2-session-1
2 | 1433258083 | 3 | user2-session-1
2 | 1433258088 | 5 | user2-session-1
2 | 1433258170 | 82 | user2-session-2
[...] | [...] | [...] | [...]
仅仅使用HQL和着名UDF是可能的(我宁愿不使用自定义UDF或Reducer脚本)?
Is that possible using only HQL and "famous" UDFs (I'd rather not use custom UDFs or reducer scripts) ?
推荐答案
有趣的问题。根据您对@Madhu的评论,我在您的示例中添加了 2 1433258172
这一行。你需要的是每次增加 timediff> 60
得到满足。最简单的方法是标记它,然后在窗口中累计求和。
Interesting question. Per your comment to @Madhu, I added the line 2 1433258172
to your example. What you need is to increment every time timediff > 60
is satisfied. The easiest way to do this is to flag it and then cumulatively sum over the window.
查询:
Query:
select userid
, timestamp
, concat('user', userid, '-session-', s_sum) sessionid
from (
select *
, sum( counter ) over (partition by userid
order by timestamp asc
rows between unbounded preceding and current row) s_sum
from (
select *
, case when timediff > 60 then 1 else 0 end as counter
from (
select userid
, timestamp
, timestamp - lag(timestamp, 1, 0) over (partition by userid
order by timestamp asc) timediff
from connections ) x ) y ) z
输出:
Output:
1 1433258019 user1-session-1
1 1433258020 user1-session-1
2 1433258080 user2-session-1
2 1433258083 user2-session-1
2 1433258088 user2-session-1
2 1433258170 user2-session-2
2 1433258172 user2-session-2
这篇关于Hive查询生成与条件匹配的行序列的标识符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文