Hive查询生成与条件匹配的行序列的标识符 [英] Hive query generating identifiers for a sequence of row matching a condition

查看:151
本文介绍了Hive查询生成与条件匹配的行序列的标识符的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我有下面的hive表作为输入,我们称它为 connections :

Let's say I have the following hive table as input, let's call it connections:

userid  | timestamp   
--------|-------------
1       | 1433258019  
1       | 1433258020
2       | 1433258080
2       | 1433258083
2       | 1433258088
2       | 1433258170
[...]   | [...]

使用以下查询:

SELECT
    userid,
    timestamp,
    timestamp - LAG(timestamp, 1, 0) OVER w AS timediff
    CASE
      WHEN timediff > 60
      THEN 'new_session'
      ELSE 'same_session'
    END AS session_state
FROM connections
WINDOW w PARTITION BY userid ORDER BY timestamp ASC;

我生成以下输出:

userid  | timestamp   | timediff   | session_state
--------|-------------|------------|---------------
1       | 1433258019  | 1433258019 | new_session
1       | 1433258020  | 1          | same_session
2       | 1433258080  | 1433258080 | new_session
2       | 1433258083  | 3          | same_session
2       | 1433258088  | 5          | same_session
2       | 1433258170  | 82         | new_session
[...]   | [...]       | [...]      | [...]

我会如何做到这一点:

userid  | timestamp   | timediff   | sessionid
--------|-------------|------------------------------
1       | 1433258019  | 1433258019 | user1-session-1
1       | 1433258020  | 1          | user1-session-1
2       | 1433258080  | 1433258080 | user2-session-1
2       | 1433258083  | 3          | user2-session-1
2       | 1433258088  | 5          | user2-session-1
2       | 1433258170  | 82         | user2-session-2
[...]   | [...]       | [...]      | [...]

仅仅使用HQL和着名UDF是可能的(我宁愿不使用自定义UDF或Reducer脚本)?

Is that possible using only HQL and "famous" UDFs (I'd rather not use custom UDFs or reducer scripts) ?

推荐答案

有趣的问题。根据您对@Madhu的评论,我在您的示例中添加了 2 1433258172 这一行。你需要的是每次增加 timediff> 60 得到满足。最简单的方法是标记它,然后在窗口中累计求和。

Interesting question. Per your comment to @Madhu, I added the line 2 1433258172 to your example. What you need is to increment every time timediff > 60 is satisfied. The easiest way to do this is to flag it and then cumulatively sum over the window.

查询

Query:

select userid
  , timestamp
  , concat('user', userid, '-session-', s_sum) sessionid
from (
  select *
    , sum( counter ) over (partition by userid
                           order by timestamp asc
                           rows between unbounded preceding and current row) s_sum
  from (
    select *
      , case when timediff > 60 then 1 else 0 end as counter
    from (
      select userid
        , timestamp
        , timestamp - lag(timestamp, 1, 0) over (partition by userid
                                                 order by timestamp asc) timediff
      from connections ) x ) y ) z

输出:

Output:

1   1433258019  user1-session-1
1   1433258020  user1-session-1
2   1433258080  user2-session-1
2   1433258083  user2-session-1
2   1433258088  user2-session-1
2   1433258170  user2-session-2
2   1433258172  user2-session-2

这篇关于Hive查询生成与条件匹配的行序列的标识符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆