如何识别 Hive 中字符串列的重复出现? [英] How to identify repeated occurrences of a string column in Hive?

查看:56
本文介绍了如何识别 Hive 中字符串列的重复出现?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 Hive 中有这样的视图:

I have a view like this in Hive:

id        sequencenumber          appname
242539622              1          A
242539622              2          A
242539622              3          A
242539622              4          B
242539622              5          B
242539622              6          C
242539622              7          D
242539622              8          D
242539622              9          D
242539622             10          B
242539622             11          B
242539622             12          D
242539622             13          D
242539622             14          F 

我希望每个 ID 都有以下视图:

I'd like to have, per each id, the following view:

id        sequencenumber          appname    appname_c
242539622              1          A             A
242539622              2          A             A
242539622              3          A             A
242539622              4          B             B_1
242539622              5          B             B_1
242539622              6          C             C
242539622              7          D             D_1
242539622              8          D             D_1
242539622              9          D             D_1
242539622             10          B             B_2
242539622             11          B             B_2
242539622             12          D             D_2
242539622             13          D             D_2
242539622             14          F             F 

或者任何接近于此的东西,可以识别序列中给定事件的重复发生.

Or anything close to this, that can identify re-occurrence of a given event in the sequence.

我的最终目标是计算每组事件(或状态,如果您希望在马尔可夫建模的上下文中)中花费的时间,并考虑是否存在任何环回.例如,上例中 B_1 花费的时间与 B_2 非常相似.

My ultimate goal is to calculate time spent in each group of events (or state if you wish in the context of Markov modeling) taking into account if there is any loop-back. For example, time spent in B_1 in the above example can be very compared to B_2.

在 Hive 中搜索过窗口函数(link)但我认为他们不能像 R/Python 那样进行逐行比较.

Have searched window functions in Hive (link) but I think they cannot to conduct row-wise comparisons like R/Python does.

推荐答案

使用 Hive 窗口函数的解决方案.我用你的数据来测试它,删除 your_table CTE 并使用你的表.结果如预期.

Solution using Hive window functions. I used your data to test it, remove your_table CTE and use your table instead. The result is as expected.

with your_table as (--remove this CTE, use your table instead
select stack(14,
'242539622', 1,'A',
'242539622', 2,'A',
'242539622', 3,'A',
'242539622', 4,'B',
'242539622', 5,'B',
'242539622', 6,'C',
'242539622', 7,'D',
'242539622', 8,'D',
'242539622', 9,'D',
'242539622',10,'B',
'242539622',11,'B',
'242539622',12,'D',
'242539622',13,'D',
'242539622',14,'F'
) as (id,sequencenumber,appname)
) --remove this CTE, use your table instead

select id,sequencenumber,appname, 
       case when sum(new_grp_flag) over(partition by id, group_name) = 1 then appname --only one group of consequent runs exists (like A)
            else        
            nvl(concat(group_name, '_', 
                       sum(new_grp_flag) over(partition by id, group_name order by sequencenumber) --rolling sum of new_group_flag
                       ),appname) 
        end appname_c       
from
(       

select id,sequencenumber,appname,
       case when appname=prev_appname or appname=next_appname then appname end group_name, --identify group of the same app
       case when appname<>prev_appname or prev_appname is null then 1 end new_grp_flag     --one 1 per each group
from       
(
select id,sequencenumber,appname,
       lag(appname)  over(partition by id order by sequencenumber) prev_appname, --need these columns
       lead(appname) over(partition by id order by sequencenumber) next_appname  --to identify groups of records w same app
from your_table --replace with your table
)s
)s
order by id,sequencenumber
;

结果:

OK
id        sequencenumber     appname    appname_c
242539622       1       A       A
242539622       2       A       A
242539622       3       A       A
242539622       4       B       B_1
242539622       5       B       B_1
242539622       6       C       C
242539622       7       D       D_1
242539622       8       D       D_1
242539622       9       D       D_1
242539622       10      B       B_2
242539622       11      B       B_2
242539622       12      D       D_2
242539622       13      D       D_2
242539622       14      F       F
Time taken: 232.319 seconds, Fetched: 14 row(s)

这篇关于如何识别 Hive 中字符串列的重复出现?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆