Find sub-groups associated with a value and count consecutive sub-groups within each ID


Problem description

I am asking for your help in designing a very efficient solution to go quickly through a table of 14 million rows (14kk).

Basically, the problem consists of finding, for each ID, the subgroups with Value == 0 and counting the number of consecutive subgroups (within that ID) with Value == 0 that follow each of them.

This new information needs to be saved in a separate table with the columns "ID", "subgroup" and "count".

To be as clear as possible, here is an example.
Suppose we have the following data:

ID <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
subgroup <- c("1a1p", "1a2p", "1a3p", "2a1p", "2a2p", "2a3p", "2a4p", "2a5p", "2a6p", "3a1p", "3a2p", "3a3p", "3a4p", "3a5p")
Value <- c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0 , 0, 1750, 0)

df <- data.frame(ID, subgroup, Value)

For each ID, we need to find every subgroup at which a run of Value == 0 begins, and then count the number of consecutive subgroups with Value == 0 that follow it within the same ID.
Therefore, the solution must look like:

ID <- c(1, 2, 3, 3)    
subgroup <- c("1a2p", "2a1p", "3a1p", "3a5p")
count <- c(1, 3, 2, 0)
solution_df <- data.frame(ID, subgroup, count)

Note that subgroup == "3a5p" is present with count == 0, meaning that it has Value == 0 but there are 0 subgroups immediately after it with the same Value == 0.

I hope this is as clear as possible.

Performance measured using system.time() on the previous version of the problem, which considered only the subgroup and Value columns:

data.table approach

user: 881.21  system: 109.73  elapsed: 993.43

dplyr approach

user: 91.66  system: 0.56  elapsed: 93.05

base R approach

user: 1.67  system: 0.29  elapsed: 2.06
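Timings in this user/system/elapsed form are what base R's system.time() reports. The following is a minimal sketch of how such a measurement might be taken. The function `count_runs()` and the synthetic data are hypothetical stand-ins, not the code or table that produced the numbers above.

```r
# Hypothetical stand-in for whichever approach is being timed.
count_runs <- function(d) {
  r <- rle(d$Value == 0)
  r$lengths[r$values] - 1
}

# Small synthetic input; the real table is reported to have about 14 million rows.
set.seed(1)
d <- data.frame(Value = sample(c(0, 2000), 1e5, replace = TRUE))

# system.time() evaluates the expression once and returns a proc_time object.
timing <- system.time(res <- count_runs(d))
timing[c("user.self", "sys.self", "elapsed")]
```

A single run can be noisy, so repeating the measurement a few times gives a more reliable comparison between approaches.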

        




Update

Performance on the updated task:

dplyr approach

user: 75.28  system: 1.00  elapsed: 77.16

base R approach

user: 104.75  system: 0.61  elapsed: 105.74


Recommended answer

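Here is an option using dplyr. Judging by the output, it targets the earlier version of the question, in which the ID column held the subgroup labels: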

library(dplyr)
df %>%
    mutate(grp = c(TRUE, diff(Value==0)>0)) %>% 
    filter(Value ==0) %>%
    group_by(grp = cumsum(grp)) %>%
    summarise(ID = first(ID), count = n()-1) %>%
    ungroup() %>% 
    select(-grp) 
# A tibble: 4 x 2
#    ID count
#  <fctr> <dbl>
#1   1a2p     0
#2   2a2p     2
#3   3a1p     2
#4   3a5p     0






Or using rle from base R:

data.frame(ID = with(df, ID[c(FALSE, diff(Value==0) > 0)]),
                 count = with(rle(df$Value==0), lengths[values]-1))
#     ID count
#1 1a2p     0
#2 2a2p     2
#3 3a1p     2
#4 3a5p     0
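To see why the rle version works, it helps to look at what rle returns for a logical vector. The sketch below uses a short made-up vector `v`, which is an assumption for illustration and not the original data.

```r
v <- c(2000, 0, 0, 1800, 0)   # illustrative values only

# rle() encodes the vector as runs: a value and how many times it repeats.
r <- rle(v == 0)
r$values    # FALSE  TRUE FALSE  TRUE
r$lengths   # 1 2 1 1

# Keep only the runs of zeros; subtracting 1 leaves the number of
# zeros that follow the first zero of each run.
counts <- r$lengths[r$values] - 1
counts      # 1 0
```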



Update

With the updated question, we can also group by ID:

df %>% 
    mutate(grp = c(TRUE, diff(Value==0)>0)) %>%
    filter(Value == 0) %>%
    group_by(ID, grp = cumsum(grp)) %>%
    summarise(subgroup = first(subgroup), count = n()-1) %>% 
    ungroup() %>% 
    select(-grp)
# A tibble: 4 x 3
#    ID subgroup count
#  <dbl>   <fctr> <dbl>
#1     1     1a2p     1
#2     2     2a1p     3
#3     3     3a1p     2
#4     3     3a5p     0
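The pipeline rests on two steps: flagging the rows where a run of zeros starts, then turning those flags into a run id with cumsum. Here is a minimal base R sketch of those two steps on the Value column from the question. The names `is_zero`, `starts` and `run_id` are illustrative only.

```r
Value <- c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0, 0, 1750, 0)

is_zero <- Value == 0
# TRUE on the first row and wherever a zero follows a non-zero value.
starts <- c(TRUE, diff(is_zero) > 0)
# Mirror the pipeline: keep the zero rows, then number the runs.
run_id <- cumsum(starts[is_zero])
run_id   # 1 1 1 1 1 1 2 2 2 3
```

The first six zero rows share run id 1 even though they span IDs 1 and 2, which is why the pipeline groups by ID as well as by grp.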

Or using base R:

res <- setNames(stack(with(df, tapply(Value == 0, ID, FUN = 
   function(x) with(rle(x), lengths[values]-1))))[2:1], c("ID", "count"))
i1 <- with(rle(df$Value == 0), rep(seq_along(values)*values, lengths))

res$subgroup <- df$subgroup[!duplicated(cbind(df['ID'], i1)) & i1 > 0]
res
#   ID count subgroup
#1  1     1     1a2p
#2  2     3     2a1p
#3  3     2     3a1p
#4  3     0     3a5p
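As a cross-check, the same table can be reproduced with a single run id that restarts whenever either ID or the zero/non-zero state changes. This is a sketch rather than the answer's method. `zero_runs()` and its internal names are hypothetical, and it assumes at least one row, with the rows of each ID stored contiguously and in order.

```r
zero_runs <- function(df) {
  is_zero <- df$Value == 0
  n <- nrow(df)
  # A new run starts on row 1 and whenever ID or the zero state changes.
  change <- c(TRUE, df$ID[-1] != df$ID[-n] | is_zero[-1] != is_zero[-n])
  run <- cumsum(change)
  # First row of each run of zeros.
  first <- change & is_zero
  # Number of zero rows in each run (0 for non-zero runs).
  sizes <- tabulate(run[is_zero], nbins = max(run))
  data.frame(ID = df$ID[first],
             subgroup = df$subgroup[first],
             count = sizes[run[first]] - 1)
}

df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  subgroup = c("1a1p", "1a2p", "1a3p", "2a1p", "2a2p", "2a3p", "2a4p",
               "2a5p", "2a6p", "3a1p", "3a2p", "3a3p", "3a4p", "3a5p"),
  Value = c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0, 0, 1750, 0)
)

out <- zero_runs(df)
out
```

Because the run id breaks on ID changes, a stretch of zeros that crosses from one ID into the next is split into separate runs without any explicit grouping step.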
