Find sub-groups associated with a value and count consecutive sub-groups within each ID

Views: 77 | Posted: 2020/10/15 20:39:35 | Tags: r, performance, data.table


Problem description

I am asking for your help in designing a very efficient solution to quickly go through a table of 14 million (14kk) rows.

Basically, the problem consists of finding, for each ID, the subgroups with Value == 0 and counting the number of consecutive subgroups (within the same ID) with Value == 0 that immediately follow each of them.

This new information needs to be saved in an external table composed of "ID", "subgroup" and "Count".

To be as clear as possible, I will present an example.
Suppose we have the following data:

ID <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
subgroup <- c("1a1p", "1a2p", "1a3p", "2a1p", "2a2p", "2a3p", "2a4p", "2a5p", "2a6p", "3a1p", "3a2p", "3a3p", "3a4p", "3a5p")
Value <- c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0 , 0, 1750, 0)

df <- data.frame(ID, subgroup, Value)

For each ID, we need to find every subgroup that starts a run of Value == 0 and then count the number of consecutive subgroups with Value == 0 that follow it within the same ID.
Therefore, the solution must look like:

ID <- c(1, 2, 3, 3)
subgroup <- c("1a2p", "2a1p", "3a1p", "3a5p")
count <- c(1, 3, 2, 0)
solution_df <- data.frame(ID, subgroup, count)

Note that subgroup == "3a5p" is present with count == 0, meaning that it has Value == 0 but there are 0 subgroups following it with Value == 0.
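The counting rule can be illustrated on a single ID with base R's `rle()`. This is only a minimal sketch: the vector `value_id3` is a hypothetical name holding the five values of ID 3 from the example, copied by hand.

```r
# Values of ID 3 from the example (hypothetical helper vector)
value_id3 <- c(0, 0, 0, 1750, 0)

# Run-length encode the zero / non-zero pattern
runs <- rle(value_id3 == 0)

# For each run of zeros: the run length minus its first element
counts <- runs$lengths[runs$values] - 1
counts
```

The first run of zeros has three rows, so two rows follow its first subgroup; the last run is a single row, so nothing follows it.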

I really hope this is as clear as possible.

Performance measured using system.time() on the previous version of the problem, which considered only the subgroup and Value columns:

data.table approach

user: 881.21  system: 109.73  elapsed: 993.43

dplyr approach

user: 91.66  system: 0.56  elapsed: 93.05

base R approach

user: 1.67 system: 0.29  elapsed: 2.06

Update

Performance on the updated task:

dplyr approach

user: 75.28  system: 1.00  elapsed: 77.16

base R approach

user: 104.75 system: 0.61  elapsed: 105.74
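Timings in the user / system / elapsed form shown above are what `system.time()` reports. The following is only a sketch of how such a measurement could be taken. The synthetic table `big`, its size `n` and the helper `count_runs()` are assumptions made for illustration and are not the data or code behind the numbers quoted above.

```r
set.seed(1)
n <- 1e5  # hypothetical size, far smaller than the real 14-million-row table

# Synthetic stand-in for the real table: 10 rows per ID, Value is 0 or 2000
big <- data.frame(
  ID    = rep(seq_len(n / 10), each = 10),
  Value = sample(c(0, 2000), n, replace = TRUE)
)

# Hypothetical helper: per ID, the length of each run of zeros minus 1
count_runs <- function(d) {
  tapply(d$Value == 0, d$ID,
         function(x) with(rle(x), lengths[values] - 1),
         simplify = FALSE)
}

# system.time() returns user, system and elapsed times in seconds
timing <- system.time(res_big <- count_runs(big))
timing[c("user.self", "sys.self", "elapsed")]
```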

Recommended answer

Update

With the updated question, we can group by ID and by each run of consecutive zeros:

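library(dplyr)

df %>%
    # flag the rows where Value switches from non-zero to zero (start of a run)
    mutate(grp = c(TRUE, diff(Value == 0) > 0)) %>%
    filter(Value == 0) %>%
    # a run is identified by its ID plus the running count of run starts
    group_by(ID, grp = cumsum(grp)) %>%
    # first subgroup of the run, and how many zero rows follow it
    summarise(subgroup = first(subgroup), count = n() - 1) %>%
    ungroup() %>%
    select(-grp)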
# A tibble: 4 x 3
#    ID subgroup count
#  <dbl>   <fctr> <dbl>
#1     1     1a2p     1
#2     2     2a1p     3
#3     3     3a1p     2
#4     3     3a5p     0

Or using base R:

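# per ID: length of each run of zeros, minus 1 for the first row of the run
res <- setNames(stack(with(df, tapply(Value == 0, ID, FUN = 
   function(x) with(rle(x), lengths[values]-1))))[2:1], c("ID", "count"))
# run index over the whole table: 0 for non-zero rows, a positive id for zero runs
i1 <- with(rle(df$Value == 0), rep(seq_along(values)*values, lengths))

# keep the first row of each (ID, run) pair that belongs to a run of zeros
res$subgroup <- df$subgroup[!duplicated(cbind(df['ID'], i1)) & i1 > 0]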
res
#   ID count subgroup
#1  1     1     1a2p
#2  2     3     2a1p
#3  3     2     3a1p
#4  3     0     3a5p
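Since the question is also tagged data.table, the same grouping idea can be sketched with `data.table::rleid()`, which assigns a new run id whenever any of its arguments changes. This is an illustrative variant rather than part of the original answer, and it has not been benchmarked against the timings in the question. The names `dt`, `run` and `res_dt` are chosen here for the example.

```r
library(data.table)

dt <- data.table(
  ID = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  subgroup = c("1a1p", "1a2p", "1a3p", "2a1p", "2a2p", "2a3p", "2a4p",
               "2a5p", "2a6p", "3a1p", "3a2p", "3a3p", "3a4p", "3a5p"),
  Value = c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0, 0, 1750, 0)
)

# New run id whenever the ID or the zero / non-zero state changes
dt[, run := rleid(ID, Value == 0)]

# For each run of zeros: its first subgroup and the number of rows after it
res_dt <- dt[Value == 0,
             .(subgroup = subgroup[1L], count = .N - 1L),
             by = .(ID, run)][, .(ID, subgroup, count)]
res_dt
```

Passing both `ID` and `Value == 0` to `rleid()` means a run of zeros that continues across an ID boundary is split at that boundary, which matches the per-ID requirement in the question.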
