Find sub-groups associated with a value and count consecutive sub-groups within each ID
Question
I am asking for your help in designing a very efficient solution to quickly go through a table of 14kk (14 million) rows.
Basically, the problem consists of finding, for each ID, the subgroups with Value == 0 and counting the number of consecutive subgroups (within each ID) with Value == 0 that follow each of them.
This new information needs to be saved in a separate table composed of "ID", "subgroup" and "count".
To be as clear as possible, here is an example.
Suppose we have the following data:
ID <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
subgroup <- c("1a1p", "1a2p", "1a3p", "2a1p", "2a2p", "2a3p", "2a4p", "2a5p", "2a6p", "3a1p", "3a2p", "3a3p", "3a4p", "3a5p")
Value <- c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0 , 0, 1750, 0)
df <- data.frame(ID, subgroup, Value)
For each ID, we need to find every subgroup that starts a run of Value == 0 and then count the number of consecutive subgroups with Value == 0 that follow it.
Therefore, the solution must look like:
ID <- c(1, 2, 3, 3)
subgroup <- c("1a2p", "2a1p", "3a1p", "3a5p")
count <- c(1, 3, 2, 0)
solution_df <- data.frame(ID, subgroup, count)
Note that subgroup == "3a5p" is present with count == 0, meaning that it has Value == 0 but there are 0 subgroups immediately following it with the same Value == 0.
I really hope I was as clear as possible.
Performance measured using system.time() on the previous version of the problem, which considered only the subgroup and Value features:
data.table approach
user: 881.21 system: 109.73 elapsed: 993.43
dplyr approach
user: 91.66 system: 0.56 elapsed: 93.05
base R approach
user: 1.67 system: 0.29 elapsed: 2.06
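For anyone reproducing numbers like these, timings in the user/system/elapsed form are what system.time() reports. A minimal sketch, in which the random vector x and the timed expression are only stand-ins (assumptions for illustration, not the actual 14kk-row table or the code that was benchmarked):

```r
set.seed(1)
# Stand-in data: one million values that are either 0 or 2000
x <- sample(c(0, 2000), 1e6, replace = TRUE)

# Wrap the expression to be measured in system.time()
timing <- system.time({
  r <- rle(x == 0)
  counts <- r$lengths[r$values] - 1
})

# The three figures usually quoted: user CPU, system CPU, wall-clock time (seconds)
timing[c("user.self", "sys.self", "elapsed")]
```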
Update
Performance on the updated task:
dplyr approach
user: 75.28 system: 1.00 elapsed: 77.16
base R approach
user: 104.75 system: 0.61 elapsed: 105.74
Recommended answer
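Here is an option using dplyr. The output shown is from the original version of the question, in which the ID column held the subgroup labels.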
library(dplyr)
df %>%
mutate(grp = c(TRUE, diff(Value==0)>0)) %>%
filter(Value ==0) %>%
group_by(grp = cumsum(grp)) %>%
summarise(ID = first(ID), count = n()-1) %>%
ungroup() %>%
select(-grp)
# A tibble: 4 x 2
# ID count
# <fctr> <dbl>
#1 1a2p 0
#2 2a2p 2
#3 3a1p 2
#4 3a5p 0
Or using rle from base R
data.frame(ID = with(df, ID[c(FALSE, diff(Value==0) > 0)]),
count = with(rle(df$Value==0), lengths[values]-1))
# ID count
#1 1a2p 0
#2 2a2p 2
#3 3a1p 2
#4 3a5p 0
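To see why `lengths[values] - 1` yields the counts, it can help to look at what rle() returns for a small logical vector. The sketch below uses a made-up toy vector x (an assumption for illustration, not the question's data):

```r
# Toy logical vector: TRUE marks a row where Value == 0
x <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

r <- rle(x)
r$values    # value of each run:  FALSE TRUE FALSE TRUE
r$lengths   # length of each run: 1 2 1 1

# Keep only the runs of TRUE, then subtract 1 so that the first row of
# each run is not counted as one of its own followers
zero_run_counts <- r$lengths[r$values] - 1
zero_run_counts   # 1 0
```

The first zero run has two rows, so one zero follows its first row; the second run is a single row, so nothing follows it.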
Update
With the updated question, we can group by ID as well:
df %>%
mutate(grp = c(TRUE, diff(Value==0)>0)) %>%
filter(Value == 0) %>%
group_by(ID, grp = cumsum(grp)) %>%
summarise(subgroup = first(subgroup), count = n()-1) %>%
ungroup() %>%
select(-grp)
# A tibble: 4 x 3
# ID subgroup count
# <dbl> <fctr> <dbl>
#1 1 1a2p 1
#2 2 2a1p 3
#3 3 3a1p 2
#4 3 3a5p 0
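The grp helper column can be inspected on its own in base R. The sketch below recomputes it on the question's Value vector; the names is_zero, starts and run_id are only illustrative and do not appear in the answer's code:

```r
Value <- c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0, 0, 1750, 0)

is_zero <- Value == 0
# diff() on a logical vector is 1 exactly where FALSE is followed by TRUE,
# that is, at the first row of each run of zeros
starts <- c(TRUE, diff(is_zero) > 0)
# The cumulative sum turns those start flags into a run identifier
run_id <- cumsum(starts)

# Run identifiers of the zero rows only (the rows that filter() keeps)
run_id[is_zero]   # 2 2 2 2 2 2 3 3 3 4
```

Run 2 covers six rows that span IDs 1 and 2, which is why the pipeline groups by ID together with grp: grouping by grp alone would merge the zeros of the two IDs into one run.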
Or using base R
res <- setNames(stack(with(df, tapply(Value == 0, ID, FUN =
function(x) with(rle(x), lengths[values]-1))))[2:1], c("ID", "count"))
i1 <- with(rle(df$Value == 0), rep(seq_along(values)*values, lengths))
res$subgroup <- df$subgroup[!duplicated(cbind(df['ID'], i1)) & i1 > 0]
res
# ID count subgroup
#1 1 1 1a2p
#2 2 3 2a1p
#3 3 2 3a1p
#4 3 0 3a5p
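As a cross-check, the same result can also be obtained by running rle() separately on each ID. This is only an alternative sketch, not the answer's method: count_zero_runs is a hypothetical helper name, and it has not been benchmarked on a table of the size described in the question.

```r
# Example data from the question
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  subgroup = c("1a1p", "1a2p", "1a3p", "2a1p", "2a2p", "2a3p", "2a4p",
               "2a5p", "2a6p", "3a1p", "3a2p", "3a3p", "3a4p", "3a5p"),
  Value = c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0, 0, 1750, 0)
)

# Hypothetical helper (not from the original answer)
count_zero_runs <- function(df) {
  pieces <- lapply(split(df, df$ID), function(d) {
    r <- rle(d$Value == 0)
    # Row index, within this ID, at which each run starts
    starts <- cumsum(r$lengths) - r$lengths + 1
    keep <- r$values                      # TRUE for runs of zeros
    data.frame(ID = rep(d$ID[1], sum(keep)),
               subgroup = d$subgroup[starts[keep]],
               count = r$lengths[keep] - 1)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

res2 <- count_zero_runs(df)
res2
```

Because each ID is processed separately, a run of zeros can never leak across an ID boundary, at the cost of one split() over the whole table.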