如果满足多个条件,则具有重置选项的累积和 [英] Cumulative sum with reset option if multiple conditions are met
问题描述
如果满足多个条件,我将尝试使用reset选项进行累加。更具体地说,我想对按 id分组的变量
并在满足以下两个条件的情况下再次从0重新开始:金额
和 count
进行累加和金额
> = 10和 count
> =3。如果满足这些条件,我还想创建一个包含1的新列,否则满足0。
I am trying to make a cumulative sum with reset option if multiple conditions are met. More specifically, I want to cumulative sum the variables amount
and count
grouped by id
and reset/start from 0 again if these two conditions are met: amount
>= 10 and count
>= 3. I also would like to create a new column that contains 1 if these conditions are met and 0 otherwise.
数据样本:
df <- data.frame(
date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01")),
id = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"),
amount = c(1, 9, 5, 5, 6, 2, 10, 4, 8, 10, 6, 5, 5, 1, 6, 5, 5, 5),
count = c(0, 2, 5, 4, 5, 1, 0, 0, 0, 0, 2, 1, 1, 1, 1, 2, 1, 0)
)
所需的输出:
df <- data.frame(
date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01")),
id = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"),
amount = c(1, 9, 5, 5, 6, 2, 10, 4, 8, 10, 6, 5, 5, 1, 6, 5, 5, 5),
count = c(0, 2, 5, 4, 5, 1, 0, 0, 0, 0, 2, 1, 1, 1, 1, 2, 1, 0),
amount_cumsum = c(1, 10, 15, 5, 11, 2, 10, 14, 22, 32, 38, 43, 5, 6, 12, 5, 10, 5),
count_cumsum = c(0, 2, 7, 4, 9, 1, 0, 0, 0, 0, 2, 3, 1, 2, 3, 2, 3, 0),
condition_met = c(0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0)
)
如果可能的话,我想使用 dplyr
解决方案,但也欢迎使用其他方法。谢谢!
If possible, I would like a dplyr
solution, but alternatives are also welcome. Thanks!
更新:作者删除的答案几乎解决了问题:
Update: an answer that was deleted by the author almost got the problem solved:
df %>% group_by(id) %>%
mutate(
amount_cumsum = purrr::accumulate(.x = amount, .f = ~ if_else(condition = .x < 10, true = .x + .y, false = .y)),
count_cumsum = purrr::accumulate(.x = count, .f = ~ if_else(condition = .x < 3, true = .x + .y, false = .y)),
condition_met = as.integer(amount_cumsum >= 10 & count_cumsum >= 3)
)
或者,或者:
df %>% group_by(id) %>%
mutate(
amount_cumsum = purrr::accumulate(.x = amount, .f = ~ case_when(.x < 10 ~ .x + .y, TRUE ~ .y)),
count_cumsum = purrr::accumulate(.x = count, .f = ~ case_when(.x < 3 ~ .x + .y, TRUE ~ .y)),
condition_met = as.integer(amount_cumsum >= 10 & count_cumsum >= 3)
)
答案a如果一个变量满足条件,bove会重置累积总和,但是如果另一个条件满足,bove将重置总和。
The answer above resets the cumulative sum if the condition is met for one single variable, but not taking into account if the other condition was met.
推荐答案
使用base-R解决方案进行贡献:
Contributing with a base-R solution:
df$amount_cumsum <- 0
df$count_cumsum <- 0
df$condition_met <- 0
reset = F
for (i in 1:nrow(df)) {
if (i == 1 | reset) {
df$amount_cumsum[i] = df$amount[i]
df$count_cumsum[i] = df$count[i]
reset = F
} else if (df$id[i] != df$id[i-1]) {
df$amount_cumsum[i] = df$amount[i]
df$count_cumsum[i] = df$count[i]
reset = F
} else {
df$amount_cumsum[i] = df$amount_cumsum[i-1] + df$amount[i]
df$count_cumsum[i] = df$count_cumsum[i-1] + df$count[i]
}
if (df$amount_cumsum[i] >= 10 & df$count_cumsum[i] >= 3) {
df$condition_met[i] = 1
reset = T
}
}
我已经扩展了您的数据集,并根据您的解决方案。基准测试显示,Base-R解决方案比tidyverse解决方案快21倍!
I've expanded your dataset and benchmarked this code against your solution. Benchmark shows the Base-R solution 21 times faster than the tidyverse one!
library(tidyverse)
dates = seq(as.Date("2019-01-01"), as.Date("2020-03-04"), by="days")
df <- data.frame(
date = c(sample(dates, 300), sample(dates, 400), sample(dates, 350)),
id = c(rep("A", 300), rep("B", 400), rep("C", 350)),
amount = floor(runif(1050, 0, 15)),
count = floor(runif(1050, 0, 5)),
stringsAsFactors = F
)
rbenchmark::benchmark(
"Tidy Solution" = {
df_tidy <- df %>%
group_by(id) %>%
nest(data = c(amount, count)) %>%
mutate(
data_accumulate = purrr::accumulate(.x = data, .f = function(.x, .y) if (max(.x[1]) < 10 | max(.x[2]) < 3) .x + .y else .y)
) %>%
unnest(cols = c(data_accumulate)) %>%
rename(amount_cumsum = amount, count_cumsum = count) %>%
unnest(cols = c(data)) %>%
mutate(condition_met = case_when(
amount_cumsum >= 10 & count_cumsum >= 3 ~ 1,
TRUE ~ 0)
)
},
"Base-R Solution" = {
df_base <- df
df_base$amount_cumsum <- 0
df_base$count_cumsum <- 0
df_base$condition_met <- 0
reset = F # to reset the counters
for (i in 1:nrow(df_base)) {
if (i == 1 | reset) {
df_base$amount_cumsum[i] = df_base$amount[i]
df_base$count_cumsum[i] = df_base$count[i]
reset = F
} else if (df_base$id[i] != df_base$id[i-1]) {
df_base$amount_cumsum[i] = df_base$amount[i]
df_base$count_cumsum[i] = df_base$count[i]
reset = F
} else {
df_base$amount_cumsum[i] = df_base$amount_cumsum[i-1] + df_base$amount[i]
df_base$count_cumsum[i] = df_base$count_cumsum[i-1] + df_base$count[i]
}
if (df_base$amount_cumsum[i] >= 10 & df_base$count_cumsum[i] >= 3) {
df_base$condition_met[i] = 1
reset = T
}
}
},
replications = 100)
gc()
test replications elapsed relative user.self sys.self user.child sys.child
Base-R Solution 100 3.89 1.000 3.69 0.0 NA NA
Tidy Solution 100 84.00 21.594 78.65 0.2 NA NA
这篇关于如果满足多个条件,则具有重置选项的累积和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!