cumsum for unique value using dplyr mutate
Problem description
The dummy dataset is:
data <- data.frame(
id = c(1,1,2,2,3,4,5,6),
value = c(10,10,20,20,10,30,40,50),
other = c(1,2,3,4,5,6,7,8)
)
The data is the output of a group_by(id) operation in a dplyr pipe. Each id is associated with at most one value, and two different ids can have the same value. I need to find the cumulative sum across ids by adding a new column:
cum_col = c(10,10,30,30,40,70,110,160)
Using cumsum in mutate computes the cumulative sum over the whole value column and does not pick only one value per group. summarise is not useful, as there are other columns I need to keep intact.
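To make the mismatch concrete, here is a minimal sketch of what cumsum returns on the dummy data, both inside the grouped pipe and on the ungrouped column. The names grouped, ungrouped, cum_grouped and cum_all are made up for this illustration and are not part of the real pipeline.

```r
library(dplyr)

data <- data.frame(
  id = c(1, 1, 2, 2, 3, 4, 5, 6),
  value = c(10, 10, 20, 20, 10, 30, 40, 50),
  other = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# cumsum inside the grouped pipe restarts at every id
grouped <- data %>%
  group_by(id) %>%
  mutate(cum_grouped = cumsum(value)) %>%
  ungroup()

# cumsum on the ungrouped data counts repeated rows of an id more than once
ungrouped <- data %>%
  mutate(cum_all = cumsum(value))

grouped$cum_grouped
# 10 20 20 40 10 30 40 50
ungrouped$cum_all
# 10 20 40 60 70 100 140 190
```

Neither vector matches the desired cum_col, because each id should contribute its value exactly once.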
Is there a way to do this without using summarise and then joining the result back? Or please point me to a link if this has been answered before.
Just for info, the actual data has ~2 million rows and 100 columns.
Recommended answer
Another alternative is to create a dummy column (cols) that holds only the first value per group, with the rest replaced by 0, and then take cumsum over the entire column.
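library(dplyr)

data %>%
  group_by(id) %>%
  # keep value only on the first row of each id, 0 on the remaining rows
  mutate(cols = c(value[1], rep(0, n() - 1))) %>%
  ungroup() %>%
  # the running total over the whole column now counts each id once
  mutate(cum_col = cumsum(cols)) %>%
  select(-cols)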
# A tibble: 8 x 4
# id value other cum_col
# <dbl> <dbl> <dbl> <dbl>
#1 1 10 1 10
#2 1 10 2 10
#3 2 20 3 30
#4 2 20 4 30
#5 3 10 5 40
#6 4 30 6 70
#7 5 40 7 110
#8 6 50 8 160
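A variation on the same idea, offered as a sketch rather than as part of the answer above: the helper column can be skipped by zeroing out repeated ids with base R's duplicated() inside a single mutate. This assumes, as in the dummy data, that every row of an id carries the same value and that rows of the same id sit next to each other. The name result is only for illustration.

```r
library(dplyr)

data <- data.frame(
  id = c(1, 1, 2, 2, 3, 4, 5, 6),
  value = c(10, 10, 20, 20, 10, 30, 40, 50),
  other = c(1, 2, 3, 4, 5, 6, 7, 8)
)

result <- data %>%
  ungroup() %>%  # no effect here, but needed if the input is still grouped
  # duplicated(id) is TRUE for every repeat of an id, so only the first
  # row of each id contributes its value to the running total
  mutate(cum_col = cumsum(ifelse(duplicated(id), 0, value)))

result$cum_col
# 10 10 30 30 40 70 110 160
```

Because this avoids the per-group mutate, it may be worth timing against the grouped version on the full 2 million rows, though no benchmark has been run here.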