R, dplyr: cumulative version of n_distinct
Question
I have a data frame as follows. It is ordered by the column time.
Input:
df = data.frame(time = 1:20,
grp = sort(rep(1:5,4)),
var1 = rep(c('A','B'),10)
)
head(df,10)
time grp var1
1 1 1 A
2 2 1 B
3 3 1 A
4 4 1 B
5 5 2 A
6 6 2 B
7 7 2 A
8 8 2 B
9 9 3 A
10 10 3 B
I want to create another variable, var2, which counts the number of distinct var1 values seen so far, i.e. up to that point in time, within each group grp. This is a little different from what I'd get if I were to use n_distinct, which returns a single count for the whole group.
Expected output:
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
I want to create a function for this, say cum_n_distinct, and use it as:
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))
Recommended answer
Assuming the data is already ordered by time, first define a cumulative distinct function:
dist_cum <- function(var)
sapply(seq_along(var), function(x) length(unique(head(var, x))))
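The function above rescans a growing prefix at every position, so its cost grows roughly quadratically with group length. As an alternative sketch (not the answer's own method), the same running count can be built from base R's duplicated(): an element adds to the count only at its first occurrence. The name cum_n_distinct is borrowed from the question and is otherwise hypothetical.

```r
# Hypothetical helper, named after the function the question asks for.
# duplicated(x) is TRUE for repeats of an earlier element, so
# !duplicated(x) is TRUE exactly at first occurrences; their running
# sum is the number of distinct values seen so far.
cum_n_distinct <- function(x) cumsum(!duplicated(x))

cum_n_distinct(c("A", "B", "A", "B"))
# [1] 1 2 2 2
```

It works on character vectors and factors alike, and it can be dropped into the grouped calls in place of dist_cum. Note that duplicated() treats NA as an ordinary value, so a first NA increases the count by one.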
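Then a base solution that uses ave to create the groups and apply our function to each one. Note that ave needs a numeric vector, so var1 is converted to integer codes. The original call used as.integer(var1), which assumes var1 is a factor; since R 4.0.0, data.frame() leaves strings as character by default, so the call below wraps var1 in factor() to handle both cases:
transform(df, var2=ave(as.integer(factor(var1)), grp, FUN=dist_cum))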
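As a quick sanity check, the base approach can be run end to end on the question's data and compared with the expected output. This is only a sketch: the factor() wrapper is an assumption added here so that the integer conversion also works when var1 is a character column, and res is a hypothetical name for the result.

```r
# Data from the question
df <- data.frame(time = 1:20,
                 grp  = sort(rep(1:5, 4)),
                 var1 = rep(c("A", "B"), 10))

# Cumulative distinct count, as defined in the answer
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))

# factor() is an added assumption: it lets as.integer() produce
# level codes for both character and factor input
res <- transform(df, var2 = ave(as.integer(factor(var1)), grp, FUN = dist_cum))

# Each group of four rows alternates A, B, A, B, so var2 should
# read 1, 2, 2, 2 within every group
stopifnot(all(res$var2 == rep(c(1, 2, 2, 2), 5)))
```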
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
And dplyr, again the same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))