Fastest way of determining most frequent factor in a grouped data frame in dplyr
I am trying to find the most frequent value within a group for several factor variables while summarizing a data frame in dplyr. I need a formula that does the following:
- Find the most frequent factor level of one variable within a group (so basically "max()" over the counts of factor levels).
- If several levels tie for most frequent, pick any one of them.
- Return the factor level name (not the count).
There are several formulas that work. However, those that I could think of are all slow. Those that are fast are not convenient to apply to several variables in a data frame at once. I was wondering if somebody knows a fast method that integrates nicely with dplyr.
I tried the following:
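Generating sample data (50000 groups with 100 random letters each)
# stringsAsFactors = TRUE is needed on R >= 4.0 to get the factor column shown below
z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE), stringsAsFactors = TRUE)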
str(z)
'data.frame': 5000000 obs. of 2 variables:
$ a: int 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 26 levels "A","B","C","D",..: 6 4 14 12 3 19 17 19 15 20 ...
"Clean"-but-slow approach 1
y <- z %>%
group_by(a) %>%
summarise(c = names(table(b))[which.max(table(b))])
user system elapsed
26.772 2.011 29.568
"Clean"-but-slow approach 2
y <- z %>%
group_by(a) %>%
summarise(c = names(which(table(b) == max(table(b)))[1]))
user system elapsed
29.329 2.029 32.361
"Clean"-but-slow approach 3
y <- z %>%
group_by(a) %>%
summarise(c = names(sort(table(b),decreasing = TRUE)[1]))
user system elapsed
35.086 6.905 42.485
"Messy"-but-fast approach
y <- z %>%
group_by(a,b) %>%
summarise(counter = n()) %>%
group_by(a) %>%
filter(counter == max(counter))
y <- y[!duplicated(y$a),]
y$counter <- NULL  # drop the helper column (the chained assignment would set y itself to NULL)
user system elapsed
7.061 0.330 7.664
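The "messy" approach above can probably be written more compactly. The following is only a sketch, assuming dplyr 1.0.0 or later (for slice_max()); the small data set z_small is made up for illustration and is not the benchmark data, and no timings are claimed for it.

```r
library(dplyr)

# Small illustrative data set (hypothetical, not the 5-million-row data above)
set.seed(1)
z_small <- data.frame(
  a = rep(1:5, each = 20),
  b = factor(sample(LETTERS[1:4], 100, replace = TRUE))
)

y <- z_small %>%
  count(a, b, name = "counter") %>%                 # one row per (a, b) with its frequency
  group_by(a) %>%
  slice_max(counter, n = 1, with_ties = FALSE) %>%  # keep a single top level per group
  ungroup() %>%
  select(a, b)                                      # return the level, not the count
```

Setting with_ties = FALSE covers the requirement that any one of several tied levels may be returned.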
data.table is still the fastest choice for this:
z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE))
Benchmarking:
library(data.table)
library(dplyr)
#dplyr
system.time({
y <- z %>%
group_by(a) %>%
summarise(c = names(which(table(b) == max(table(b)))[1]))
})
user system elapsed
14.52 0.01 14.70
#data.table
system.time(
setDT(z)[, .N, by=b][order(N),][.N,]
)
user system elapsed
0.05 0.02 0.06
# @zx8754's way - base R
system.time(
names(sort(table(z$b),decreasing = TRUE)[1])
)
user system elapsed
0.73 0.06 0.81
As can be seen, using data.table like this:
setDT(z)[, .N, by=b][order(N),][.N,]
or
# just to get the name
setDT(z)[, .N, by=b][order(N),][.N, b]
seems to be the fastest.
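Note that the data.table calls above count b over the whole table, while the dplyr call counts within each group of a, so the two timings are not directly comparable. A per-group data.table version might look like the sketch below; the data set dt and the names counts and top are made up for illustration, and the sketch has not been benchmarked here.

```r
library(data.table)

# Small illustrative data set (hypothetical, not the benchmark data)
set.seed(1)
dt <- data.table(
  a = rep(1:5, each = 20),
  b = sample(LETTERS[1:4], 100, replace = TRUE)
)

counts <- dt[, .N, by = .(a, b)]            # frequency of each level within each group
setorder(counts, a, -N)                     # most frequent level first within each a
top <- unique(counts, by = "a")[, .(a, b)]  # first row per group = one most frequent level
```

unique() with by = "a" keeps the first row of each group, so ties are resolved by whichever tied level happens to come first after sorting.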
Update for all columns:
Using @zx8754's data
set.seed(123)
z2 <- data.frame(a = rep(1:50000,100),
b = sample(LETTERS, 5000000, replace = TRUE),
c = sample(LETTERS, 5000000, replace = TRUE),
d = sample(LETTERS, 5000000, replace = TRUE))
You could do:
#with data.table
system.time(
sapply(c('b','c','d'), function(x) {
data.table(x = z2[[x]])[, .N, by=x][order(N),][.N, x]
}))
user system elapsed
0.34 0.00 0.34
#with base-R
system.time(
sapply(c("b","c","d"), function(i)
names(sort(table(z2[,i]),decreasing = TRUE)[1]))
)
user system elapsed
4.14 0.11 4.26
And just to confirm results are the same:
sapply(c('b','c','d'), function(x) {
data.table(x = z2[[x]])[, .N, by=x][order(N),][.N, x]
})
b c d
S N G
sapply(c("b","c","d"), function(i)
names(sort(table(z2[,i]),decreasing = TRUE)[1]))
b c d
"S" "N" "G"
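The multi-column examples above also work on whole columns rather than per group of a. To stay inside dplyr and handle several columns per group, one option is a small helper built on tabulate(), which counts a factor's integer codes directly. This is a sketch only: most_frequent is a hypothetical helper, not something from the posts above, z_small is made-up data, across() needs dplyr 1.0.0 or later, and each group is assumed to contain at least one non-missing value.

```r
library(dplyr)

# Hypothetical helper: name of the most frequent level of x.
# which.max() returns the first maximum, so ties resolve to the first tied level.
most_frequent <- function(x) {
  x <- as.factor(x)
  levels(x)[which.max(tabulate(x, nbins = nlevels(x)))]
}

# Small illustrative data set (not the benchmark data)
set.seed(1)
z_small <- data.frame(
  a = rep(1:5, each = 20),
  b = sample(LETTERS[1:4], 100, replace = TRUE),
  c = sample(LETTERS[1:4], 100, replace = TRUE),
  d = sample(LETTERS[1:4], 100, replace = TRUE)
)

y <- z_small %>%
  group_by(a) %>%
  summarise(across(all_of(c("b", "c", "d")), most_frequent))
```

The result has one row per value of a and one character column per summarised variable, which matches the "level name, not count" requirement from the question.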