Fastest way of determining most frequent factor in a grouped data frame in dplyr

Views: 80 | Posted: 2017/7/13 21:04:41 | Tags: r, performance, dplyr

Problem description

I am trying to find the most frequent value within a group for several factor variables while summarizing a data frame in dplyr. I need a formula that does the following:

1. Find the most frequent factor level of a variable within each group (so basically "max()" over the counts of factor levels).
2. If several levels tie for most frequent, pick any one of them.
3. Return the factor-level name (not its count).

There are several formulas that work. However, those that I could think of are all slow. Those that are fast are not convenient to apply to several variables in a data frame at once. I was wondering if somebody knows a fast method that integrates nicely with dplyr.

I tried the following:

Generating sample data (50000 groups with 100 random letters each):

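# stringsAsFactors = TRUE keeps b a factor on R >= 4.0.0, matching the str() output below
z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE), stringsAsFactors = TRUE)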

str(z)
'data.frame':   5000000 obs. of  2 variables:
$ a: int  1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 26 levels "A","B","C","D",..: 6 4 14 12 3 19 17 19 15 20 ...

"Clean"-but-slow approach 1

y <- z %>%
    group_by(a) %>% 
    summarise(c = names(table(b))[which.max(table(b))])

user    system  elapsed 
26.772  2.011   29.568

"Clean"-but-slow approach 2

y <- z %>% 
    group_by(a) %>% 
    summarise(c = names(which(table(b) == max(table(b)))[1]))

user    system  elapsed 
29.329  2.029   32.361

"Clean"-but-slow approach 3

y <- z %>% 
    group_by(a) %>% 
    summarise(c = names(sort(table(b),decreasing = TRUE)[1]))

user    system  elapsed 
35.086  6.905   42.485

"Messy"-but-fast approach

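y <- z %>%
    group_by(a, b) %>%
    summarise(counter = n()) %>%       # count each (a, b) pair
    group_by(a) %>%
    filter(counter == max(counter))    # keep the most frequent level(s) per group
y <- y[!duplicated(y$a), ]             # on ties, keep the first row per group
y$counter <- NULL                      # drop the count column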

user   system  elapsed 
7.061  0.330   7.664
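
The count-then-filter idea above can also be written as a single pipeline. The following is a minimal sketch, assuming dplyr 1.0.0 or later (for `slice_max()`); the small data frame `z_small` is made up for illustration, and no timings were measured for this variant.

```r
library(dplyr)

# Tiny illustrative data set (hypothetical): two groups in `a`, letters in `b`.
z_small <- data.frame(
  a = c(1, 1, 1, 2, 2),
  b = c("x", "y", "y", "z", "z")
)

y_small <- z_small %>%
  count(a, b, name = "counter") %>%                  # one row per (a, b) pair with its count
  group_by(a) %>%
  slice_max(counter, n = 1, with_ties = FALSE) %>%   # keep a single top row per group, even on ties
  ungroup() %>%
  select(-counter)                                   # keep only the group and its top level

print(y_small)
```

With `with_ties = FALSE` each group yields exactly one row, which replaces the separate `duplicated()` step.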

Solution

data.table is still the fastest choice for this:

z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE))

Benchmarking:

library(data.table)
library(dplyr)

#dplyr
system.time({
  y <- z %>% 
    group_by(a) %>% 
    summarise(c = names(which(table(b) == max(table(b)))[1]))  
})
 user  system elapsed 
14.52    0.01   14.70 

#data.table
system.time(
  setDT(z)[, .N, by=b][order(N),][.N,]
)
 user  system elapsed 
 0.05    0.02    0.06 

#@zx8754 's way - base R
system.time(
  names(sort(table(z$b),decreasing = TRUE)[1])
)
   user  system elapsed 
   0.73    0.06    0.81

As can be seen, using data.table like this:

  setDT(z)[, .N, by=b][order(N),][.N,]

  #just to get the name
  setDT(z)[, .N, by=b][order(N),][.N, b]

seems to be the fastest.
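
Note that the data.table one-liner counts `b` over the whole table rather than within each group `a`, so its timing is not directly comparable with the grouped dplyr call. A per-group variant might look like the sketch below. It is an illustration under assumptions, not a benchmarked result: `z_small`, `counts`, and `top` are made-up names, and `as.data.table()` is used instead of `setDT()` so the original data frame is left untouched.

```r
library(data.table)

# Tiny illustrative data set (hypothetical): two groups in `a`, letters in `b`.
z_small <- data.frame(
  a = c(1, 1, 1, 2, 2),
  b = c("x", "y", "y", "z", "z")
)

dt <- as.data.table(z_small)           # makes a copy; setDT() would convert in place
counts <- dt[, .N, by = .(a, b)]       # count each (a, b) pair
setorder(counts, a, -N)                # within each group, largest count first
top <- unique(counts, by = "a")[, .(a, b)]  # first row per group = most frequent level

print(top)
```

Because `unique(..., by = "a")` keeps only the first row of each group, ties are resolved by taking whichever tied level comes first after sorting.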

Update for all columns:

Using @zx8754 's data

set.seed(123)
z2 <- data.frame(a = rep(1:50000,100),
                b = sample(LETTERS, 5000000, replace = TRUE),
                c = sample(LETTERS, 5000000, replace = TRUE),
                d = sample(LETTERS, 5000000, replace = TRUE))

You could do:

#with data.table
system.time(
 sapply(c('b','c','d'), function(x) {
  data.table(x = z2[[x]])[, .N, by=x][order(N),][.N, x] 
 }))
 user  system elapsed 
 0.34    0.00    0.34 

#with base-R
system.time(
  sapply(c("b","c","d"), function(i)
    names(sort(table(z2[,i]),decreasing = TRUE)[1]))
)
 user  system elapsed 
 4.14    0.11    4.26

And just to confirm results are the same:

sapply(c('b','c','d'), function(x) {
  data.table(x = z2[[x]])[, .N, by=x][order(N),][.N, x] 
})
b c d 
S N G 

sapply(c("b","c","d"), function(i)
    names(sort(table(z2[,i]),decreasing = TRUE)[1]))
b   c   d 
"S" "N" "G"

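For the dplyr integration asked about in the question (several variables at once, per group), one option is a small helper passed to `across()`. This is only a sketch, assuming dplyr 1.0.0 or later; `most_frequent` is a hypothetical helper name, not part of any package, and it uses `tabulate()` on the factor codes, which tends to carry less overhead than `table()` but was not timed here.

```r
library(dplyr)

# Hypothetical helper: returns the name of the most frequent level.
# On ties, which.max() picks the first level in level order.
most_frequent <- function(x) {
  f <- as.factor(x)
  counts <- tabulate(f, nbins = nlevels(f))  # counts per level, in level order
  levels(f)[which.max(counts)]
}

# Tiny illustrative data set (hypothetical).
z_small <- data.frame(
  a = c(1, 1, 1, 2, 2),
  b = c("x", "y", "y", "z", "z"),
  c = c("p", "p", "q", "q", "q")
)

res <- z_small %>%
  group_by(a) %>%
  summarise(across(b:c, most_frequent))  # apply the helper to each column per group

print(res)
```

The helper returns a character value, so the summarised columns come back as character vectors rather than factors.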