Fast frequency and percentage table with dplyr
Question
I have been using a small tab function for some time, which shows the frequency, percent, and cumulative percent for a vector. The output looks like this:
          Freq    Percent        cum
ARSON      462 0.01988893 0.01988893
BURGLARY 22767 0.98011107 1.00000000
         23229 1.00000000         NA
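As a quick check of what the three columns mean, the figures above can be reproduced with a few lines of base R. This is only an illustrative sketch, not the tab function itself, and it leaves out the totals row; the variable names are chosen for the example.

```r
# Example data from the question: 462 ARSON and 22767 BURGLARY records
x <- c(rep("ARSON", 462), rep("BURGLARY", 22767))

freq    <- c(table(x))        # named integer vector of counts
percent <- freq / sum(freq)   # share of each value
cum     <- cumsum(percent)    # running total of the shares

# Bind the three named vectors into a matrix; row names come from the values
out <- cbind(Freq = freq, Percent = percent, cum = cum)
print(out)
```

Note that the "Percent" column holds proportions between 0 and 1, not values scaled to 100.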
The excellent dplyr package motivated me to update the function. Now I am wondering how I can make the updated version even faster. Here is the old function:
tab = function(x, useNA = FALSE) {
  # one row per distinct non-NA value, plus a final row for the totals
  k = length(unique(x[!is.na(x)])) + 1
  if (useNA) k = k + 1   # extra row for NA
  tab = array(NA, c(k, 3))
  colnames(tab) = c("freq.", "prob.", "cum.")
  useNA = ifelse(useNA, "always", "no")
  rownames(tab) = names(c(table(x, useNA = useNA), ""))
  tab[-nrow(tab), 1] = table(x, useNA = useNA)
  tab[-nrow(tab), 2] = prop.table(table(x, useNA = useNA))
  tab[, 3] = cumsum(tab[, 2])
  # fill the totals row (the cumulative column stays NA there)
  if (k > 2) tab[nrow(tab), -3] = colSums(tab[-nrow(tab), -3])
  if (k == 2) tab[nrow(tab), -3] = tab[-nrow(tab), -3]
  tab
}
and the new one, based on dplyr:
library(dplyr)

tab2 = function(x, useNA = FALSE) {
  if (!useNA) if (any(is.na(x))) x = na.omit(x)
  n = length(x)
  # %>% stands in for %.%, dplyr's original chaining operator,
  # which was deprecated in later dplyr releases
  out = data.frame(x, 1) %>%   # the constant column is auto-named X1
    group_by(x) %>%
    dplyr::summarise(
      Freq = length(X1),
      Percent = Freq / n
    ) %>%
    dplyr::arrange(x)
  ids = as.character(out$x)
  ids[is.na(ids)] = '<NA>'
  out = select(out, Freq, Percent)
  out$cum = cumsum(out$Percent)
  class(out) = "data.frame"
  out = rbind(out, c(n, 1, NA))   # totals row
  rownames(out) = c(ids, '')
  out
}
Finally, some performance benchmarks:
x1 = c(rep('ARSON',462),rep('BURGLARY',22767))
x2 = c(rep('ARSON',462),rep('BURGLARY',22767),rep(NA,100))
x3 = c(c(1:10),c(1:10),1,4)
x4 = c(rep(c(1:100),500),rep(c(1:50),20),1,4)
library('rbenchmark')

benchmark(tab(x1), tab2(x1), replications=100)[,c('test','elapsed','relative')]
#       test elapsed relative
# 1  tab(x1)   1.412    2.307
# 2 tab2(x1)   0.612    1.000

benchmark(tab(x2),tab2(x2), replications=100)[,c('test','elapsed','relative')]
#       test elapsed relative
# 1  tab(x2)   1.351    1.475
# 2 tab2(x2)   0.916    1.000

benchmark(tab(x2,useNA=TRUE), tab2(x2,useNA=TRUE), replications=100)[,c('test','elapsed','relative')]
#                     test elapsed relative
# 1  tab(x2, useNA = TRUE)   1.883    2.282
# 2 tab2(x2, useNA = TRUE)   0.825    1.000

benchmark(tab(x3), tab2(x3), replications=1000)[,c('test','elapsed','relative')]
#       test elapsed relative
# 1  tab(x3)   0.997    1.000
# 2 tab2(x3)   2.194    2.201

benchmark(tab(x4), tab2(x4), table(x4), replications=100)[,c('test','elapsed','relative')]
#        test elapsed relative
# 1   tab(x4)  19.481   18.714
# 2  tab2(x4)   1.041    1.000
# 3 table(x4)   6.515    6.258
tab2 is faster except on the very short vector x3. The performance gain becomes evident on the larger vector (see x4, with 51002 observations). It is also faster than table, even though the function is doing much more.
Now to my question: How can I further improve performance? Creating tables with frequencies and percentages is a pretty standard task, and a fast implementation is very useful when you work with large datasets.
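One base R direction that may be worth benchmarking alongside the versions above: table() converts its input to a factor before counting, and that conversion can be skipped by matching each element against the sorted unique values and counting the positions with tabulate(). The sketch below is a hedged illustration rather than a tested replacement; tab_base is a hypothetical name that does not appear in the original post, it omits the totals row, and no timings are claimed for it here.

```r
# Hypothetical helper (not from the original post): frequency, proportion and
# cumulative proportion without going through factor()/table().
tab_base <- function(x, useNA = FALSE) {
  if (!useNA) x <- x[!is.na(x)]            # drop missing values unless requested
  ux <- sort(unique(x), na.last = TRUE)    # distinct values, NA (if any) last
  # match() maps every element to its position in ux (NA matches NA);
  # tabulate() then counts how often each position occurs
  freq <- tabulate(match(x, ux), nbins = length(ux))
  n <- length(x)
  data.frame(
    Freq    = freq,
    Percent = freq / n,
    cum     = cumsum(freq) / n,
    row.names = ifelse(is.na(ux), "<NA>", as.character(ux))
  )
}

x1 <- c(rep("ARSON", 462), rep("BURGLARY", 22767))
x2 <- c(x1, rep(NA, 100))
print(tab_base(x1))
print(tab_base(x2, useNA = TRUE))
```

Whether this beats tab2 or a data.table version will depend on the vector type and the number of distinct values, so it should be timed on the same test vectors before drawing conclusions.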
EDIT: Here is an additional test case with a vector of length 2e6 (including the data.table solution proposed below):
x5 = sample(c(1:100),2e6, replace=TRUE)
benchmark(tab(x5), tab2(x5), table(x5), tabdt(x5), replications=100)[,c('test','elapsed','relative')]
#        test elapsed relative
# 1   tab(x5) 350.878   19.444
# 2  tab2(x5)  52.917    2.932
# 4 tabdt(x5)  18.046    1.000
# 3 table(x5)  98.429    5.454
Recommended answer
As I'm a big fan of library(data.table), I wrote a similar function:
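library(data.table)

tabdt <- function(x){
  n <- length(which(!is.na(x)))   # number of non-missing values
  dt <- data.table(x)
  # .N is the number of rows in each group
  out <- dt[, list(Freq = .N, Percent = .N / n), by = x]
  # cumulative share, computed over the non-NA groups only
  out[!is.na(x), CumSum := cumsum(Percent)]
  out
}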
> benchmark(tabdt(x1), tab2(x1), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x1)    5.60    1.879
1 tabdt(x1)    2.98    1.000
> benchmark(tabdt(x2), tab2(x2), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x2)    6.34    1.686
1 tabdt(x2)    3.76    1.000
> benchmark(tabdt(x3), tab2(x3), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x3)    1.65    1.000
1 tabdt(x3)    2.34    1.418
> benchmark(tabdt(x4), tab2(x4), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x4)   14.35    1.000
1 tabdt(x4)   22.04    1.536
So the data.table approach was faster for x1 and x2, while dplyr was faster for x3 and x4. Actually, I don't see any room for improvement using these approaches.
P.S. Would you add the data.table keyword to this question? I believe people would love to see a dplyr vs. data.table performance comparison (see, for example, "data.table vs dplyr: can one do something well the other can't or does poorly?").