Does the by() function make a growing list?
Question
Does the by function make a list that grows one element at a time?
I need to process a data frame with about 4M observations grouped by a factor column. The situation is similar to the example below:
> # Make 4M rows of data
> x = data.frame(col1=1:4000000, col2=10000001:14000000)
> # Make a factor
> x[,"f"] = x[,"col1"] - x[,"col1"] %% 5
>
> head(x)
col1 col2 f
1 1 10000001 0
2 2 10000002 0
3 3 10000003 0
4 4 10000004 0
5 5 10000005 5
6 6 10000006 5
Now, a tapply on one of the columns takes a reasonable amount of time:
> t1 = Sys.time()
> z = tapply(x[, 1], x[, "f"], mean)
> Sys.time() - t1
Time difference of 22.14491 secs
But if I do this:
z = by(x[, 1], x[, "f"], mean)
That doesn't finish in anywhere near the same time (I gave up after a minute).
Of course, in the above example, tapply could be used, but I actually need to process multiple columns together. What is a better way to do this?
Recommended answer
by is slower than tapply because it is a wrapper around tapply.
Let's take a look at some benchmarks: tapply in this situation is more than 3x faster than using by.
UPDATED to include @Roland's great recommendation:
library(rbenchmark)
library(data.table)
dt <- data.table(x,key="f")
using.tapply <- quote(tapply(x[, 1], x[, "f"], mean))
using.by <- quote(by(x[, 1], x[, "f"], mean))
using.dtable <- quote(dt[,mean(col1),by=key(dt)])
times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")]
#------------------------#
# RESULTS #
#------------------------#
# COMPARING tapply VS by #
#-----------------------------------
# test elapsed relative
# 1 using.tapply 2.453 1.000
# 2 using.by 8.889 3.624
# COMPARING data.table VS tapply VS by #
#------------------------------------------#
# test elapsed relative
# 2 using.dtable 0.168 1.000
# 1 using.tapply 2.396 14.262
# 3 using.by 8.566 50.988
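If x$f is a factor, the loss in efficiency between tapply and by is even greater!
Note, though, that both improve relative to non-factor inputs, while data.table stays roughly the same or gets slightly worse.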
x[, "f"] <- as.factor(x[, "f"])
dt <- data.table(x,key="f")
times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")]
# test elapsed relative
# 2 using.dtable 0.175 1.000
# 1 using.tapply 1.803 10.303
# 3 using.by 7.854 44.880
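Since the question asks about processing several columns together, here is a minimal sketch of how the data.table approach from the benchmark might extend to that case. The 20-row data frame and the summary names mean_col1 and sum_prod are made up for illustration; only the grouping column f follows the question.

```r
library(data.table)

# Small stand-in for the 4M-row data frame from the question
x <- data.frame(col1 = 1:20, col2 = 101:120)
x$f <- x$col1 - x$col1 %% 5

# Key the table on the grouping column, as in the benchmark
dt <- data.table(x, key = "f")

# Each expression inside .() is evaluated once per group and can
# refer to any column, so several columns can be combined at once.
# mean_col1 and sum_prod are illustrative names, not from the source.
res <- dt[, .(mean_col1 = mean(col1),
              sum_prod  = sum(col1 * col2)),
          by = f]
print(res)
```

The result has one row per value of f and one column per summary.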
From ?by:
Description
Function by is an object-oriented wrapper for tapply applied to data frames.
Let's take a look at the source for by (or more specifically, by.data.frame):
by.data.frame
function (data, INDICES, FUN, ..., simplify = TRUE)
{
    if (!is.list(INDICES)) {
        IND <- vector("list", 1L)
        IND[[1L]] <- INDICES
        names(IND) <- deparse(substitute(INDICES))[1L]
    }
    else IND <- INDICES
    FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
    nd <- nrow(data)
    ans <- eval(substitute(tapply(seq_len(nd), IND, FUNx, simplify = simplify)),
        data)
    attr(ans, "call") <- match.call()
    class(ans) <- "by"
    ans
}
We see immediately that there is still a call to tapply, plus a lot of extras (including a call to deparse(substitute(.)) and a call to eval(substitute(.)), both of which are relatively slow). It therefore makes sense that your tapply call will be relatively faster than a similar call to by.
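The source above shows that by.data.frame runs tapply over the row indices and subsets the data frame once per group inside FUNx. That per-group subsetting is probably a further cost when there are many groups, though it was not measured here. As a rough base R sketch, the same index-based tapply can be written by hand for a multi-column summary, indexing plain vectors instead of the data frame. The 20-row data and the helper group_fun are hypothetical, chosen only to keep the example small.

```r
# Small stand-in for the 4M-row data frame from the question
x <- data.frame(col1 = 1:20, col2 = 101:120)
x$f <- x$col1 - x$col1 %% 5

# Pull the columns out once so each group only indexes plain vectors
col1 <- x$col1
col2 <- x$col2

# Hypothetical per-group summary that needs two columns together
group_fun <- function(i) sum(col1[i] * col2[i])

# Same pattern as by.data.frame: tapply over the row indices
res_tapply <- tapply(seq_len(nrow(x)), x$f, group_fun)

# Equivalent by() call, for comparison
res_by <- by(x, x$f, function(d) sum(d$col1 * d$col2))

print(res_tapply)
```

Both calls give the same per-group values. The tapply version skips building a small data frame for every group, which may matter with hundreds of thousands of groups.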