match.fun slower than actual function in R
Question
I have large data sets with rows that measure the same thing (essentially duplicates with some noise). As part of a larger function I am writing, I want the user to be able to collapse these rows with a function of their choosing (e.g. mean, median).
My problem is that calling the function directly is much faster than going through match.fun (which is what I need). MWE:
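library(data.table)
rows <- 100000
cols <- 1000
# One id column plus 1000 numeric columns of random noise
dat <- data.table(id=sample(LETTERS, rows, replace=TRUE),
                  matrix(rnorm(rows*cols), nrow=rows))
aggFn <- "median"
# Direct call to median
system.time(dat[, lapply(.SD, median), by=id])
# Same aggregation, with the function looked up by name
system.time(dat[, lapply(.SD, match.fun(aggFn)), by=id])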
On my system, timing results for the last 2 lines:
user system elapsed
1.112 0.027 1.141
user system elapsed
2.854 0.265 3.121
This becomes quite dramatic with larger data sets.
As a final point, I realize aggregate() can do this (and doesn't seem to suffer from this behavior), but I need to work with data.table objects due to data size.
Recommended answer
The reason is the GForce optimization that data.table applies to median. You can see this if you set options(datatable.verbose=TRUE). See help("GForce") for details.
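As a rough illustration of that check, the sketch below runs both forms on a toy table with verbose output switched on. The table `small` and its columns are assumptions made for this example, and the exact wording of the diagnostic messages may differ between data.table versions, so look for lines that mention GForce rather than for a specific string.

```r
library(data.table)

# Toy data (an assumption for illustration; far smaller than the table in the question).
small <- data.table(id = rep(c("a", "b"), each = 5),
                    x  = as.numeric(1:10),
                    y  = as.numeric(11:20))

options(datatable.verbose = TRUE)   # print data.table's optimization messages
res_direct <- small[, lapply(.SD, median), by = id]
res_match  <- small[, lapply(.SD, match.fun("median")), by = id]
options(datatable.verbose = FALSE)  # switch the diagnostics off again

# Both calls return the same values; only the evaluation path differs.
stopifnot(isTRUE(all.equal(res_direct, res_match)))
```

Comparing the messages printed for the two calls should show which of them data.table was able to optimize.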
If you compare using a function that GForce does not optimize, for example median called through an alias, the two timings become much more similar:
fun <- median
aggFn <- "fun"
system.time(dat[, lapply(.SD, fun), by=id])
system.time(dat[, lapply(.SD, match.fun(aggFn)), by=id])
A possible workaround that still benefits from the optimization, when the chosen function happens to be supported, is to evaluate an expression built with its name, e.g., using the dreaded eval(parse()):
dat[, eval(parse(text = sprintf("lapply(.SD, %s)", aggFn))), by=id]
However, you would lose the small amount of safety that match.fun adds.
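One way to keep some of that safety is to validate the name with match.fun first and only then build the expression from a symbol rather than from pasted text. The following is a minimal sketch, not part of the original answer; `build_j` is a hypothetical helper name.

```r
# Hypothetical helper: check the name with match.fun, then build the j expression.
build_j <- function(fn_name) {
  match.fun(fn_name)  # throws an error unless fn_name names an existing function
  # Insert the name as a symbol so the resulting call reads lapply(.SD, median)
  bquote(lapply(.SD, .(as.name(fn_name))))
}

expr <- build_j("median")

# A string that does not name a function is rejected before anything is evaluated.
rejected <- tryCatch(build_j("median); cat('injected'"),
                     error = function(e) "rejected")
```

The returned expression can then be used as dat[, eval(expr), by=id], in the same way as the bquote approach in this answer.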
If you have a list of functions the users can choose from, you could do this:
funs <- list(quote(mean), quote(median))
fun <- funs[[1]] #select
expr <- bquote(lapply(.SD, .(fun)))
a <- dat[, eval(expr), by=id]
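The same idea can be wrapped in a function that accepts the user's choice as a string and looks it up in a named list. This is a sketch under assumptions: `collapse_rows`, the `allowed` argument and the toy table are hypothetical names introduced for illustration, and the grouping column is assumed to be called id, as in the question.

```r
library(data.table)

# Named whitelist of quoted function names the user may pick from.
funs <- list(mean = quote(mean), median = quote(median))

# Hypothetical wrapper: collapse rows of dt by id using the chosen function.
collapse_rows <- function(dt, choice, allowed = funs) {
  fun <- allowed[[choice]]          # NULL if choice is not in the whitelist
  if (is.null(fun)) stop("unsupported aggregation function: ", choice)
  expr <- bquote(lapply(.SD, .(fun)))
  dt[, eval(expr), by = id]
}

# Toy data for illustration only.
small <- data.table(id = rep(c("a", "b"), each = 5),
                    x  = as.numeric(1:10))
res <- collapse_rows(small, "mean")
```

Because the expression contains the plain function name, data.table sees the same call it would see if the function had been typed directly.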