Why is it faster to evaluate in `j` than with `$` in a `data.table`?
Question
Perhaps this is already answered and I missed it, but it's hard to search.
A very simple question: why is `dt[,x]` generally a tiny bit faster than `dt$x`?
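Example:

    library(data.table)
    library(microbenchmark)
    dt <- data.table(id = 1:1e7, var = rnorm(1e6))
    # Label the expressions in the call itself: microbenchmark runs them in random
    # order by default, so relabelling the result rows by position would mislabel them.
    test <- microbenchmark(times = 100L,
      "in j" = dt[sample(1e7, size = 200000), var],
      "$"    = dt[sample(1e7, size = 200000), ]$var)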
    Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval
        $ 14.28863 15.88779 18.84229 17.23109 18.41577 53.63473   100
     in j 14.35916 15.97063 18.87265 17.99266 18.37939 54.19944   100
I might not have chosen the best example, so feel free to suggest something perhaps more poignant.
Anyway, evaluating in `j` is faster at least 75% of the time (though there appears to be a fat upper tail, as the mean is higher; side note: it would be nice if `microbenchmark` could spit me out some histograms).
Why is this the case?
Answer
With `j`, you are subsetting and selecting within a single call to `[.data.table`.
With `$` (and your call), you are subsetting within `[.data.table` and then selecting with `$`.
You are in essence calling 2 functions, not 1, thus there is a negligible difference in timing.
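To make the "two functions, not one" point concrete, here is a minimal sketch on a tiny table. The 10-row table, the seed, and the hand-picked row numbers in `ii` are illustrative assumptions, not part of the original benchmark.

```r
library(data.table)

set.seed(1)                          # illustrative seed, not from the original post
dt <- data.table(id = 1:10, var = rnorm(10))
ii <- c(2L, 5L, 9L)                  # hypothetical row numbers chosen for illustration

# One call: `[.data.table` subsets the rows and evaluates `var` in j.
one_call <- dt[ii, var]

# Two calls: `[.data.table` first returns an intermediate data.table
# holding every column, then `$` extracts `var` from it.
intermediate <- dt[ii]
two_calls <- intermediate$var

is.data.table(intermediate)          # the intermediate result is itself a data.table
ncol(intermediate)                   # it carries both columns, id and var
identical(one_call, two_calls)       # both routes yield the same vector
```

Both routes return the same values; the `$` form simply goes through one extra step and one extra intermediate object.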
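In your current example you are calling `sample(1e7, size=200000)` inside each expression, so every run draws different rows and the cost of sampling is part of both timings.
For a comparison in which every expression returns identical results, draw the indices once: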
    dt <- data.table(id = 1:1e7, var = rnorm(1e6))
    setkey(dt, id)
    ii <- sample(1e7, size = 200000)
    microbenchmark(
      "in j" = dt[.(ii), var],
      "$"    = dt[.(ii)]$var,
      "[["   = dt[.(ii)][["var"]],
      .subset2(dt[.(ii)], "var"),
      dt[.(ii)][[2]],
      dt[["var"]][ii],
      dt$var[ii],
      .subset2(dt, "var")[ii]
    )
    Unit: milliseconds
    expr                             min        lq      mean    median        uq       max neval cld
    in j                       39.491156 40.358669 41.570057 40.860342 41.485622 70.202441   100   b
    $                          39.957211 40.561965 41.587420 41.136836 41.634584 69.928363   100   b
    [[                         40.046558 40.515480 42.388432 41.244444 41.750946 72.224827   100   b
    .subset2(dt[.(ii)], "var") 39.772781 40.564077 41.561271 41.111630 41.635489 69.252222   100   b
    dt[.(ii)][[2]]             40.004300 40.513669 41.682526 40.927503 41.492866 72.986995   100   b
    dt[["var"]][ii]             4.432346  4.546898  4.946219  4.623416  4.755777 31.761115   100  a
    dt$var[ii]                  4.440496  4.539502  4.668361  4.597457  4.729214  5.425125   100  a
    .subset2(dt, "var")[ii]     4.365939  4.508261  4.660435  4.598815  4.703858  6.072289   100  a
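The timings above fall into two groups: the expressions that go through `[.data.table` with a keyed join, and the ones that extract the column first and then index the resulting plain vector. The sketch below checks, on a small table, that both groups return the same values. The table size `n`, the seed, and the `via_*` variable names are illustrative assumptions. It relies on `id` being equal to the row number, as in the benchmark.

```r
library(data.table)

set.seed(1)                                # illustrative seed, not from the original post
n <- 100L                                  # small size assumed for a quick check
dt <- data.table(id = seq_len(n), var = rnorm(n))
setkey(dt, id)
ii <- sample(n, size = 20L)                # indices drawn once and reused everywhere

# Slower group in the benchmark: keyed join on id inside `[.data.table`.
via_join <- dt[.(ii), var]

# Faster group: pull the column out as a plain vector, then index it.
via_dollar  <- dt$var[ii]
via_bracket <- dt[["var"]][ii]
via_subset2 <- .subset2(dt, "var")[ii]

identical(via_join, via_dollar)
identical(via_dollar, via_bracket)
identical(via_bracket, via_subset2)
```

Because the column is extracted before indexing, the second group never calls `[.data.table` at all, which is one plausible reading of the gap between the two groups in the table. Note that this shortcut only matches the join here because the key values coincide with row positions.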