Why is DT1[DT2][, value1-value] faster than DT1[DT2, value1-value] on a data.table with fewer columns?
Question
This is related to this question (Can I access repeated column names in `j` in a data.table join?), which was asked because I assumed that the opposite of this was true.
Suppose you wish to join two data.tables and then perform a simple operation on two joined columns. This can be done in either one or two calls to `[`:
N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))
DT2 = data.table(name = 1:N, value1 = rnorm(N))
setkey(DT1, name)
system.time({x = DT1[DT2, value1 - value]}) # One Step
system.time({x = DT1[DT2][, value1 - value]}) # Two Step
It turns out that making two calls - doing the join first, and then doing the subtraction - is noticeably quicker than doing it all in one go.
> system.time({x = DT1[DT2, value1 - value]})
user system elapsed
0.67 0.00 0.67
> system.time({x = DT1[DT2][, value1 - value]})
user system elapsed
0.14 0.01 0.16
Why is this?

If you put a LOT of columns into the data.table, then you do eventually find that the one-step approach is quicker - presumably because data.table only uses the columns you reference in `j`.
N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
DT2 = data.table(name = 1:N, value1 = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
setkey(DT1, name)
system.time({x = DT1[DT2, value1 - value]})
system.time({x = DT1[DT2][, value1 - value]})
> system.time({x = DT1[DT2, value1 - value]})
user system elapsed
0.89 0.02 0.90
> system.time({x = DT1[DT2][, value1 - value]})
user system elapsed
1.64 0.16 1.81
Answer
I think this is due to the repeated subsetting that DT1[DT2, value1-value] performs, once for every name in DT2. That is, you have to perform a `j` operation for each `i` here, as opposed to just one `j` operation after the join. This becomes quite costly with 1e6 unique entries; the overhead of [.data.table becomes significant and noticeable.
DT1[DT2][, value1-value] # similar to rowSums
DT1[DT2, value1-value]
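One way to see how often `j` is evaluated in each form is to wrap the subtraction in a counting function. This is a small sketch (the function `f` and counter `n` are hypothetical, not part of the original answer); note that on data.table 1.9.4 and later the per-`i` behavior described above is opt-in via by = .EACHI, whereas on the older versions discussed here it was the default for a join:

```r
library(data.table)

f <- function(a, b) { n <<- n + 1L; a - b }  # increments n on each evaluation of j

DT1 <- data.table(name = 1:5, value = rnorm(5), key = "name")
DT2 <- data.table(name = 1:5, value1 = rnorm(5))

n <- 0L; DT1[DT2, f(value1, value), by = .EACHI]; n  # j evaluated once per row of i
n <- 0L; DT1[DT2][, f(value1, value)]; n             # join first: j evaluated once
```

With 1e6 rows instead of 5, that difference in evaluation counts is what the timings above are measuring.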
In the first case, DT1[DT2], you perform the join first, and it is really fast. (Of course, with more columns, as you show, you'll see a difference.) But the point is that the join is performed only once. In the second case, you're grouping DT1 by DT2's name, and for every one of them you're computing the difference. That is, you're subsetting DT1 for each value of DT2 - one `j` operation per subset! You can see this better by running:
Rprof()
t1 <- DT1[DT2, value1-value]
Rprof(NULL)
summaryRprof()
# $by.self
# self.time self.pct total.time total.pct
# "[.data.table" 0.96 97.96 0.98 100.00
# "-" 0.02 2.04 0.02 2.04
Rprof()
t2 <- DT1[DT2][, value1-value]
Rprof(NULL)
summaryRprof()
# $by.self
# self.time self.pct total.time total.pct
# "[.data.table" 0.22 84.62 0.26 100.00
# "-" 0.02 7.69 0.02 7.69
# "is.unsorted" 0.02 7.69 0.02 7.69
This overhead from repeated subsetting seems to be overcome when you have too many columns, and the join over many columns takes over as the time-consuming operation. You can probably verify this yourself by profiling the other code.
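Worth noting: this behavior changed in data.table 1.9.4 and later. `j` combined with a join in `i` is now evaluated once over the join result, and the old per-`i` grouping ("by-without-by") must be requested explicitly with by = .EACHI. A sketch of re-running the original timing under the newer semantics (assuming data.table >= 1.9.4):

```r
library(data.table)

N <- 1e6
DT1 <- data.table(name = 1:N, value = rnorm(N), key = "name")
DT2 <- data.table(name = 1:N, value1 = rnorm(N))

system.time(DT1[DT2, value1 - value])               # new default: j evaluated once
system.time(DT1[DT2, value1 - value, by = .EACHI])  # per-i evaluation, as analysed above
```

Under the newer default, the one-step and two-step forms should be much closer in speed on the narrow tables, since neither performs per-row subsetting.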