Why is DT1[DT2][, value1-value] faster than DT1[DT2, value1-value] on data.table with fewer columns?


Question

This is related to another question (Can I access repeated column names in `j` in a data.table join?), which I asked because I had assumed the opposite of what follows was true.

Suppose you wish to join two data.tables and then perform a simple operation on two of the joined columns. This can be done in either one or two calls to `[`:

library(data.table)

N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))
DT2 = data.table(name = 1:N, value1 = rnorm(N))
setkey(DT1, name)

system.time({x = DT1[DT2, value1 - value]})     # One Step

system.time({x = DT1[DT2][, value1 - value]})   # Two Step

It turns out that making two calls (doing the join first, then the subtraction) is noticeably quicker than doing it all in one go.

> system.time({x = DT1[DT2, value1 - value]})
   user  system elapsed 
   0.67    0.00    0.67 
> system.time({x = DT1[DT2][, value1 - value]})
   user  system elapsed 
   0.14    0.01    0.16 

Why is this?

If you put a LOT of columns into the data.tables, you eventually find that the one-step approach is quicker, presumably because data.table only uses the columns you reference in j.

N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
DT2 = data.table(name = 1:N, value1 = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
setkey(DT1, name)
system.time({x = DT1[DT2, value1 - value]})
system.time({x = DT1[DT2][, value1 - value]})

> system.time({x = DT1[DT2, value1 - value]})
   user  system elapsed 
   0.89    0.02    0.90 
> system.time({x = DT1[DT2][, value1 - value]})
   user  system elapsed 
   1.64    0.16    1.81 
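
If the guess about unused columns is right, trimming both tables to the columns that are actually needed before a two-step join should remove most of the extra cost. The following is a minimal sketch of that check, not a benchmark: the smaller `N`, the `slim` variant and the use of `on = "name"` (which assumes data.table 1.9.6 or later) are illustrative choices that are not part of the original question.

```r
library(data.table)

set.seed(1)
N <- 1e4  # kept small here; raise it to reproduce the timings above

# Wide tables: two columns of interest plus 26 filler columns each
DT1 <- data.table(name = 1:N, value = rnorm(N))[, (letters) := pi]
DT2 <- data.table(name = 1:N, value1 = rnorm(N))[, (letters) := pi]
setkey(DT1, name)

# One step: only the columns referenced in j are needed
one_step <- DT1[DT2, value1 - value]

# Two steps: the join first builds every column of both tables
two_step <- DT1[DT2][, value1 - value]

# Two steps on trimmed tables: drop the filler columns before joining
slim <- DT1[, .(name, value)][DT2[, .(name, value1)], on = "name"][, value1 - value]

# Rough timings; the absolute numbers depend on N, the machine and the version
timings <- c(
  one_step = system.time(DT1[DT2, value1 - value])[["elapsed"]],
  two_step = system.time(DT1[DT2][, value1 - value])[["elapsed"]],
  slim     = system.time(
    DT1[, .(name, value)][DT2[, .(name, value1)], on = "name"][, value1 - value]
  )[["elapsed"]]
)
print(timings)
```

All three expressions should return the same vector, so any difference in timing comes from how much data each one has to touch.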


Answer

I think this is due to the repeated subsetting that DT1[DT2, value1-value] performs for every name in DT2. That is, a j operation has to be evaluated for each row of i here, as opposed to just one j operation after the join. With 1e6 unique entries this becomes quite costly: the overhead of [.data.table becomes significant and noticeable.

DT1[DT2][, value1-value] # similar to rowSums
DT1[DT2, value1-value]

In the first case, DT1[DT2][, value1-value], the join is performed once, and it is really fast; j is then evaluated a single time on the joined result. Of course, with more columns, as you show, the join itself gets slower. In the second case, you're grouping DT1 by DT2's name, and for every group you're computing the difference. That is, you're subsetting DT1 for each value of DT2, with one j operation per subset! You can see this better by running the profiler:

Rprof()
t1 <- DT1[DT2, value1-value]
Rprof(NULL)
summaryRprof()

# $by.self
#                self.time self.pct total.time total.pct
# "[.data.table"      0.96    97.96       0.98    100.00
# "-"                 0.02     2.04       0.02      2.04

Rprof()
t2 <- DT1[DT2][, value1-value]
Rprof(NULL)
summaryRprof()

# $by.self
#                self.time self.pct total.time total.pct
# "[.data.table"      0.22    84.62       0.26    100.00
# "-"                 0.02     7.69       0.02      7.69
# "is.unsorted"       0.02     7.69       0.02      7.69
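
The per-row evaluation of j described above can be made visible with a call counter. This is a small sketch under one assumption about versions: in data.table 1.9.4 and later, grouping by each row of i is no longer the default for DT1[DT2, j] and has to be requested with `by = .EACHI`, so that argument is used here to reproduce the behaviour discussed in this answer. The helper `count_diff` and the five-row tables are hypothetical, introduced only for illustration.

```r
library(data.table)

DT1 <- data.table(name = 1:5, value = c(10, 20, 30, 40, 50), key = "name")
DT2 <- data.table(name = 1:5, value1 = c(11, 22, 33, 44, 55))

# Hypothetical helper: subtracts and records how often it was called
calls <- 0L
count_diff <- function(a, b) {
  calls <<- calls + 1L
  a - b
}

# One step, grouped by each row of i: j runs once per row of DT2
res_each <- DT1[DT2, count_diff(value1, value), by = .EACHI]
calls_each <- calls

# Two steps: join once, then a single vectorised j call
calls <- 0L
res_once <- DT1[DT2][, count_diff(value1, value)]
calls_once <- calls

print(c(each = calls_each, once = calls_once))
```

Both forms produce the same differences; what changes is how many times j is evaluated, which is where the overhead attributed to [.data.table in the first profile comes from.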

This overhead from repeated subsetting seems to be outweighed when you have very many columns, and the join over many columns takes over as the time-consuming operation. You can probably check this yourself by profiling the other code.
