Why is transform.data.table so much slower than transform.data.frame?
Question
I have a small data.table, and using transform on it takes forever. Here is a reproducible example:
library(data.table)
#data.table 1.8.8
set.seed(1)
dataraw <- data.table(sig1 = runif(80000, 0, 9999),
sig2 = runif(80000, 0, 9999),
sig3 = runif(80000, 0, 9999))
system.time(transform(dataraw, d = 1))
# user system elapsed
#16.345 0.016 16.359
dataraw2 <- as.data.frame(dataraw)
system.time(transform(dataraw2, d = 1))
# user system elapsed
#0.002 0.002 0.005
Why is transform so slow with a data.table compared with a data.frame?
Recommended answer
Update: This was fixed long ago, in v1.8.10. From NEWS:
o The slowness of transform() on data.table has been fixed, #2599. But, please use :=.
Although it's clear from the documentation, and from ?transform.data.table (as well as from SenorO's post), that the idiomatic way is to use := (assign by reference), which is incredibly fast, I think it's still interesting to know why transform is slower on a data.table. From what I've understood so far, transform.data.table is not always slower.
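For comparison, here is a minimal sketch of the := idiom applied to the data from the question. It assumes the data.table package is installed; the column name d is simply the one used in the question. No timing is shown because it depends on the machine and package version.

```r
library(data.table)

set.seed(1)
dataraw <- data.table(sig1 = runif(80000, 0, 9999),
                      sig2 = runif(80000, 0, 9999),
                      sig3 = runif(80000, 0, 9999))

# Add column d by reference: dataraw is modified in place,
# so no copy of the table is made and no reassignment is needed.
dataraw[, d := 1]

names(dataraw)
```

Because := updates the table in place, the result does not have to be assigned back to dataraw, unlike with transform().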
I'll attempt to answer that here. The problem doesn't seem to be with transform.data.table per se, but rather with its call to the data.table() function. Looking at data.table:::transform.data.table, the lag comes from the line:
ans <- do.call("data.table", c(list(`_data`), e[!matched]))
So, let's benchmark this line with a big data.table whose values are in order:
DT <- data.table(x=1:1e5, y=1:1e5, z=1:1e5)
system.time(do.call("data.table", c(list(DT), list(d=1))))
# user system elapsed
# 0.003 0.003 0.026
Oh, this is extremely fast! Let's benchmark the same call, but with values not in order:
DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5))
system.time(do.call("data.table", c(list(DT), list(d=1))))
# user system elapsed
# 7.986 0.016 8.099
# tested on 1.8.8 and 1.8.9
It gets slow. What's causing this difference? To find out, we'll have to debug the data.table() function. By doing
DT <- data.table(x=as.numeric(1:1e5), y=as.numeric(1:1e5), z=as.numeric(1:1e5))
debugonce(data.table)
transform(DT, d=1)
and hitting Enter repeatedly, you'll find that the slowness comes from the line:
exptxt = as.character(tt) # roughly about 7.2 seconds
It's clear that as.character is the issue. Why? To see, compare:
as.character(data.frame(x=1:10, y=1:10))
# [1] "1:10" "1:10"
as.character(data.frame(x=sample(10), y=sample(10)))
# [1] "c(9, 10, 4, 7, 6, 5, 1, 3, 8, 2)" "c(8, 5, 3, 7, 6, 10, 9, 1, 4, 2)"
Repeat this on bigger data to see that as.character on a sampled data.frame gets slower.
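A rough way to repeat the comparison on bigger data is sketched below, using base R only. The names n, ordered_df and sampled_df are illustrative, and the timings will vary with the machine and R version (newer versions may be much faster than the figures quoted above), so none are shown.

```r
set.seed(1)
n <- 1e5

ordered_df <- data.frame(x = 1:n, y = 1:n)
sampled_df <- data.frame(x = sample(n), y = sample(n))

# An ordered integer column deparses to a short string such as "1:100000".
t_ordered <- system.time(s_ordered <- as.character(ordered_df))

# A shuffled column has to be written out element by element as "c(...)".
t_sampled <- system.time(s_sampled <- as.character(sampled_df))

# Compare the length of the generated strings, then the timings.
nchar(s_ordered)
nchar(s_sampled)
rbind(ordered = t_ordered, sampled = t_sampled)
```

The size of the strings produced is the point here: the sampled columns turn into strings hundreds of thousands of characters long, which is the work as.character has to do.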
The question then becomes, why isn't
data.table(x = sample(1e5), y=sample(1e5))
time-consuming? This is because the input given to the data.table() function is captured unevaluated (with substitute()). In this case, tt becomes:
$x
sample(1e+05)
$y
sample(1e+05)
and as.character(tt) then becomes:
# [1] "sample(1e+05)" "sample(1e+05)"
This means, if you were to do:
DT <- data.table(x = c(1,3,4,1,4,1,3,1,2...), y = c(1,1,4,1,3,4,1,1,3...))
I'd suppose that this would take a LOT of time (but one doesn't usually do this, so it isn't an issue in practice).
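The difference between the two call styles can be sketched with a small stand-in function. capture_args below is hypothetical and is not data.table's actual code; it only mimics the substitute() followed by as.character() step described above.

```r
# Hypothetical helper: capture the arguments unevaluated, then
# convert each captured argument to a string.
capture_args <- function(...) {
  tt <- as.list(substitute(list(...)))[-1L]
  as.character(tt)
}

# Called directly, the argument is still a short expression.
capture_args(x = sample(1e5))
# "sample(1e+05)"

# Called through do.call(), the argument is already an evaluated vector,
# so every element has to be written out.
do.call(capture_args, list(x = c(3, 1, 2)))
# "c(3, 1, 2)"
```

With a 1e5-element vector passed through do.call(), as transform.data.table does, the second form produces a very long string, which is consistent with the slowdown seen in the benchmarks.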