Why is transform.data.table so much slower than transform.data.frame?


Question

I have a small data.table and using transform with it takes forever. Here is a reproducible example:

library(data.table)
#data.table 1.8.8
set.seed(1) 

dataraw <- data.table(sig1 = runif(80000, 0, 9999),
                      sig2 = runif(80000, 0, 9999),
                      sig3 = runif(80000, 0, 9999))

system.time(transform(dataraw, d = 1))
#  user      system     elapsed 
#16.345       0.016      16.359 

dataraw2 <- as.data.frame(dataraw)

system.time(transform(dataraw2, d = 1))
# user      system     elapsed 
#0.002       0.002       0.005 

Why is transform so slow with a data.table in comparison to when used with a data.frame?

Recommended answer

Update: This was fixed long ago, in v1.8.10. From NEWS:


o The slowness of transform() on data.table has been fixed, #2599. But, please use :=.






Although it's clear from the documentation and from ?transform.data.table (from SenorO's post as well) that the idiomatic way is to use := (assign by reference), which is incredibly fast, I think it's still interesting to know why transform is slower on data.table. From what I've managed to comprehend so far, transform.data.table is not always slower.
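
For reference, here is a minimal sketch of the idiomatic alternative mentioned above, adding a column by reference with :=. The table DT and its values are made up for illustration. Only data.table() and := come from the package.

```r
library(data.table)

# Hypothetical small table, for illustration only
DT <- data.table(sig1 = c(0.5, 1.5, 2.5), sig2 = c(10, 20, 30))

# Add a constant column by reference: DT itself is modified, no copy is made
DT[, d := 1]

# Several columns can be added in one call with the functional form of :=
DT[, `:=`(e = sig1 + sig2, f = 2)]

print(names(DT))
```

Because the assignment happens in place, there is no need to write DT <- transform(DT, ...) or to reassign the result at all.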

I'll make an attempt to answer that here. It doesn't seem to be a problem with transform.data.table per se, but rather with its call to the data.table() function. Looking at data.table:::transform.data.table, the lag comes from the line:

ans <- do.call("data.table", c(list(`_data`), e[!matched]))

So, let's benchmark this line with a big data.table with values in order:

DT <- data.table(x=1:1e5, y=1:1e5, z=1:1e5)
system.time(do.call("data.table", c(list(DT), list(d=1))))
   user  system elapsed 
  0.003   0.003   0.026 

Oh this is extremely fast! Let's benchmark the same, but with values not in order:

DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5))
system.time(do.call("data.table", c(list(DT), list(d=1))))

   user  system elapsed 
  7.986   0.016   8.099 

# tested on 1.8.8 and 1.8.9

It gets slow. What's causing this difference? To find out, we'll have to debug the data.table() function. By doing

DT <- data.table(x=as.numeric(1:1e5), y=as.numeric(1:1e5), z=as.numeric(1:1e5))
debugonce(data.table)
transform(DT, d=1)

and by hitting "enter" successively, you'll find that the slowness comes from the line:

exptxt = as.character(tt) # roughly about 7.2 seconds

It's clear that as.character is the issue. Why? To see this, compare:

as.character(data.frame(x=1:10, y=1:10))
# [1] "1:10" "1:10"

as.character(data.frame(x=sample(10), y=sample(10)))
# [1] "c(9, 10, 4, 7, 6, 5, 1, 3, 8, 2)" "c(8, 5, 3, 7, 6, 10, 9, 1, 4, 2)"

Repeat this on bigger data to see that as.character on a sampled data.frame gets slower.
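
A minimal sketch of that comparison on bigger data, using base R only. The size n and the variable names are chosen here for illustration, and the size of the timing gap is an assumption that depends on the R version (the slowness described in this answer was observed around data.table 1.8.8), so the string lengths are the more reliable signal.

```r
# Base R only: compare what as.character() produces for ordered vs. shuffled columns
set.seed(1)
n <- 1e5  # size chosen for illustration

ordered  <- data.frame(x = 1:n, y = 1:n)
shuffled <- data.frame(x = sample(n), y = sample(n))

# An ordered integer column deparses to a short range expression
t_ordered <- system.time(chr_ordered <- as.character(ordered))

# A shuffled column deparses to one long "c(...)" string listing every value
t_shuffled <- system.time(chr_shuffled <- as.character(shuffled))

print(nchar(chr_ordered))   # length of the deparsed string, per column
print(nchar(chr_shuffled))  # far longer: every value is spelled out
print(c(ordered  = t_ordered[["elapsed"]],
        shuffled = t_shuffled[["elapsed"]]))
```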

Now then, the question becomes, why isn't

data.table(x = sample(1e5), y=sample(1e5))

time consuming? This is because the input given to the data.table() function is substituted (with substitute()). In this case, tt becomes:

$x
sample(1e+05)

$y
sample(1e+05)

as.character(tt) then becomes:

# [1] "sample(1e+05)" "sample(1e+05)"

This means, if you were to do:

DT <- data.table(x = c(1,3,4,1,4,1,3,1,2...), y = c(1,1,4,1,3,4,1,1,3...))

I'd expect this to take a LOT of time, since each literal vector would be deparsed to a string in full (one doesn't usually write this, hence no issues).
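
The difference between a direct call and a do.call() call can be reproduced in base R. The sketch below uses a hypothetical helper, show_captured(), which is not part of data.table. It only mimics the substitute() and as.character() steps described above, on the assumption that this is roughly how tt is built.

```r
# Hypothetical helper (not part of data.table): captures its arguments
# unevaluated and converts the captured expressions to character
show_captured <- function(...) {
  tt <- as.list(substitute(list(...)))[-1L]  # drop the leading `list` symbol
  as.character(tt)
}

# Direct call: each argument is captured as a short, unevaluated expression
direct <- show_captured(x = sample(1e5), y = sample(1e5))
print(direct)

# Via do.call(): the arguments are evaluated first, so what gets captured is
# the full vector of values, which deparses to one long string per column
set.seed(1)
indirect <- do.call(show_captured, list(x = sample(1e5), y = sample(1e5)))
print(nchar(indirect))
```

This matches the transform() case: transform.data.table passes an already evaluated table through do.call(), so the deparse step sees the data itself rather than a short expression.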
