Why is transform.data.table so much slower than transform.data.frame?

Published: 2020/10/15 20:57:02 · Tags: r, performance, data.table

Question

I have a small data.table, and using transform with it takes forever. Here is a reproducible example:

library(data.table)
#data.table 1.8.8
set.seed(1) 

dataraw <- data.table(sig1 = runif(80000, 0, 9999),
                      sig2 = runif(80000, 0, 9999),
                      sig3 = runif(80000, 0, 9999))

system.time(transform(dataraw, d = 1))
#  user      system     elapsed 
#16.345       0.016      16.359 

dataraw2 <- as.data.frame(dataraw)

system.time(transform(dataraw2, d = 1))
# user      system     elapsed 
#0.002       0.002       0.005

Why is transform so slow on a data.table compared with a data.frame?

Recommended answer

Update: This was fixed long ago, in v1.8.10. From NEWS:

o The slowness of transform() on data.table has been fixed, #2599. But, please use :=.

It's clear from the documentation and from ?transform.data.table (and from SenorO's post as well) that the idiomatic way is to use := (assign by reference), which is incredibly fast. Still, I think it's interesting to know why transform is slower on a data.table. From what I've managed to work out so far, transform.data.table is not always slower.
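As a minimal sketch of the idiom recommended above, the following contrasts := with transform() on data shaped like the question's example. The object names DT and DT2 and the column names d and e are chosen here for illustration only.

```r
library(data.table)

set.seed(1)
DT <- data.table(sig1 = runif(80000, 0, 9999),
                 sig2 = runif(80000, 0, 9999),
                 sig3 = runif(80000, 0, 9999))

# := adds the column to DT by reference, so DT itself is modified
DT[, d := 1]

# transform() returns a modified copy, so the result has to be assigned
DT2 <- transform(DT, e = 2)

names(DT)   # DT has gained d but not e
names(DT2)  # the copy has both d and e
```

Because := works by reference, there is nothing to reassign; with transform() the original object is left unchanged.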

I'll attempt to answer that here. The problem doesn't seem to lie in transform.data.table per se, but rather in its call to the data.table() function. Looking at data.table:::transform.data.table, the lag comes from this line:

ans <- do.call("data.table", c(list(`_data`), e[!matched]))

So, let's benchmark this line on a big data.table whose values are in order:

DT <- data.table(x=1:1e5, y=1:1e5, z=1:1e5)
system.time(do.call("data.table", c(list(DT), list(d=1))))
   user  system elapsed 
  0.003   0.003   0.026

Oh, this is extremely fast! Let's run the same benchmark, but with values that are not in order:

DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5))
system.time(do.call("data.table", c(list(DT), list(d=1))))

   user  system elapsed 
  7.986   0.016   8.099 

# tested on 1.8.8 and 1.8.9

It gets slow. What's causing this difference? To find out, we have to debug the data.table() function. By doing

DT <- data.table(x=as.numeric(1:1e5), y=as.numeric(1:1e5), z=as.numeric(1:1e5))
debugonce(data.table)
transform(DT, d=1)

and hitting Enter repeatedly, you'll find that the slowness comes from this line:

exptxt = as.character(tt) # roughly about 7.2 seconds

It's clear that as.character is the issue. To see why, compare:

as.character(data.frame(x=1:10, y=1:10))
# [1] "1:10" "1:10"

as.character(data.frame(x=sample(10), y=sample(10)))
# [1] "c(9, 10, 4, 7, 6, 5, 1, 3, 8, 2)" "c(8, 5, 3, 7, 6, 10, 9, 1, 4, 2)"

Repeat this on bigger data to see that as.character gets slower on the sampled data.frame: the ordered columns deparse to a short range such as "1:10", whereas the sampled columns deparse to a string that spells out every value.
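A rough way to repeat the comparison on bigger data is sketched below. The names ordered_df and sampled_df are made up for this illustration, and the timings will depend on the R version in use, so no expected figures are given.

```r
set.seed(1)

# Ordered columns versus shuffled columns of the same length
ordered_df <- data.frame(x = 1:1e5, y = 1:1e5)
sampled_df <- data.frame(x = sample(1e5), y = sample(1e5))

# Time the conversion that data.table() performed on its arguments
system.time(ordered_chr <- as.character(ordered_df))
system.time(sampled_chr <- as.character(sampled_df))

# Length of the resulting strings: the sampled columns spell out every value
nchar(ordered_chr)
nchar(sampled_chr)
```

Comparing the nchar() results shows how much more text has to be generated for the sampled columns, which is where the extra time goes.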

So the question becomes: why isn't

data.table(x = sample(1e5), y=sample(1e5))

time-consuming? Because the input given to the data.table() function is captured unevaluated (with substitute()). In this case, tt becomes:

$x
sample(1e+05)

$y
sample(1e+05)

and as.character(tt) then becomes:

# [1] "sample(1e+05)" "sample(1e+05)"
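The effect of substitute() can be reproduced outside data.table with a small stand-in function. This is only a sketch: capture_args is a hypothetical helper written for this illustration, not the code data.table() actually uses.

```r
# Hypothetical helper: capture the arguments as unevaluated expressions,
# then convert them to character, roughly as described above for tt
capture_args <- function(...) {
  tt <- as.list(substitute(list(...)))[-1L]  # drop the leading `list` symbol
  as.character(tt)
}

# The calls are deparsed as written; sample() is never run here
res <- capture_args(x = sample(1e5), y = sample(1e5))
res
```

Since only the short call expressions are converted, the result should match the two short strings shown above, however long the vectors would be once evaluated.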

This means that if you were to write:

DT <- data.table(x = c(1,3,4,1,4,1,3,1,2...), y = c(1,1,4,1,3,4,1,1,3...))

I'd suppose that this would take a LOT of time (one doesn't usually write this, so it isn't an issue in practice).
