data.table reference semantics: memory usage of iterating through all columns
Problem description
When iterating through all columns in an R data.table using reference semantics, which of the following makes more sense from a memory usage standpoint?
(1) dt[, (all_cols) := lapply(.SD, my_fun)]
or
(2) lapply(colnames(dt), function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]
My question is: in (2), I am forcing data.table to overwrite dt on a column-by-column basis, so I would expect to need extra memory on the order of one column's size. Is this also the case for (1)? Or is all of lapply(.SD, my_fun) evaluated before the original columns are overwritten?
Some sample code to run the above variants:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
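One rough way to probe the question empirically is to run both variants on a larger table and compare the peak memory reported by gc(). The sketch below is only that: peak_mb is a hypothetical helper (not part of data.table), the table size n is an arbitrary assumption, and the figures will vary with the R version, the data.table version and the platform.

```r
library(data.table)

my_fun <- function(x) x + 1

# Hypothetical helper (not part of data.table): peak Vcells usage in Mb
# while evaluating `expr`, as reported by gc().
peak_mb <- function(expr) {
  invisible(gc(reset = TRUE))  # reset the "max used" statistics
  force(expr)                  # evaluate the lazily passed expression
  g <- gc()
  g["Vcells", ncol(g)]         # last column is "max used" in Mb
}

n <- 1e6                       # assumed size, adjust to taste
dt1 <- data.table(a = runif(n), b = runif(n), c = runif(n))
dt2 <- copy(dt1)               # deep copy so both variants start identical

# Variant (1): a single := call over all columns
peak1 <- peak_mb(dt1[, (names(dt1)) := lapply(.SD, my_fun)])

# Variant (2): one := call per column
peak2 <- peak_mb(
  lapply(names(dt2), function(col) dt2[, (col) := my_fun(dt2[[col]])])
)

c(variant1 = peak1, variant2 = peak2)
```

The peaks include everything already allocated in the session (both tables, for instance), so only the difference between the two numbers is informative, not their absolute values.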
Following the suggestion of @Frank, the most memory-efficient way to replace the columns of a data.table one by one, applying a function my_fun to each column, is:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
for (col in all_cols) set(dt, j = col, value = my_fun(dt[[col]]))
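To check that the loop above really updates the table by reference, one can compare the address of dt before and after. This is a minimal sketch using data.table's address() function; the variable names addr_before and addr_after are just for illustration.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1

addr_before <- address(dt)   # memory address of the data.table itself

# Replace each column in turn; only one new column vector exists at a time
for (col in names(dt)) {
  set(dt, j = col, value = my_fun(dt[[col]]))
}

addr_after <- address(dt)

identical(addr_before, addr_after)  # same object, no copy of the whole table
dt$a                                # column now holds my_fun() of the old values
```

Note that my_fun turns the integer columns into doubles here, which is fine because set() without i replaces the whole column vector.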
Variant (1), dt[, (all_cols) := lapply(.SD, my_fun)], is currently (v1.11.4) not handled in the same way as an expression like dt[, lapply(.SD, my_fun)], which is internally optimised to dt[, list(my_fun(a), my_fun(b), ...)], where a, b, ... are the columns in .SD (see ?datatable.optimize). This might change in the future and is being tracked by #1414.
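For illustration, the two forms mentioned above can be written out by hand and compared. This only shows that they return the same result for the sample table; it does not show what data.table does internally, and the names res_lapply and res_list are made up for the example.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1

# The lapply(.SD, ...) form
res_lapply <- dt[, lapply(.SD, my_fun)]

# The expanded list form, written out by hand for columns a and b
res_list <- dt[, list(a = my_fun(a), b = my_fun(b))]

all.equal(res_lapply, res_list)

# The optimisation level is controlled by this option (see ?datatable.optimize)
getOption("datatable.optimize")
```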