data.table reference semantics: memory usage of iterating through all columns


Question

When iterating through all columns of an R data.table using reference semantics, which of the following makes more sense from a memory usage standpoint:

(1) dt[, (all_cols) := lapply(.SD, my_fun)]

or

(2) lapply(colnames(dt), function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]

My question is: in (2), I am forcing data.table to overwrite dt one column at a time, so I would expect the extra memory needed to be on the order of a single column's size. Is this also the case for (1)? Or is all of lapply(.SD, my_fun) evaluated before the original columns are overwritten?

Some sample code to run the above variants:

library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
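
To make the comparison concrete, here is a minimal sketch that runs both variants on separate copies of the sample table and checks that they end up with the same contents. The names dt1 and dt2 are introduced only for this illustration; the sketch shows that the two forms are equivalent in result, and says nothing by itself about their memory use.

```r
library(data.table)

my_fun <- function(x) x + 1

# Variant (1): a single := call over all columns via .SD
dt1 <- data.table(a = 1:10, b = 11:20)
all_cols <- colnames(dt1)
dt1[, (all_cols) := lapply(.SD, my_fun)]

# Variant (2): one := call per column
dt2 <- data.table(a = 1:10, b = 11:20)
invisible(lapply(colnames(dt2), function(col) {
  dt2[, (col) := my_fun(dt2[[col]])]
}))

# Both tables were updated by reference and should hold the same values
all.equal(dt1, dt2)
```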

Solution

Following the suggestion of @Frank, the most memory-efficient way to replace the contents of a data.table column by column, applying a function my_fun to each column, is:

library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)

for (col in all_cols) set(dt, j = col, value = my_fun(dt[[col]]))
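
As a small check on the loop above, the following sketch uses data.table's address() to confirm that set() updates the table in place rather than creating a new table object. The variable names addr_before and addr_after are hypothetical, used only here; this verifies reference semantics for the table itself and is not a measurement of peak memory.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1

# Record where the table lives before the update
addr_before <- address(dt)

# Replace each column in turn; only one new column vector exists at a time
for (col in colnames(dt)) set(dt, j = col, value = my_fun(dt[[col]]))

# The table object itself has not been copied
addr_after <- address(dt)
identical(addr_before, addr_after)
```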

Currently (v1.11.4), this is not handled in the same way as an expression like dt[, lapply(.SD, my_fun)], which is internally optimised to dt[, list(my_fun(a), my_fun(b), ...)], where a, b, ... are the columns in .SD (see ?datatable.optimize). This might change in the future and is being tracked by #1414.
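
One way to explore what data.table does with such an expression on your own installation is to inspect the datatable.optimize option and run the query with verbose = TRUE. This is only a sketch: whether any rewrite of j is reported, and how the message is worded, depends on the data.table version and on the form of the query, so no particular output is assumed here. The name res is introduced just for this example.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1

# Optimisation level currently in effect (see ?datatable.optimize)
opt_level <- getOption("datatable.optimize")
print(opt_level)

# verbose = TRUE asks data.table to print details about how it handled the query
res <- dt[, lapply(.SD, my_fun), verbose = TRUE]
```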
