data.table reference semantics: memory usage of iterating through all columns

Views: 62 Published: 2020/10/15 21:00:13 Tags: r, data.table, pass-by-reference


Problem description

When iterating through all columns in an R data.table using reference semantics, what makes more sense from a memory usage standpoint:

(1) dt[, (all_cols) := lapply(.SD, my_fun)]

(2) lapply(colnames(dt), function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]

My question is: In (2), I am forcing data.table to overwrite dt on a column by column basis, so I would assume to need extra memory on the order of column size. Is this also the case for (1)? Or is all of lapply(.SD, my_fun) evaluated before the original columns are overwritten?

Some sample code to run the above variants:

library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
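As a minimal sketch, the setup above can be extended to run both variants on separate copies and confirm that they produce the same table. The names dt1, dt2 and same_result are introduced here for illustration only; the comparison says nothing about memory use, only that the two forms are interchangeable in their result.

```r
library(data.table)

my_fun <- function(x) x + 1

# Two independent copies so that each variant starts from the same data.
dt1 <- data.table(a = 1:10, b = 11:20)
dt2 <- copy(dt1)
all_cols <- colnames(dt1)

# Variant (1): one := call over all columns via .SD.
dt1[, (all_cols) := lapply(.SD, my_fun)]

# Variant (2): one := call per column.
invisible(lapply(all_cols, function(col) dt2[, (col) := my_fun(dt2[[col]])]))

# Both tables should now hold the same values.
same_result <- isTRUE(all.equal(dt1, dt2))
```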

Solution

Following the suggestion of @Frank, the most memory-efficient way to replace the contents of a data.table column by column, applying a function my_fun to each column, is:

library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)

for (col in all_cols) set(dt, j = col, value = my_fun(dt[[col]]))
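One way to check that the loop above updates the table by reference is to compare the table's memory address before and after, using data.table's address() helper. This is only a sketch: it shows that the data.table object itself is not copied, and it does not measure peak memory. The names addr_before and updated_in_place are illustrative.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)

# Address of the data.table object before the update.
addr_before <- address(dt)

for (col in all_cols) {
  # Only the transformed version of the current column exists as a temporary.
  set(dt, j = col, value = my_fun(dt[[col]]))
}

# TRUE if the table was modified in place rather than copied.
updated_in_place <- identical(address(dt), addr_before)
```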

Variant (1), dt[, (all_cols) := lapply(.SD, my_fun)], is currently (v1.11.4) not handled in the same way as an expression like dt[, lapply(.SD, my_fun)], which is internally optimised to dt[, list(my_fun(a), my_fun(b), ...)], where a, b, ... are the columns in .SD (see ?datatable.optimize). This might change in the future and is being tracked by #1414.
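To make the rewrite mentioned above concrete, the following sketch compares the .SD query form with the explicit list(...) form written out by hand for the two-column example table. It only shows that the two forms return the same result; it does not expose what data.table does internally. The names res_sd, res_list and same_query_result are illustrative.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1

# Query form using .SD (returns a new table, dt is not modified).
res_sd <- dt[, lapply(.SD, my_fun)]

# The explicit per-column form, spelled out for columns a and b.
res_list <- dt[, list(a = my_fun(a), b = my_fun(b))]

# Both queries should return the same table.
same_query_result <- isTRUE(all.equal(res_sd, res_list))
```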
