使用 lapply() 优雅地分配 data.table 中的多个列 [英] Elegantly assigning multiple columns in data.table with lapply()
问题描述
我试图找出一种优雅的方式来使用 :=
赋值通过应用共享函数来一次替换 data.table
中的许多列.一个典型的使用可能是将字符串函数(例如,gsub
)应用于表中的所有字符列.将 data.frame
的方式扩展到 data.table
并不难,但我正在寻找与 data 一致的方法.表
做事方式.
I am trying to figure out an elegant way to use :=
assignment to replace many columns at once in a data.table
by applying a shared function. A typical use of this might be to apply a string function (e.g., gsub
) to all character columns in a table. It is not difficult to extend the data.frame
way of doing this to a data.table
, but I'm looking for a method consistent with the data.table
way of doing things.
例如:
library(data.table)
m <- matrix(runif(10000), nrow = 100)
df <- df1 <- df2 <- df3 <- as.data.frame(m)
dt <- as.data.table(df)
head(names(df))
head(names(dt))
## replace V20-V100 with sqrt
# data.frame approach
# by column numbers
df1[20:100] <- lapply(df1[20:100], sqrt)
# by reference to column numbers
v <- 20:100
df2[v] <- lapply(df2[v], sqrt)
# by reference to column names
n <- paste0("V", 20:100)
df3[n] <- lapply(df3[n], sqrt)
# data.table approach
# by reference to column names
n <- paste0("V", 20:100)
dt[, n] <- lapply(dt[, n, with = FALSE], sqrt)
我知道使用 :=
循环遍历列名向量来分配更有效:
I understand it is more efficient to loop over a vector of column names using :=
to assign:
for (col in paste0("V", 20:100)) dt[, col := sqrt(dt[[col]]), with = FALSE]
我不喜欢这样,因为我不喜欢在 j
表达式中引用 data.table
.我也知道我可以使用 :=
来分配 lapply
,因为我知道列名:
I don't like this because I don't like reference the data.table
in a j
expression. I also know that I can use :=
to assign with lapply
given that I know the column names:
dt[, c("V20", "V30", "V40", "V50", "V60") := lapply(list(V20, V30, V40, V50, V60), sqrt)]
(您可以通过构建具有未知列名的表达式来扩展它.)
(You could extend this by building an expression with unknown column names.)
以下是我尝试过的想法,但我无法让它们发挥作用.我犯了一个错误,还是我错过了另一种方法?
Below are the ideas I tried on this, but I wasn't able to get them to work. Am I making a mistake, or is there another approach I'm missing?
# possible data.table approaches?
# by reference to column names; assignment works, but not lapply
n <- paste0("V", 20:100)
dt[, n := lapply(n, sqrt), with = FALSE]
# by (smaller for example) list; lapply works, but not assignment
dt[, list(list(V20, V30, V40, V50, V60)) := lapply(list(V20, V30, V40, V50, V60), sqrt)]
# by reference to list; neither assignment nor lapply work
l <- parse(text = paste("list(", paste(paste0("V", 20:100), collapse = ", "), ")"))
dt[, eval(l) := lapply(eval(l), sqrt)]
推荐答案
是的,你的问题在这里:
Yes, you're right in question here :
我知道使用 :=
循环遍历列名向量来分配更有效:
I understand it is more efficient to loop over a vector of column names using
:=
to assign:
for (col in paste0("V", 20:100))dt[, col := sqrt(dt[[col]]), with = FALSE]
旁白:请注意,新的做法是:
Aside: note that the new way of doing that is :
for (col in paste0("V", 20:100))
dt[ , (col) := sqrt(dt[[col]])]
因为 with = FALSE
无论是指 :=
的 LHS 还是 RHS 都不容易阅读.放在一边.
because the with = FALSE
wasn't easy to read whether it referred to the LHS or the RHS of :=
. End aside.
如您所知,这很有效,因为它会一一处理每一列,因此一次只需要一列的工作内存.这可能会在它正常工作和因可怕的内存不足错误而失败之间产生差异.
As you know, that's efficient because that does each column one by one, so working memory is only needed for one column at a time. That can make a difference between it working and it failing with the dreaded out-of-memory error.
:=
的 RHS 上 lapply
的问题是先评估 RHS(lapply
);即,创建 80 列的结果.那是 80 列的新内存,必须分配和填充.因此,您需要 80 列的可用 RAM 才能使该操作成功.与随后将这 80 个新列分配(plonking)到 data.table 的列指针槽的即时操作相比,RAM 使用占主导地位.
The problem with lapply
on the RHS of :=
is that the RHS (the lapply
) is evaluated first; i.e., the result for the 80 columns is created. That's 80 column's worth of new memory which has to be allocated and populated. So you need 80 column's worth of free RAM for that operation to succeed. That RAM usage dominates vs the subsequently instant operation of assigning (plonking) those 80 new columns into the data.table's column pointer slots.
正如@Frank 所指出的,如果您有很多列(例如 10,000 或更多),那么调度到 [.data.table
方法的小开销开始累加).为了消除这种开销,在 ?set
下的 data.table::set
被描述为可循环的":=
.我对这种类型的操作使用 for
循环.这是最快的方式,而且相当容易读写.
As @Frank pointed to, if you have a lot of columns (say 10,000 or more) then the small overhead of dispatching to the [.data.table
method starts to add up). To eliminate that overhead that there is data.table::set
which under ?set
is described as a "loopable" :=
. I use a for
loop for this type of operation. It's the fastest way and is fairly easy to write and read.
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(dt[[col]]))
虽然只有 80 列,但这不太重要.(请注意,循环 set
在大量行上可能比在大量列上更常见.)但是,循环 set
并不能解决问题重复引用您在问题中提到的 dt
符号名称:
Although with just 80 columns, it's unlikely to matter. (Note it may be more common to loop set
over a large number of rows than a large number of columns.) However, looped set
doesn't solve the problem of the repeated reference to the dt
symbol name that you mentioned in the question :
我不喜欢这样,因为我不喜欢在 j 表达式中引用 data.table.
I don't like this because I don't like reference the data.table in a j expression.
同意.所以我能做的最好的就是恢复到 :=
的循环,但使用 get
代替.
Agreed. So the best I can do is revert to your looping of :=
but use get
instead.
for (col in paste0("V", 20:100))
dt[, (col) := sqrt(get(col))]
但是,我担心在 j
中使用 get
会产生开销.在 #1380 中进行了基准测试.此外,在 RHS 上使用 get()
而在 LHS 上使用可能会造成混淆.为了解决这个问题,我们可以添加 LHS 并允许 get()
,#1381 :
However, I fear that using get
in j
carry an overhead. Benchmarking made in #1380. Also, perhaps it is confusing to use get()
on the RHS but not on the LHS. To address that we could sugar the LHS and allow get()
as well, #1381 :
for (col in paste0("V", 20:100))
dt[, get(col) := sqrt(get(col))]
另外,也许set
的value
可以在DT
的范围内运行,#1382.
Also, maybe value
of set
could be run within scope of DT
, #1382.
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(get(col))
这篇关于使用 lapply() 优雅地分配 data.table 中的多个列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!