使用 lapply() 优雅地分配 data.table 中的多个列 [英] Elegantly assigning multiple columns in data.table with lapply()

查看:14
本文介绍了使用 lapply() 优雅地分配 data.table 中的多个列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图找出一种优雅的方式来使用 := 赋值通过应用共享函数来一次替换 data.table 中的许多列.一个典型的使用可能是将字符串函数(例如,gsub)应用于表中的所有字符列.将 data.frame 的方式扩展到 data.table 并不难,但我正在寻找与 data 一致的方法.表做事方式.

I am trying to figure out an elegant way to use := assignment to replace many columns at once in a data.table by applying a shared function. A typical use of this might be to apply a string function (e.g., gsub) to all character columns in a table. It is not difficult to extend the data.frame way of doing this to a data.table, but I'm looking for a method consistent with the data.table way of doing things.

例如:

library(data.table)

m <- matrix(runif(10000), nrow = 100)
df <- df1 <- df2 <- df3 <- as.data.frame(m)
dt <- as.data.table(df)
head(names(df))
head(names(dt))

## replace V20-V100 with sqrt

# data.frame approach
# by column numbers
df1[20:100] <- lapply(df1[20:100], sqrt)
# by reference to column numbers
v <- 20:100
df2[v] <- lapply(df2[v], sqrt)
# by reference to column names
n <- paste0("V", 20:100)
df3[n] <- lapply(df3[n], sqrt)

# data.table approach
# by reference to column names
n <- paste0("V", 20:100)
dt[, n] <- lapply(dt[, n, with = FALSE], sqrt)

我知道使用 := 循环遍历列名向量来分配更有效:

I understand it is more efficient to loop over a vector of column names using := to assign:

for (col in paste0("V", 20:100)) dt[, col := sqrt(dt[[col]]), with = FALSE]

我不喜欢这样,因为我不喜欢在 j 表达式中引用 data.table.我也知道我可以使用 := 来分配 lapply ,因为我知道列名:

I don't like this because I don't like reference the data.table in a j expression. I also know that I can use := to assign with lapply given that I know the column names:

dt[, c("V20", "V30", "V40", "V50", "V60") := lapply(list(V20, V30, V40, V50, V60), sqrt)]

(您可以通过构建具有未知列名的表达式来扩展它.)

(You could extend this by building an expression with unknown column names.)

以下是我尝试过的想法,但我无法让它们发挥作用.我犯了一个错误,还是我错过了另一种方法?

Below are the ideas I tried on this, but I wasn't able to get them to work. Am I making a mistake, or is there another approach I'm missing?

# possible data.table approaches?
# by reference to column names; assignment works, but not lapply
n <- paste0("V", 20:100)
dt[, n := lapply(n, sqrt), with = FALSE]
# by (smaller for example) list; lapply works, but not assignment
dt[, list(list(V20, V30, V40, V50, V60)) := lapply(list(V20, V30, V40, V50, V60), sqrt)]
# by reference to list; neither assignment nor lapply work
l <- parse(text = paste("list(", paste(paste0("V", 20:100), collapse = ", "), ")"))
dt[, eval(l) := lapply(eval(l), sqrt)]

推荐答案

是的,你的问题在这里:

Yes, you're right in question here :

我知道使用 := 循环遍历列名向量来分配更有效:

I understand it is more efficient to loop over a vector of column names using := to assign:

for (col in paste0("V", 20:100))dt[, col := sqrt(dt[[col]]), with = FALSE]

旁白:请注意,新的做法是:

Aside: note that the new way of doing that is :

for (col in paste0("V", 20:100))
  dt[ , (col) := sqrt(dt[[col]])]

因为 with = FALSE 无论是指 := 的 LHS 还是 RHS 都不容易阅读.放在一边.

because the with = FALSE wasn't easy to read whether it referred to the LHS or the RHS of :=. End aside.

如您所知,这很有效,因为它会一一处理每一列,因此一次只需要一列的工作内存.这可能会在它正常工作和因可怕的内存不足错误而失败之间产生差异.

As you know, that's efficient because that does each column one by one, so working memory is only needed for one column at a time. That can make a difference between it working and it failing with the dreaded out-of-memory error.

:= 的 RHS 上 lapply 的问题是先评估 RHS(lapply);即,创建 80 列的结果.那是 80 列的新内存,必须分配和填充.因此,您需要 80 列的可用 RAM 才能使该操作成功.与随后将这 80 个新列分配(plonking)到 data.table 的列指针槽的即时操作相比,RAM 使用占主导地位.

The problem with lapply on the RHS of := is that the RHS (the lapply) is evaluated first; i.e., the result for the 80 columns is created. That's 80 column's worth of new memory which has to be allocated and populated. So you need 80 column's worth of free RAM for that operation to succeed. That RAM usage dominates vs the subsequently instant operation of assigning (plonking) those 80 new columns into the data.table's column pointer slots.

正如@Frank 所指出的,如果您有很多列(例如 10,000 或更多),那么调度到 [.data.table 方法的小开销开始累加).为了消除这种开销,在 ?set 下的 data.table::set 被描述为可循环的":=.我对这种类型的操作使用 for 循环.这是最快的方式,而且相当容易读写.

As @Frank pointed to, if you have a lot of columns (say 10,000 or more) then the small overhead of dispatching to the [.data.table method starts to add up). To eliminate that overhead that there is data.table::set which under ?set is described as a "loopable" :=. I use a for loop for this type of operation. It's the fastest way and is fairly easy to write and read.

for (col in paste0("V", 20:100))
  set(dt, j = col, value = sqrt(dt[[col]]))

虽然只有 80 列,但这不太重要.(请注意,循环 set 在大量行上可能比在大量列上更常见.)但是,循环 set 并不能解决问题重复引用您在问题中提到的 dt 符号名称:

Although with just 80 columns, it's unlikely to matter. (Note it may be more common to loop set over a large number of rows than a large number of columns.) However, looped set doesn't solve the problem of the repeated reference to the dt symbol name that you mentioned in the question :

我不喜欢这样,因为我不喜欢在 j 表达式中引用 data.table.

I don't like this because I don't like reference the data.table in a j expression.

同意.所以我能做的最好的就是恢复到 := 的循环,但使用 get 代替.

Agreed. So the best I can do is revert to your looping of := but use get instead.

for (col in paste0("V", 20:100))
  dt[, (col) := sqrt(get(col))]

但是,我担心在 j 中使用 get 会产生开销.在 #1380 中进行了基准测试.此外,在 RHS 上使用 get() 而在 LHS 上使用可能会造成混淆.为了解决这个问题,我们可以添加 LHS 并允许 get()#1381 :

However, I fear that using get in j carry an overhead. Benchmarking made in #1380. Also, perhaps it is confusing to use get() on the RHS but not on the LHS. To address that we could sugar the LHS and allow get() as well, #1381 :

for (col in paste0("V", 20:100))
  dt[, get(col) := sqrt(get(col))]

另外,也许setvalue可以在DT的范围内运行,#1382.

Also, maybe value of set could be run within scope of DT, #1382.

for (col in paste0("V", 20:100))
  set(dt, j = col, value = sqrt(get(col))

这篇关于使用 lapply() 优雅地分配 data.table 中的多个列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆