Why does selecting column(s) from a data.table result in a copy?


Question

It appears that selecting column(s) from a data.table with [.data.table results in a copy of the underlying vector(s). I am talking about very simple column selection by name: there are no expressions to compute in j and no rows to subset in i. Even more strangely, column subsetting in a data.frame appears not to make any copies. I am using data.table version 1.10.4. A simple example with details and benchmarks is provided below. My questions are:


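  • Am I doing something wrong?

  • Is this a bug or the intended behavior?

  • If this is intended, what is the best way to subset a data.table by columns and avoid the extra copy?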

The intended use case involves large datasets, so avoiding extra copies is a must (especially since base R seems to already support this).

library(data.table)
set.seed(12345)
cpp_dt <- data.table(a = runif(1e6), b = rnorm(1e6), c = runif(1e6))
cols=c("a","c")

## naive / data.frame style of column selection
## leads to a copy of the column vectors in cols
subset_cols_1=function(dt,cols){
  return(dt[,cols,with=FALSE])
}

## alternative syntax, still results in a copy
subset_cols_2=function(dt,cols){
  return(dt[,..cols])
}

## work-around that uses data.frame column selection,
## appears to avoid the copy
subset_cols_3=function(dt,cols){
  setDF(dt)
  subset=dt[,cols]
  setDT(subset)
  setDT(dt)
  return(subset)
}

## another approach that makes a "shallow" copy of the data.table
## then NULLs the not needed columns by reference
## appears to also avoid the copy
subset_cols_4=function(dt,cols){
  subset=dt[TRUE]
  other_cols=setdiff(names(subset),cols)
  set(subset,j=other_cols,value=NULL)
  return(subset)
}

subset_1=subset_cols_1(cpp_dt,cols)
subset_2=subset_cols_2(cpp_dt,cols)
subset_3=subset_cols_3(cpp_dt,cols)
subset_4=subset_cols_4(cpp_dt,cols)

Now let's look at the memory allocation and compare it with the original data.

.Internal(inspect(cpp_dt)) # original data, keep an eye on the 1st and 3rd vectors
# @7fe8ba278800 19 VECSXP g1c7 [OBJ,MARK,NAM(2),ATT] (len=3, tl=1027)
#   @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @10f1a3000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) -0.947317,-0.636669,0.167872,-0.206986,0.411445,...
#   @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]

Using the [.data.table method to subset the columns:

.Internal(inspect(subset_1)) # looks like data.table is making a copy
# @7fe8b9f3b800 19 VECSXP g0c7 [OBJ,NAM(1),ATT] (len=2, tl=1026)
#   @114cb0000 14 REALSXP g0c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @1121ca000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]

Another syntax that still uses [.data.table and still makes a copy:

.Internal(inspect(subset_2)) # same, still copy
# @7fe8b6402600 19 VECSXP g0c7 [OBJ,NAM(1),ATT] (len=2, tl=1026)
#   @115452000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @1100e7000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]

Using a sequence of setDF, followed by [.data.frame and setDT. Look, the vectors a and c are no longer copied! Does this mean the base R method is more efficient and has a smaller memory footprint?

.Internal(inspect(subset_3)) # "[.data.frame" is not making a copy!!
# @7fe8b633f400 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=1026)
#   @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]

Another approach is to make a shallow copy of the data.table, then set all the extra columns to NULL by reference in the new data.table. Again, no copies are made.

.Internal(inspect(subset_4)) # 4th approach seems to also avoid the copy
# @7fe8b924d800 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=1027)
#   @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
#   @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
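The address comparison done by eye above can also be done programmatically. The following is a minimal sketch, assuming data.table's exported address() function; shares_column is a hypothetical helper introduced only for this illustration, and the table is a small stand-in for cpp_dt.

```r
library(data.table)

DT <- data.table(a = runif(5), b = rnorm(5), c = runif(5))
cols <- c("a", "c")

# Hypothetical helper: TRUE when column `col` of x and y is the
# very same vector in memory (same address), FALSE otherwise.
shares_column <- function(x, y, col) {
  address(x[[col]]) == address(y[[col]])
}

alias    <- DT              # plain assignment: same object, nothing copied
deep     <- copy(DT)        # explicit deep copy of every column
selected <- DT[, ..cols]    # plain column selection with [.data.table

shares_column(DT, alias, "a")     # TRUE: both names point at the same column
shares_column(DT, deep, "a")      # FALSE: copy() allocated a new vector
shares_column(DT, selected, "a")  # FALSE whenever the selection copied the column
```

The last call is the programmatic counterpart of comparing the addresses in the inspect() output for cpp_dt and subset_1.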

Now let's look at the benchmarks of these four approaches. It looks like "[.data.frame" (subset_cols_3) is the clear winner.

library(microbenchmark)
microbenchmark({subset_cols_1(cpp_dt,cols)},
               {subset_cols_2(cpp_dt,cols)},
               {subset_cols_3(cpp_dt,cols)},
               {subset_cols_4(cpp_dt,cols)},
               times=100)

# Unit: microseconds
#                                 expr      min        lq      mean   median        uq       max neval
#  {     subset_cols_1(cpp_dt, cols) } 4772.092 5128.7395 8956.7398 7149.447 10189.397 53117.358   100
#  {     subset_cols_2(cpp_dt, cols) } 4705.383 5107.1690 8977.1816 6680.666  9206.164 53523.191   100
#  {     subset_cols_3(cpp_dt, cols) }  148.659  177.9595  285.4926  250.620   283.414  4422.968   100
#  {     subset_cols_4(cpp_dt, cols) }  193.912  241.9010  531.8308  336.467   384.844 20061.864   100


Recommended answer

It's been a while since I thought about this, but here goes.

Good question. But why do you need to subset a data.table like that? We really need to see what you are doing next: the bigger picture. It is for that bigger picture that data.table probably has a different approach from the base R idiom.

Roughly illustrating with what is probably a bad example:

DT[region=="EU", lapply(.SD, sum), .SDcols=10:20]

rather than the base R idiom of taking a subset and then doing the next thing (here, apply) on the result outside:

apply(DT[DT$region=="EU", 10:20], 2, sum)

In general, we want to encourage doing as much as possible inside one [...] so that data.table sees i, j and by together in a single [...] operation and can optimize the combination. When you subset columns and then do the next thing outside afterwards, optimizing requires more software complexity. In most cases, most of the computational cost is inside the first [...], which reduces the data to a relatively insignificant size.
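As a small runnable illustration of the two idioms, here is a sketch on a made-up table. The column names x and y and the data are assumptions for the example only, and colSums stands in for the apply(..., 2, sum) call.

```r
library(data.table)

# Toy data, invented for this example.
DT <- data.table(
  region = c("EU", "US", "EU", "ASIA"),
  x      = c(1, 2, 3, 4),
  y      = c(10, 20, 30, 40)
)
num_cols <- c("x", "y")

# data.table idiom: row filter (i), computation (j) and column choice
# (.SDcols) are all visible to data.table in one [...] call.
inside <- DT[region == "EU", lapply(.SD, sum), .SDcols = num_cols]

# Base R idiom: take the subset first, then compute on the result outside.
df <- as.data.frame(DT)
outside <- colSums(df[df$region == "EU", num_cols])

inside
outside
```

Both forms give the same sums; the difference is that the first never materializes the intermediate subset as a separate object in user code.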

With that said, in addition to Frank's comment about shallow, we're also waiting to see how the ALTREP project pans out. That improves reference counting in base R and may enable := to know reliably whether a column it is operating on needs to be copied on write first. Currently, := always updates by reference, so it would update both data.tables if selecting some whole columns did not take a deep copy (it is deliberate that it does copy, for that reason). If := is not used inside [...], then [...] always returns a new result which is safe to use := on, which is quite a straightforward rule currently. That holds even if all you're doing is selecting a few whole columns for some reason.
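The rule described here can be seen on a tiny table. This is a minimal sketch with made-up data: the first part shows := updating an object that is shared between two names, and the second shows that a column selection is an independent result.

```r
library(data.table)

DT <- data.table(a = c(1, 2, 3), b = c(4, 5, 6))

# Plain assignment does not copy: both names refer to the same data.table,
# so := through one name is visible through the other.
alias <- DT
alias[, a := a * 10]
DT$a    # a was multiplied by 10 in DT as well

# Column selection with [...] returns a new result, so := on it
# leaves the original table untouched.
sel <- DT[, "b", with = FALSE]
sel[, b := 0]
DT$b    # unchanged
```

If column selection shared the underlying vectors, the second := would have to detect that and copy first, which is the copy-on-write knowledge the answer says := currently lacks.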

We really need to see the bigger picture, please: what are you doing afterwards with the subset of columns? Having that clear would help to raise the priority of either investigating ALTREP or perhaps doing our own reference counting for this case.
