Why does data.table update names(DT) by reference, even if I assign to another variable?

Question

I've stored the names of a data.table as a vector:

library(data.table)
set.seed(42)
DT <- data.table(x = runif(100), y = runif(100))
names1 <- names(DT)

As far as I can tell, it's a plain vanilla character vector:

str(names1)
# chr [1:2] "x" "y"

class(names1)
# [1] "character"

dput(names1)
# c("x", "y")

However, this is no ordinary character vector. It's a magic character vector! When I add a new column to my data.table, this vector gets updated!

DT[ , z := runif(100)]
names1
# [1] "x" "y" "z"

I know this has something to do with how := updates by reference, but it still seems like magic to me, because I expect <- to make a copy of the data.table's names.

I can fix this by wrapping the names in c():

library(data.table)
set.seed(42)
DT <- data.table(x = runif(100), y = runif(100))

names1 <- names(DT)
names2 <- c(names(DT))
all.equal(names1, names2)
# [1] TRUE

DT[ , z := runif(100)]
names1
# [1] "x" "y" "z"

names2
# [1] "x" "y"

My question is two-fold:

  1. Why doesn't names1 <- names(DT) create a copy of the data.table's names? In other instances, we are explicitly warned that <- creates copies, both of data.tables and data.frames.
  2. What's the difference between names1 <- names(DT) and names2 <- c(names(DT))?

Solution

Update: This is now covered in the documentation for ?copy as of version 1.9.3. From NEWS:

  1. Moved ?copy to its own help page, and documented that dt_names <- copy(names(DT)) is necessary for dt_names to be not modified by reference as a result of updating DT by reference (e.g. adding a new column by reference). Closes #512. Thanks to Zach for this SO question and user1971988 for this SO question.
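As a minimal sketch of the pattern the NEWS item describes: the variable names names_ref and names_copy and the small example table are made up for illustration, and the comment on names_ref reflects the behaviour reported in the question, which may depend on the data.table and R versions in use.

```r
library(data.table)

DT <- data.table(x = 1:3, y = 4:6)

names_ref  <- names(DT)        # plain assignment; may keep pointing at DT's own names vector
names_copy <- copy(names(DT))  # explicit copy; an independent snapshot of the names

DT[, z := 7:9]                 # add a column by reference

print(names(DT))    # DT itself now has three column names
print(names_ref)    # may also show "z", as reported in the question
print(names_copy)   # the snapshot still holds only the original two names
```

Taking the snapshot with copy() before modifying DT is what keeps names_copy stable, whatever happens to names_ref.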


Part of your first question leaves me unsure what you really mean about the <- operator (at least in the context of data.table), especially this part: "In other instances, we are explicitly warned that <- creates copies, both of data.tables and data.frames."

So, before answering your actual question, I'll briefly touch on it here. For a data.table, assignment with <- alone is not sufficient to copy it. For example:

DT <- data.table(x = 1:5, y = 6:10)
# assign DT to DT2
DT2 <- DT # assigned by reference, no copy taken
DT2[, z := 11:15]
# DT now has the z column as well

If you want to create a copy, you have to ask for it explicitly with copy():

DT2 <- copy(DT) # copied content to DT2
DT2[, z := 11:15] # only DT2 is affected
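The two cases above can be compared side by side. This is a small sketch rather than anything from the original answer: DT3, the column w, and shares_memory are illustrative names, and address() is data.table's helper for printing where an object lives in memory.

```r
library(data.table)

DT  <- data.table(x = 1:5, y = 6:10)
DT2 <- DT         # a second name for the same table, no copy taken
DT3 <- copy(DT)   # an independent copy

# Compare memory addresses: DT2 shares DT's address, DT3 does not
shares_memory <- c(DT2 = address(DT2) == address(DT),
                   DT3 = address(DT3) == address(DT))
print(shares_memory)

DT3[, z := 11:15]   # modifies only the copy
DT2[, w := 1:5]     # modifies DT as well, because DT2 is the same object

print(names(DT))    # x, y, w
print(names(DT3))   # x, y, z
```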

From CauchyDistributedRV's comment, I understand that what you mean is the assignment names(dt) <- ., which results in the warning. I'll leave the section above as it is.


Now, to answer your first question: It seems that names1 <- names(DT) also behaves similarly. I hadn't thought/known about this until now. The .Internal(inspect(.)) command is very useful here:

.Internal(inspect(names1))
# @7fc86a851480 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
#   @7fc86a069f68 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "x"
#   @7fc86a0f96d8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "y"

.Internal(inspect(names(DT)))
# @7fc86a851480 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
#   @7fc86a069f68 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "x"
#   @7fc86a0f96d8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "y"

Here, you can see that both point to the same memory location, @7fc86a851480. Even the truelength of names1 is 100, which is the over-allocation data.table uses by default (see ?alloc.col).

truelength(names1)
# [1] 100

So basically, the assignment names1 <- names(DT) seems to happen by reference. That is, names1 points to the same location as DT's column names pointer.

To answer your second question: c(.) seems to create a copy because it does not check whether the result of the concatenation actually differs from its input. That is, since c(.) can change the contents of a vector, it makes a "copy" straight away, without checking whether the contents were modified or not.
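One way to check this without .Internal(inspect(.)) is to compare memory addresses directly. The sketch below is an illustration, not part of the original answer; it assumes data.table's address() helper, and names_ref, names_c, names_copy and same_as_dt are made-up names. Whether the plain assignment shares memory may depend on your R and data.table versions, so only the c() and copy() results should be expected to come out FALSE.

```r
library(data.table)

DT <- data.table(x = 1:3, y = 4:6)

names_ref  <- names(DT)         # plain assignment
names_c    <- c(names(DT))      # c() builds a new vector
names_copy <- copy(names(DT))   # explicit copy

# TRUE means the variable sits at the same address as DT's names vector
same_as_dt <- c(
  ref  = address(names_ref)  == address(names(DT)),
  c    = address(names_c)    == address(names(DT)),
  copy = address(names_copy) == address(names(DT))
)
print(same_as_dt)
```

Both c() and copy() break the link to DT, but copy() states the intent more clearly and is the form the ?copy help page recommends.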
