Adding new columns to a data.table by-reference within a function not always working
Problem description
In writing a package which relies on data.table, I've discovered some odd behavior. I have a function which removes and reorders some columns by-reference, and it works just fine, meaning the data.table I passed in was modified without assigning the function output. I have another function which adds new columns, but those changes do not always persist in the data.table which was passed in.
Here is a small example:
library(data.table) # I'm using 1.9.4
test <- data.table(id = letters[1:2], val=1:2)
foobar <- function(dt, col) {
dt[, (col) := 1]
invisible(dt)
}
test
# id val
#1: a 1
#2: b 2
saveRDS(test, "test.rds")
test2 <- readRDS("test.rds")
all.equal(test, test2)
#[1] TRUE
foobar(test, "new")
test
# id val new
#1: a 1 1
#2: b 2 1
foobar(test2, "new")
test2
# id val
#1: a 1
#2: b 2
What happened? What's different about test2? I can modify existing columns in-place on either:
foobar(test, "val")
test
# id val new
#1: a 1 1
#2: b 1 1
foobar(test2, "val")
test2
# id val
#1: a 1
#2: b 1
Adding a new column to test2 still doesn't work:
foobar(test2, "someothercol")
.Last.value
# id val someothercol
#1: a 1 1
#2: b 1 1
test2
# id val
#1: a 1
#2: b 1
I can't pin down all the cases where I see this behavior, but saving to and reading from RDS is the first case I can reliably replicate. Writing to and reading from a CSV doesn't seem to have the same problem.
Is this a pointer issue à la this issue, where serializing a data.table destroys the over-allocated pointers? Is there a simple way to restore them? How could I check for them inside my function, so that I could restore the pointers, or raise an error if the operation isn't going to work?
I know I can assign the function output as a workaround, but that's not very data.table-y. Wouldn't that also create a temporary copy in memory?
Arun has pointed out that it is indeed a pointer issue, which can be diagnosed with truelength and fixed with setDT or alloc.col. I ran into a problem encapsulating his solution in a function (continuing from the code above):
func <- function(dt) {if (!truelength(dt)) setDT(dt)}
func2 <- function(dt) {if (!truelength(dt)) alloc.col(dt)}
test2 <- readRDS("test.rds")
truelength(test2)
#[1] 0
truelength(func(test2))
#[1] 100
truelength(test2)
#[1] 0
truelength(func2(test2))
#[1] 100
truelength(test2)
#[1] 0
So it looks like the local copy inside the function is being properly modified, but the referenced version is not. Why not?
Recommended answer
Is this a pointer issue à la this issue, where serializing a data.table destroys the over-allocated pointers?
Yes, loading from disk sets the external pointer to NULL. We will have to over-allocate again.
Is there a simple way to restore them?
Yes. You can test the truelength() of the data.table, and if it's 0, use setDT() or alloc.col() on it.
truelength(test2) # [1] 0
if (!truelength(test2))
setDT(test2)
truelength(test2) # [1] 100
foobar(test2, "new")
test2[]
# id val new
# 1: a 1 1
# 2: b 2 1
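On the follow-up in the question (func and func2 leaving test2 at truelength 0): a plausible explanation is that over-allocation cannot grow the column-pointer vector in place, so alloc.col() hands back a newly allocated shallow copy, and inside a function only the local name dt gets rebound. Under that assumption, the fix has to be assigned back in the caller. The sketch below illustrates this; ensure_overallocated() and rds_path are hypothetical names introduced here and are not part of data.table.

```r
library(data.table)

# foobar() is copied from the question.
foobar <- function(dt, col) {
  dt[, (col) := 1]
  invisible(dt)
}

# Hypothetical helper, not part of data.table: returns dt with an
# over-allocated column-pointer vector, re-allocating only when needed.
ensure_overallocated <- function(dt) {
  stopifnot(is.data.table(dt))
  if (truelength(dt) == 0L) {
    dt <- alloc.col(dt)  # rebinds only the local name dt
  }
  dt
}

rds_path <- tempfile(fileext = ".rds")
saveRDS(data.table(id = letters[1:2], val = 1:2), rds_path)
test2 <- readRDS(rds_path)

# The assignment is what updates the caller's variable.
test2 <- ensure_overallocated(test2)
foobar(test2, "new")
names(test2)
```

Calling setDT(test2) directly at the point where the object is loaded, as shown above, achieves the same thing without a helper. As far as I know the re-allocation is a shallow copy, so the column data itself should not be duplicated, only the vector of column pointers.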
This should probably go in as a FAQ. (It is already in the FAQ, in the Warning Messages section.)