Adding new columns to a data.table by-reference within a function not always working


Question

While writing a package that relies on data.table, I discovered some odd behavior. I have a function that removes and reorders some columns by reference, and it works just fine: the data.table I pass in is modified without my assigning the function output. I have another function that adds new columns, however, and those changes do not always persist in the data.table that was passed in.

Here is a small example:

library(data.table)  # I'm using 1.9.4
test <- data.table(id = letters[1:2], val=1:2)
foobar <- function(dt, col) {
    dt[, (col) := 1]
    invisible(dt)
}

test
#  id val
#1: a   1
#2: b   2
saveRDS(test, "test.rds")
test2 <- readRDS("test.rds")
all.equal(test, test2)
#[1] TRUE
foobar(test, "new")
test
#  id val new
#1: a   1   1
#2: b   2   1
foobar(test2, "new")
test2
#  id val
#1: a   1
#2: b   2

What happened? What's different about test2? I can modify existing columns in place on either table:

foobar(test, "val")
test
#  id val new
#1: a   1   1
#2: b   1   1
foobar(test2, "val")
test2
#  id val
#1: a   1
#2: b   1

But adding a column to test2 still doesn't work:

foobar(test2, "someothercol")
.Last.value
#  id val someothercol
#1: a   1            1
#2: b   1            1
test2
#  id val
#1: a   1
#2: b   1

I can't pin down all the cases where I see this behavior, but saving to and reading from RDS is the first case I can reliably reproduce. Writing to and reading from a CSV doesn't seem to have the same problem.

Is this a pointer issue à la this issue, where serializing a data.table destroys the over-allocated pointers? Is there a simple way to restore them? How could I check for them inside my function, so that I could restore the pointers, or raise an error if the operation isn't going to work?

I know I can assign the function output as a workaround, but that's not very data.table-y. Wouldn't that also create a temporary copy in memory?

Arun has pointed out that it is indeed a pointer issue, which can be diagnosed with truelength and fixed with setDT or alloc.col. I ran into a problem encapsulating his solution in a function (continuing from the code above):

func <- function(dt) {if (!truelength(dt)) setDT(dt)}
func2 <- function(dt) {if (!truelength(dt)) alloc.col(dt)}
test2 <- readRDS("test.rds")
truelength(test2)
#[1] 0
truelength(func(test2))
#[1] 100
truelength(test2)
#[1] 0
truelength(func2(test2))
#[1] 100
truelength(test2)
#[1] 0

So it looks like the local copy inside the function is being properly modified, but the version referenced by the caller is not. Why not?

Recommended answer

Is this a pointer issue à la this issue, where serializing a data.table destroys the over-allocated pointers?

Yes, loading from disk sets the external pointer to NULL, so we have to over-allocate again.
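To see the effect directly, one can compare truelength() before and after a round trip through disk. This is only a minimal sketch: the temporary file names and the variables `from_rds` and `from_csv` are made up for illustration, and the exact non-zero values depend on the data.table version and its over-allocation option.

```r
library(data.table)

dt <- data.table(id = letters[1:2], val = 1:2)

# Hypothetical temporary files used only for this illustration
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(dt, rds_path)
write.csv(dt, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)  # plain unserialize, no over-allocation
from_csv <- fread(csv_path)    # built by data.table, so it is over-allocated

truelength(dt)        # non-zero: spare column slots are available
truelength(from_rds)  # expected to be 0, matching the question
truelength(from_csv)  # non-zero, which is why the CSV route "just works"
```

This is consistent with the observation in the question that only the RDS route loses the ability to add columns by reference inside a function.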

Is there a simple way to restore them?

Yes. You can test the truelength() of the data.table, and if it is 0, use setDT() or alloc.col() on it.

truelength(test2) # [1] 0
if (!truelength(test2))
    setDT(test2)
truelength(test2) # [1] 100

foobar(test2, "new")
test2[]
#    id val new
# 1:  a   1   1
# 2:  b   2   1
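For the "error if the operation isn't going to work" part of the question, one possible approach is to check truelength() inside the function before adding columns. The sketch below is an assumption-laden illustration, not an official data.table idiom: `foobar_checked` is a hypothetical variant of the question's foobar(), and it only checks for spare column slots, not for every condition under which data.table may take a shallow copy.

```r
library(data.table)

# Hypothetical variant of foobar(): refuse to add new columns when `dt`
# has no spare column slots, instead of modifying only a local copy.
foobar_checked <- function(dt, col) {
  stopifnot(is.data.table(dt))
  new_cols <- setdiff(col, names(dt))
  # Existing columns can be updated in place; only new ones need spare slots
  if (length(new_cols) > 0L &&
      truelength(dt) < length(dt) + length(new_cols)) {
    stop("data.table is not over-allocated; call setDT() or alloc.col() ",
         "on it in the calling scope first", call. = FALSE)
  }
  dt[, (col) := 1]
  invisible(dt)
}

tmp <- tempfile(fileext = ".rds")
saveRDS(data.table(id = letters[1:2], val = 1:2), tmp)
test2 <- readRDS(tmp)

# While test2 is in its freshly loaded state, the guard raises an error
res <- tryCatch(foobar_checked(test2, "new"),
                error = function(e) conditionMessage(e))

# Restore over-allocation where test2 lives, then add by reference
if (!truelength(test2)) setDT(test2)
foobar_checked(test2, "new")
names(test2)
```

The restore step is done in the scope that owns test2, as in the answer's snippet, because the question's func() and func2() show that doing it inside a function leaves the caller's table unchanged.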

This should probably go in as a FAQ (I can't remember seeing it there).
It is already in the FAQ, in the Warning Messages section.
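If the restore step has to live in a helper, one workable pattern is to return the re-allocated table and rebind it at the call site. This is a sketch under assumptions: `ensure_overalloc` and `add_col` are hypothetical names, and the claim that alloc.col() re-allocates only the vector of column pointers rather than copying the column data is my understanding of its documented shallow-copy behavior, so treat it as such.

```r
library(data.table)

# Hypothetical helper: re-over-allocate when there are no spare column
# slots, and return the result so the caller can rebind its own variable.
ensure_overalloc <- function(dt) {
  stopifnot(is.data.table(dt))
  if (truelength(dt) <= length(dt)) {
    dt <- alloc.col(dt)  # rebinds only the local `dt`
  }
  dt
}

# Same shape as foobar() in the question
add_col <- function(dt, col) {
  dt[, (col) := 1]
  invisible(dt)
}

tmp <- tempfile(fileext = ".rds")
saveRDS(data.table(id = letters[1:2], val = 1:2), tmp)

test2 <- readRDS(tmp)
test2 <- ensure_overalloc(test2)  # the explicit rebind is the important part
add_col(test2, "new")             # now persists in the caller's test2
names(test2)
```

The assignment is needed because the helper can only rebind its own local variable, which matches what func() and func2() showed in the question.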
