Why is using update on a lm inside a grouped data.table losing its model data?

Question

Ok, this is a weird one. I suspect this is a bug inside data.table, but it would be useful if anyone could explain why this is happening: what exactly is update doing?

I'm using the list(list()) trick inside data.table to store fitted models. When you create a sequence of lm objects, one for each grouping, and then update those models, the model data for every model becomes that of the last grouping. It looks as if a reference is being kept somewhere a copy should have been made, but I can't find where, and I can't reproduce this outside of lm and update.

A concrete example:

Starting with the iris data, first give the three species different sample sizes, then fit an lm model to each species, then update those models:

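library(data.table)

set.seed(3)
DT = data.table(iris)
# Drop a random subset of rows so each species ends up with a different count
DT = DT[rnorm(150) < 0.9]
# One lm per species, stored in a list column (V1)
fit = DT[, list(list(lm(Sepal.Length ~ Sepal.Width + Petal.Length))),
          by = Species]
# Update each stored model
fit2 = fit[, list(list(update(V1[[1]], ~.-Sepal.Length))), by = Species]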

The original data table has a different number of rows for each species:

DT[,.N, by = Species]
#       Species  N
# 1:     setosa 41
# 2: versicolor 39
# 3:  virginica 42

And the first fit confirms this:

fit[, nobs(V1[[1]]), by = Species]
#       Species V1
# 1:     setosa 41
# 2: versicolor 39
# 3:  virginica 42

But the updated second fit shows 42 for all models:

fit2[, nobs(V1[[1]]), by = Species]
#       Species V1
# 1:     setosa 42
# 2: versicolor 42
# 3:  virginica 42

We can also look at the model component, which contains the data used for fitting, and see that all the updated models are indeed using the final group's data. The question is: how has this happened?

head(fit$V1[[1]]$model)
#   Sepal.Length Sepal.Width Petal.Length
# 1          5.1         3.5          1.4
# 2          4.9         3.0          1.4
# 3          4.7         3.2          1.3
# 4          4.6         3.1          1.5
# 5          5.0         3.6          1.4
# 6          5.4         3.9          1.7
head(fit$V1[[3]]$model)
#   Sepal.Length Sepal.Width Petal.Length
# 1          6.3         3.3          6.0
# 2          5.8         2.7          5.1
# 3          6.3         2.9          5.6
# 4          7.6         3.0          6.6
# 5          4.9         2.5          4.5
# 6          7.3         2.9          6.3
head(fit2$V1[[1]]$model)
#   Sepal.Length Sepal.Width Petal.Length
# 1          6.3         3.3          6.0
# 2          5.8         2.7          5.1
# 3          6.3         2.9          5.6
# 4          7.6         3.0          6.6
# 5          4.9         2.5          4.5
# 6          7.3         2.9          6.3
head(fit2$V1[[3]]$model)
#   Sepal.Length Sepal.Width Petal.Length
# 1          6.3         3.3          6.0
# 2          5.8         2.7          5.1
# 3          6.3         2.9          5.6
# 4          7.6         3.0          6.6
# 5          4.9         2.5          4.5
# 6          7.3         2.9          6.3

Recommended answer

This is not an answer, but it is too long for a comment.

The .Environment of the terms component is identical for each resulting model:

e1 <- attr(fit[['V1']][[1]]$terms, '.Environment')
e2 <- attr(fit[['V1']][[2]]$terms, '.Environment')
e3 <- attr(fit[['V1']][[3]]$terms, '.Environment')
identical(e1,e2)
## TRUE
identical(e2, e3)
## TRUE

It appears that data.table reuses the same bit of memory (my non-technical term) for each by-group evaluation of j, which is efficient. However, when update is called, it uses this environment to refit the model, and by then the environment holds the values from the last group.
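To illustrate the mechanism without data.table, here is a minimal base R sketch. It assumes that update re-evaluates the model's stored call and that lm, when given no data argument, looks the variables up in the formula's environment. The environment env and the toy values in it are made up for the demonstration; they only mimic one environment being reused and overwritten between fits.

```r
# Hypothetical stand-in for an environment that is reused across groups
env <- new.env()
assign("x", 1:5, envir = env)
assign("y", c(2.1, 3.9, 6.2, 7.8, 10.1), envir = env)

f <- y ~ x
environment(f) <- env   # the formula now resolves x and y in env

m1 <- lm(f)             # fitted on the 5 values currently in env

# Overwrite the variables in the same environment, as the "next group" would
assign("x", 1:8, envir = env)
assign("y", c(1.0, 2.2, 2.9, 4.1, 5.2, 5.8, 7.1, 8.0), envir = env)

# update() refits from the stored call, so it picks up the new contents of env
m2 <- update(m1, . ~ .)

nobs(m1)  # still reflects the data captured when m1 was fitted
nobs(m2)  # reflects whatever env holds at update time
```

The original fit keeps its own copy of the model frame, which is why fit looks fine in the question while fit2 does not.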

So, if you fudge the environment, it will work:

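fit = DT[, {
  # Snapshot this group's columns into a fresh environment
  xx <- list2env(copy(.SD))
  mymodel <- lm(Sepal.Length ~ Sepal.Width + Petal.Length)
  # Point the model's terms at the snapshot so update() refits on this group's data
  attr(mymodel$terms, '.Environment') <- xx
  list(list(mymodel))
}, by = 'Species']
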
lfit2 <- fit[, list(list(update(V1[[1]], ~.-Sepal.Width))), by = Species]
lfit2[, lapply(V1, nobs)]
#    V1 V2 V3
# 1: 41 39 42
# using your exact diagnostic coding.
lfit2[, nobs(V1[[1]]), by = Species]
#       Species V1
# 1:     setosa 41
# 2: versicolor 39
# 3:  virginica 42

Not a long-term solution, but at least a workaround.
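For comparison, here is a rough base R sketch of the same fit-then-update pattern that avoids the shared environment altogether. This is not the approach from the question, just an alternative under stated assumptions: each group's data is passed explicitly through data = d, and update is called inside the same function so that d is still visible when the stored call is re-evaluated. The row indices dropped below are arbitrary and only serve to give the species different sample sizes.

```r
# Arbitrary subset so the three species have different row counts
dat <- iris[-c(1:5, 51:60), ]

fits <- lapply(split(dat, dat$Species), function(d) {
  # Fit on this group's rows only, passing the data explicitly
  m <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = d)
  # Update inside the function, where `d` still exists for the refit
  update(m, . ~ . - Sepal.Width)
})

# One observation count per species
sapply(fits, nobs)
```

Calling update later from the top level would fail here, because the stored call refers to d, which only exists inside the function.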
