Why is using update on a lm inside a grouped data.table losing its model data?


Problem description

Ok, this is a weird one. I suspect this is a bug inside data.table, but it would be useful if anyone can explain why this is happening - what is update doing exactly?

I'm using the list(list()) trick inside data.table to store fitted models. When you create a sequence of lm objects each for different groupings, and then update those models, the model data for all models becomes that of the last grouping. This seems like a reference is hanging around somewhere where a copy should have been made, but I can't find where and I can't reproduce this outside of lm and update.

Concrete example:

Starting with the iris data, first give the three species different sample sizes, then fit an lm model to each species, then update those models:

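library(data.table)

set.seed(3)
DT = data.table(iris)
# drop a random subset of rows so each species has a different count
DT = DT[rnorm(150) < 0.9]
# one lm per species, stored in a list column
fit = DT[, list(list(lm(Sepal.Length ~ Sepal.Width + Petal.Length))),
          by = Species]
# refit each stored model via update()
fit2 = fit[, list(list(update(V1[[1]], ~.-Sepal.Length))), by = Species]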

The original data table has a different number of rows for each species:

DT[,.N, by = Species]
#       Species  N
# 1:     setosa 41
# 2: versicolor 39
# 3:  virginica 42

And the first fit confirms this:

fit[, nobs(V1[[1]]), by = Species]
#       Species V1
# 1:     setosa 41
# 2: versicolor 39
# 3:  virginica 42

But the updated second fit shows 42 for all models:

fit2[, nobs(V1[[1]]), by = Species]
#       Species V1
# 1:     setosa 42
# 2: versicolor 42
# 3:  virginica 42

We can also look at the model component, which contains the data used for fitting, and see that all the updated models are indeed using the final group's data. The question is: how has this happened?

head(fit$V1[[1]]$model)
#   Sepal.Length Sepal.Width Petal.Length
# 1          5.1         3.5          1.4
# 2          4.9         3.0          1.4
# 3          4.7         3.2          1.3
# 4          4.6         3.1          1.5
# 5          5.0         3.6          1.4
# 6          5.4         3.9          1.7
head(fit$V1[[3]]$model)
#   Sepal.Length Sepal.Width Petal.Length
# 1          6.3         3.3          6.0
# 2          5.8         2.7          5.1
# 3          6.3         2.9          5.6
# 4          7.6         3.0          6.6
# 5          4.9         2.5          4.5
# 6          7.3         2.9          6.3
head(fit2$V1[[1]]$model)
#   Sepal.Length Sepal.Width Petal.Length
# 1          6.3         3.3          6.0
# 2          5.8         2.7          5.1
# 3          6.3         2.9          5.6
# 4          7.6         3.0          6.6
# 5          4.9         2.5          4.5
# 6          7.3         2.9          6.3
head(fit2$V1[[3]]$model)
#   Sepal.Length Sepal.Width Petal.Length
# 1          6.3         3.3          6.0
# 2          5.8         2.7          5.1
# 3          6.3         2.9          5.6
# 4          7.6         3.0          6.6
# 5          4.9         2.5          4.5
# 6          7.3         2.9          6.3

Solution

This is not an answer, but it is too long for a comment.

The .Environment attribute of the terms component is identical for each resulting model:

e1 <- attr(fit[['V1']][[1]]$terms, '.Environment')
e2 <- attr(fit[['V1']][[2]]$terms, '.Environment')
e3 <- attr(fit[['V1']][[3]]$terms, '.Environment')
identical(e1,e2)
## TRUE
identical(e2, e3)
## TRUE

It appears that data.table reuses the same bit of memory (my non-technical term) for each evaluation of j by group, which is efficient. However, when update is called, the model is refit using this shared environment, which by then contains the values from the last group.
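The effect can be imitated without data.table. The following is only a base-R sketch under an assumption: the environment `shared` is a hypothetical stand-in for whatever data.table reuses between groups, and the group sizes (40, 45, 50) are chosen for illustration. Each model is fitted without a data argument, so lm finds its variables through the formula's environment.

```r
# Hypothetical stand-in for the environment reused across groups
shared <- new.env()

groups <- split(iris, iris$Species)
groups$setosa     <- groups$setosa[1:40, ]      # make group sizes differ
groups$versicolor <- groups$versicolor[1:45, ]

fits <- lapply(groups, function(g) {
  # overwrite the variables in the shared environment with this group's columns
  list2env(g, envir = shared)
  f <- Sepal.Length ~ Sepal.Width + Petal.Length
  environment(f) <- shared
  lm(f)  # no data argument: variables come from environment(f)
})

# Original fits saw the right data at fitting time
sapply(fits, nobs)

# update() refits each model, looking variables up in `shared` again,
# which now holds only the last group's columns
refits <- lapply(fits, update, . ~ . - Petal.Length)
sapply(refits, nobs)
```

In this sketch the first `sapply` reports the three distinct group sizes, while the second reports the size of the last group for every model, which mirrors the symptom in the question.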

So, if you fudge this, it will work

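fit = DT[, {
  # copy this group's columns into an environment of their own
  xx <- list2env(copy(.SD))
  mymodel <- lm(Sepal.Length ~ Sepal.Width + Petal.Length)
  # point the model's terms at that private environment
  attr(mymodel$terms, '.Environment') <- xx
  list(list(mymodel))
}, by = 'Species']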

lfit2 <- fit[, list(list(update(V1[[1]], ~.-Sepal.Width))), by = Species]
lfit2[, lapply(V1, nobs)]
#    V1 V2 V3
# 1: 41 39 42
# using your exact diagnostic coding.
lfit2[, nobs(V1[[1]]), by = Species]
#       Species V1
# 1:     setosa 41
# 2: versicolor 39
# 3:  virginica 42

Not a long-term solution, but at least a workaround.
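The same idea, giving every model an environment of its own, can be checked in base R. This is a sketch with illustrative group sizes and hypothetical names (`fits_private`, `refits_private`), not a description of data.table internals.

```r
groups <- split(iris, iris$Species)
groups$setosa     <- groups$setosa[1:40, ]      # make group sizes differ
groups$versicolor <- groups$versicolor[1:45, ]

fits_private <- lapply(groups, function(g) {
  # a fresh environment per group, holding only that group's columns
  own <- list2env(g)
  f <- Sepal.Length ~ Sepal.Width + Petal.Length
  environment(f) <- own
  lm(f)
})

# update() now refits each model against its own private copy of the data
refits_private <- lapply(fits_private, update, . ~ . - Petal.Length)
sapply(refits_private, nobs)
```

Because each formula carries a distinct environment, the refitted models keep their own group sizes.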
