R - using glm inside a data.table


Problem description




I'm trying to fit some glms inside a data.table to produce modelled results split by key factors.

I've been doing this successfully for:

  • High level glm

    glm(modellingDF,formula=Outcome~IntCol + DecCol,family=binomial(link=logit))

  • Scoped glm with single columns

    modellingDF[,list(Outcome, fitted=glm(x,formula=Outcome~IntCol ,family=binomial(link=logit))$fitted ), by=variable]

  • Scoped glm with two integer columns

    modellingDF[,list(Outcome, fitted=glm(x,formula=Outcome~IntCol + IntCol2 ,family=binomial(link=logit))$fitted ), by=variable]

But when I try to run the high level glm inside the scope with my decimal column, it produces this error:

Error in model.frame.default(formula = Outcome ~ IntCol + DecCol, data = x,  : 
  variable lengths differ (found for 'DecCol')

I thought perhaps it was due to variable lengths of the partitions, so I tested with a reproducible example:

library("data.table")

testing<-data.table(letters=sample(rep(LETTERS,5000),5000),
                    letters2=sample(rep(LETTERS[1:5],10000),5000), 
                    cont.var=rnorm(5000),
                    cont.var2=round(rnorm(5000)*1000,0),
                    outcome=rbinom(5000,1,0.8)
                    ,key="letters")
testing.glm<-testing[,list(outcome,
                  fitted=glm(x,formula=outcome~cont.var+cont.var2,family=binomial(link=logit))$fitted
        ),by=list(letters)]

But this did not have the error. I thought maybe it was due to NAs or something but a summary of the data.table modellingDF gives no indication that there should be any issues:

DecCol
Min.   :0.0416
1st Qu.:0.6122
Median :0.7220
Mean   :0.6794
3rd Qu.:0.7840
Max.   :0.9495

nrow(modellingDF[is.na(DecCol),])   # results in 0

modellingDF[,list(len=.N,DecCollen=length(DecCol),IntCollen=length(IntCol),Outcomelen=length(Outcome)),by=Bracket]

  Bracket  len DecCollen IntCollen Outcomelen
1:     3-6 39184  39184       39184      39184
2:     1-2 19909  19909       19909      19909
3:       0  9912   9912        9912       9912

Perhaps I'm having a dozy day, but could anyone suggest a solution or a means for digging into this issue further?

Solution

You need to specify the data argument within glm correctly. Inside a data.table (using [), the data for the current group is referenced by .SD. (See "create a formula in a data.table environment in R" for a related question.)

So

modellingDF[,list(Outcome, fitted = glm(data = .SD, 
  formula = Outcome ~ IntCol ,family = binomial(link = logit))$fitted),
 by=variable]

will work.
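
As a rough illustration, the same pattern can be applied to simulated data shaped like the reproducible example in the question, with both continuous columns in the formula. The seed, the reduced number of groups, and the use of the full component name fitted.values are assumptions made for this sketch, not part of the original post.

```r
library(data.table)

set.seed(42)  # assumption: fixed seed so the sketch is repeatable

# Simulated data similar to the question's example (5 groups instead of 26)
testing <- data.table(
  letters   = sample(rep(LETTERS[1:5], 1000), 5000),
  cont.var  = rnorm(5000),
  cont.var2 = round(rnorm(5000) * 1000, 0),
  outcome   = rbinom(5000, 1, 0.8)
)

# Fit one logistic regression per group.
# .SD holds only the rows of the current group, so every variable in the
# formula has the same length inside each fit.
testing.glm <- testing[, list(outcome,
                              fitted = glm(outcome ~ cont.var + cont.var2,
                                           data = .SD,
                                           family = binomial(link = logit))$fitted.values),
                       by = list(letters)]

head(testing.glm)
```

The result has one row per original observation, with the fitted probability from that row's group-level model.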

In this case (simply extracting the fitted values and moving on) this approach is sound. However, using data.table and .SD can lead to a mess of environments if you save the whole model and then attempt to update it (see "Why is using update on a lm inside a grouped data.table losing its model data?").
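
If only summary quantities are needed, one possible way to stay on the "extract and move on" side is to pull them out inside j rather than keeping the model objects. The sketch below assumes hypothetical column names (grp, x, y) and simulated data; it is one option, not the only one.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed for a repeatable sketch

# Hypothetical data: three groups, one predictor, a binary response
dt <- data.table(grp = rep(c("a", "b", "c"), each = 200),
                 x   = rnorm(600))
dt[, y := rbinom(600, 1, plogis(0.5 + x))]

# Extract the coefficients of each group's model inside j,
# so no model object tied to .SD needs to be stored
coefs <- dt[, as.list(coef(glm(y ~ x, data = .SD,
                               family = binomial(link = logit)))),
            by = grp]

coefs
```

This returns one row per group with a column for each coefficient.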
