R - using glm inside a data.table
Problem description
I'm trying to fit some glms inside a data.table to produce modelled results split by key factors.
I've been doing this successfully for:

High level glm

    glm(modellingDF, formula = Outcome ~ IntCol + DecCol, family = binomial(link = logit))

Scoped glm with single columns

    modellingDF[, list(Outcome,
                       fitted = glm(x, formula = Outcome ~ IntCol, family = binomial(link = logit))$fitted),
                by = variable]

Scoped glm with two integer columns

    modellingDF[, list(Outcome,
                       fitted = glm(x, formula = Outcome ~ IntCol + IntCol2, family = binomial(link = logit))$fitted),
                by = variable]
But, when I try and do the high level glm inside the scope with my decimal column, it produces this error

    Error in model.frame.default(formula = Outcome ~ IntCol + DecCol, data = x,  :
      variable lengths differ (found for 'DecCol')

I thought perhaps it was due to variable lengths of the partitions, so I tested with a reproducible example:

    library("data.table")

    testing <- data.table(letters   = sample(rep(LETTERS, 5000), 5000),
                          letters2  = sample(rep(LETTERS[1:5], 10000), 5000),
                          cont.var  = rnorm(5000),
                          cont.var2 = round(rnorm(5000) * 1000, 0),
                          outcome   = rbinom(5000, 1, 0.8),
                          key = "letters")

    testing.glm <- testing[, list(outcome,
                                  fitted = glm(x, formula = outcome ~ cont.var + cont.var2,
                                               family = binomial(link = logit))$fitted),
                           by = list(letters)]

But this did not have the error. I thought maybe it was due to NAs or something but a summary of the data.table modellingDF gives no indication that there should be any issues:

         DecCol
     Min.   :0.0416
     1st Qu.:0.6122
     Median :0.7220
     Mean   :0.6794
     3rd Qu.:0.7840
     Max.   :0.9495

    nrow(modellingDF[is.na(DecCol), ])  # results in 0

    modellingDF[, list(len = .N,
                       DecCollen  = length(DecCol),
                       IntCollen  = length(IntCol),
                       Outcomelen = length(Outcome)),
                by = Bracket]

       Bracket   len DecCollen IntCollen Outcomelen
    1:     3-6 39184     39184     39184      39184
    2:     1-2 19909     19909     19909      19909
    3:       0  9912      9912      9912       9912

Perhaps I'm having a dozy day, but could anyone suggest a solution or a means for digging into this issue further?
Solution

You need to correctly specify the data argument within glm. Inside a data.table (using [), the data for the current group is referenced by .SD. (See "create a formula in a data.table environment in R" for a related question.) So

    modellingDF[, list(Outcome,
                       fitted = glm(data = .SD,
                                    formula = Outcome ~ IntCol,
                                    family = binomial(link = logit))$fitted),
                by = variable]

will work.
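
As a minimal sketch of the same idea on simulated data: the table `toy`, the grouping column `grp`, the seed, and the simulated coefficients are all assumptions made up for illustration, while the column names `IntCol`, `DecCol` and `Outcome` mirror the question. It uses the full name `fitted.values`, which `$fitted` reaches through partial matching.

```r
library(data.table)

set.seed(42)  # assumed seed, only so the sketch is repeatable
n <- 600

# Hypothetical stand-in for modellingDF
toy <- data.table(
  grp    = sample(c("a", "b", "c"), n, replace = TRUE),  # assumed grouping column
  IntCol = sample(1:10, n, replace = TRUE),
  DecCol = runif(n)
)
toy[, Outcome := rbinom(n, 1, plogis(-1 + 0.2 * IntCol + DecCol))]

# .SD holds only the rows of the current group, so every variable in the
# formula has the same length as that group.
fits <- toy[, list(Outcome,
                   fitted = glm(Outcome ~ IntCol + DecCol,
                                family = binomial(link = logit),
                                data = .SD)$fitted.values),
            by = grp]

head(fits)
```

The result has one row per input row, with the fitted probabilities computed from a separate model for each level of `grp`.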
While this approach is sound in this case (simply extracting the fitted values and moving on), using data.table and .SD can get you into a mess of environments if you save the whole model and then attempt to update it (see "Why is using update on a lm inside a grouped data.table losing its model data?").
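
If only per-group summaries are needed, one way to sidestep that problem is to reduce each model to plain values inside the same grouped call, so that no model object outlives the group it was fitted on. The sketch below is not from the original answer; `toy`, `grp`, the seed, and the simulated data are assumptions for illustration.

```r
library(data.table)

set.seed(1)  # assumed seed, only so the sketch is repeatable
n <- 600

# Hypothetical data: three groups of equal size
toy <- data.table(
  grp    = rep(c("a", "b", "c"), each = n / 3),
  IntCol = sample(1:10, n, replace = TRUE)
)
toy[, Outcome := rbinom(n, 1, plogis(-1 + 0.2 * IntCol))]

# coef() returns a named numeric vector; as.list() turns it into one
# column per coefficient, giving one row per group.
coefs <- toy[, as.list(coef(glm(Outcome ~ IntCol,
                                family = binomial(link = logit),
                                data = .SD))),
             by = grp]

coefs
```

Because only the coefficients leave the grouped call, nothing later depends on `.SD` still being available, which is the situation the `update` caveat describes.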