Model runs with glm but not bigglm

Question

I was trying to run a logistic regression on 320,000 rows of data (6 variables). Stepwise model selection on a 10,000-row sample of the data gives a rather complex model with 5 interaction terms: Y ~ X1 + X2*X3 + X2*X4 + X2*X5 + X3*X6 + X4*X5. The glm() function could fit this model on the 10,000-row sample, but not on the whole dataset (320,000 rows).

Using bigglm to read the data chunk by chunk from a SQL Server database resulted in an error, and I couldn't make sense of the output of traceback():

fit <- bigglm(Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5, 
       data=sqlQuery(myconn,train_dat),family=binomial(link="logit"), 
       chunksize=1000, maxit=10)

Error in coef.bigqr(object$qr) : 
NA/NaN/Inf in foreign function call (arg 3)

> traceback()
11: .Fortran("regcf", as.integer(p), as.integer(p * p/2), bigQR$D, 
    bigQR$rbar, bigQR$thetab, bigQR$tol, beta = numeric(p), nreq = as.integer(nvar), 
    ier = integer(1), DUP = FALSE)
10: coef.bigqr(object$qr)
9: coef(object$qr)
8: coef.biglm(iwlm)
7: coef(iwlm)
6: bigglm.function(formula = formula, data = datafun, ...)
5: bigglm(formula = formula, data = datafun, ...)
4: bigglm(formula = formula, data = datafun, ...)

bigglm was able to fit a smaller model with fewer interaction terms, but it could not fit this same model even on a small dataset (10,000 rows).

Has anyone run into this problem before? Is there any other approach to fitting a complex logistic model on big data?

Recommended answer

I've run into this problem many times, and it was always caused by the chunks processed by bigglm not containing all the levels of a categorical (factor) variable.

bigglm processes data in chunks, and the default chunk size is 5000. If your categorical variable has, say, 5 levels (a, b, c, d, e) and your first chunk (rows 1:5000) contains only (a, b, c, d) but no "e", you will get this error.
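The kind of mismatch described above can be illustrated with base R alone. This is only a sketch on hypothetical toy data (the data frame `dat`, the column `g`, and the chunk size of 10 are made up for illustration), and it does not reproduce biglm's internals. It shows that when each chunk is turned into a factor on its own, a chunk that lacks level "e" yields a design matrix with fewer columns than a chunk that has it.

```r
# Hypothetical toy data: level "e" appears only in the last rows.
dat <- data.frame(
  y = rep(c(0, 1), 10),
  g = c(rep(c("a", "b", "c", "d"), times = 4), rep("e", 4)),
  stringsAsFactors = FALSE
)

# Split the rows into consecutive chunks of 10 (assumed chunk size).
chunksize <- 10
chunk_id <- ceiling(seq_len(nrow(dat)) / chunksize)
chunks <- split(dat, chunk_id)

# Build a design matrix per chunk, inferring factor levels from that chunk only.
cols_per_chunk <- lapply(chunks, function(ch) {
  ch$g <- factor(ch$g)
  colnames(model.matrix(y ~ g, data = ch))
})

# Compare the column names: the first chunk has no column for level "e".
print(cols_per_chunk)
```

With data read from a database, text columns often arrive as plain character vectors, so it may be worth checking which levels each chunk actually contains before fitting.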

What you can do is increase the "chunksize" argument and/or reorder your data frame so that each chunk contains ALL the levels.
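A minimal sketch of how one might check that condition before calling bigglm, again in base R on hypothetical toy data. The helper `chunks_cover_levels` is not part of biglm; it is an assumed name introduced here for illustration, and it simply reports whether every block of `chunksize` consecutive rows contains every value of a given column.

```r
# Hypothetical helper (not part of biglm): TRUE if every chunk of `chunksize`
# consecutive rows contains every distinct value of column `var`.
chunks_cover_levels <- function(df, var, chunksize) {
  values <- as.character(df[[var]])
  all_levels <- unique(values)
  chunk_id <- ceiling(seq_len(nrow(df)) / chunksize)
  per_chunk <- split(values, chunk_id)
  all(vapply(per_chunk, function(v) all(all_levels %in% v), logical(1)))
}

# Toy data sorted by group, so the early chunks miss the later levels.
dat <- data.frame(
  y = rep(c(0, 1), 50),
  g = rep(c("a", "b", "c", "d", "e"), each = 20)
)

# Sorted data with a small chunk size: the first chunk lacks "c", "d", "e".
sorted_ok <- chunks_cover_levels(dat, "g", chunksize = 25)

# Option 1: shuffle the rows so levels are spread across chunks, then re-check.
set.seed(42)
shuffled <- dat[sample(nrow(dat)), ]
shuffled_ok <- chunks_cover_levels(shuffled, "g", chunksize = 25)

# Option 2: a chunk size covering all rows trivially contains every level.
one_chunk_ok <- chunks_cover_levels(dat, "g", chunksize = nrow(dat))

print(c(sorted = sorted_ok, shuffled = shuffled_ok, one_chunk = one_chunk_ok))
```

A random shuffle does not guarantee coverage when some levels are rare, which is why the sketch re-checks after shuffling. If the check still fails, a larger chunksize or a deliberate interleaving of the rare levels is the next thing to try.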

Hope this helps (at least somebody).
