How to plot interaction effects from extremely large data sets (esp. from rxGlm output)
Question
I am currently computing glm models on a huge data set. Both glm and even speedglm take days to compute.
I currently have around 3M observations and 400 variables in total, only some of which are used in the regression. The regression uses 4 integer independent variables (iv1, iv2, iv3, iv4), 1 binary independent variable coded as a factor (iv5), and an interaction term (x * y, where x is an integer and y is a binary dummy variable coded as a factor). Finally, I have fixed effects for years (ff1) and company IDs (ff2). There are 15 years and 3000 companies. I introduced the fixed effects by adding them as factors. I observe that the 3000 company fixed effects in particular make the computation very slow in stats glm and also in speedglm.
I therefore decided to try Microsoft R's rxGlm (RevoScaleR), as it can use more threads and processor cores. Indeed, the analysis is a lot faster. I also compared the results for a sub-sample with those of standard glm, and they matched.
I used the following call:
mod1 <- rxGlm(formula = dv ~
                iv1 + iv2 + iv3 +
                iv4 + iv5 +
                x * y +
                ff1 + ff2,
              family = binomial(link = "probit"), data = dat,
              dropFirst = TRUE, dropMain = FALSE, covCoef = TRUE, cube = FALSE)
However, I am facing a problem when trying to plot the interaction term using the effects package. Calling the following function produces this error:
> plot(effect("x*y", mod1))
Error in terms.default(model) : no terms component nor attribute
I assume the problem is that rxGlm does not store the data needed to plot the interaction. I believe so because the rxGlm object is a lot smaller than the glm object and hence likely contains less data (80 MB vs. several GB).
I then tried to convert the rxGlm object to glm via as.glm(). Still, the effect() call does not yield a result and produces the following error messages:
Error in dnorm(eta) :
Non-numerical argument for mathematical function
In addition: Warning messages:
1: In model.matrix.default(mod, data = list(dv = c(1L, 2L, :
variable 'x for y' is absent, its contrast will be ignored
If I now compare an original glm object to the "converted glm", I find that the converted one contains far fewer components. E.g., it does not contain effects, and for contrasts it states only contr.treatment for each variable.
I am now looking primarily for a way to convert the rxGlm output object into a format that I can use with the effect() function. If there is no way to do so, how can I get an interaction plot using functions within the RevoScaleR package, e.g., rxLinePlot()? rxLinePlot() also plots reasonably quickly; however, I have not yet found a way to get typical interaction effect plots out of it. I want to avoid first fitting the full glm model and then plotting, because this takes very long.
Recommended answer
If you can get the coefficients, can't you just roll your own? This would not be a dataset-size issue, since the plot needs only the fitted coefficients.
# example data
n <- 2000
dat <- data.frame(dv  = sample(0:1, size = n, replace = TRUE),
                  iv1 = sample(1:10, size = n, replace = TRUE),
                  iv2 = sample(1:10, size = n, replace = TRUE),
                  iv3 = sample(1:10, size = n, replace = TRUE),
                  iv4 = sample(0:10, size = n, replace = TRUE),
                  iv5 = as.factor(sample(0:1, size = n, replace = TRUE)),
                  x   = sample(1:100, size = n, replace = TRUE),
                  y   = as.factor(sample(0:1, size = n, replace = TRUE)),
                  ff1 = as.factor(sample(1:15, size = n, replace = TRUE)),
                  ff2 = as.factor(sample(1:100, size = n, replace = TRUE))
)
mod1 <- glm(formula = dv ~
              iv1 + iv2 + iv3 +
              iv4 + iv5 +
              x * y +
              ff1 + ff2,
            family = binomial(link = "probit"), data = dat)
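# coefficients for x, y and their interaction
x1 <- coef(mod1)['x']
y1 <- coef(mod1)['y1']
xy <- coef(mod1)['x:y1']
x <- 1:100
# contributions to the linear predictor (probit scale);
# the intercept and all other terms are left out
a <- x1*x                 # y = 0
b <- x1*x + y1 + xy*x     # y = 1
# type = 'l' draws lines ('line' is not a valid plot type and triggers a warning)
plot(a~x, type = 'l', col = 'red', xlim = c(0, max(x)), ylim = range(c(a, b)),
     ylab = 'linear predictor')
lines(b~x, col = 'blue')
# lty is needed, otherwise the legend shows no line samples
legend('topright', c('y = 0', 'y = 1'), col = c('red', 'blue'), lty = 1)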
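The lines above are on the probit (linear predictor) scale and leave out the intercept. If you want something closer to what effect() draws, a rough sketch is to build the linear predictor by hand, hold the other covariates at a typical value, and push it through pnorm(). The example below is an illustration under assumptions, not a tested rxGlm recipe: it uses a small made-up data set and an ordinary glm fit, with coef() and vcov() as the source of the coefficients and their covariance matrix. With rxGlm you would have to pull the same two pieces out of the fitted object (covCoef = TRUE was requested in the question, so the covariance matrix should be stored), which I have not checked here. The names fit, x_grid, design_rows and band are made up for this example.

```r
# Hypothetical small example data (not the asker's data)
set.seed(1)
n <- 2000
dat <- data.frame(dv  = sample(0:1, n, replace = TRUE),
                  iv1 = sample(1:10, n, replace = TRUE),
                  x   = sample(1:100, n, replace = TRUE),
                  y   = factor(sample(0:1, n, replace = TRUE)))
fit <- glm(dv ~ iv1 + x * y, family = binomial(link = "probit"), data = dat)

b <- coef(fit)                      # coefficient vector
V <- vcov(fit)[names(b), names(b)]  # covariance matrix of the coefficients

x_grid <- 1:100

# Design rows for a given level of y; iv1 is held at its sample mean
design_rows <- function(y_val) {
  X <- cbind("(Intercept)" = 1,
             iv1    = mean(dat$iv1),
             x      = x_grid,
             y1     = y_val,
             "x:y1" = x_grid * y_val)
  X[, names(b), drop = FALSE]       # same column order as the coefficients
}

# Predicted probability with an approximate 95% band, computed on the
# probit scale and then transformed with pnorm()
band <- function(y_val) {
  X   <- design_rows(y_val)
  eta <- as.vector(X %*% b)
  se  <- sqrt(rowSums((X %*% V) * X))
  data.frame(p  = pnorm(eta),
             lo = pnorm(eta - 1.96 * se),
             hi = pnorm(eta + 1.96 * se))
}

p0 <- band(0)
p1 <- band(1)

plot(x_grid, p0$p, type = "l", col = "red",
     ylim = range(c(p0$lo, p0$hi, p1$lo, p1$hi)),
     xlab = "x", ylab = "predicted probability")
lines(x_grid, p0$lo, col = "red", lty = 2)
lines(x_grid, p0$hi, col = "red", lty = 2)
lines(x_grid, p1$p, col = "blue")
lines(x_grid, p1$lo, col = "blue", lty = 2)
lines(x_grid, p1$hi, col = "blue", lty = 2)
legend("topright", c("y = 0", "y = 1"), col = c("red", "blue"), lty = 1)
```

With many fixed effects you would also need to pick a reference year and company (or average over them) when building the design rows; here there are none, to keep the sketch short. None of this touches the original 3M rows, only the coefficient vector and its covariance matrix.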