How to plot interaction effects from extremely large data sets (esp. from rxGlm output)


Question

I am currently computing glm models on a huge data set. Both glm and even speedglm take days to compute.

I currently have around 3M observations and 400 variables in total, only some of which are used in the regression. The regression uses 4 integer independent variables (iv1, iv2, iv3, iv4), 1 binary independent variable as a factor (iv5), and an interaction term (x * y, where x is an integer and y is a binary dummy variable as a factor). Finally, I have fixed effects for years (ff1) and company IDs (ff2). There are 15 years and 3000 companies. I introduced the fixed effects by adding them as factors. I observe that the 3000 company fixed effects in particular make the computation very slow in stats glm and also in speedglm.

I therefore decided to try Microsoft R's rxGlm (RevoScaleR), as it can make use of more threads and processor cores. Indeed, the analysis is a lot faster. I also compared the results for a sub-sample to those of standard glm, and they matched.

I used the following function:

mod1 <- rxGlm(formula = dv ~ 
                      iv1 + iv2 + iv3+ 
                      iv4 + iv5 +
                      x * y +
                      ff1  + ff2,
                    family = binomial(link = "probit"), data = dat,
                    dropFirst = TRUE, dropMain = FALSE, covCoef = TRUE, cube = FALSE)

However, I run into a problem when trying to plot the interaction term using the effects package. Calling the following function produces this error:

> plot(effect("x*y", mod1))
Error in terms.default(model) : no terms component nor attribute

I assume the problem is that rxGlm does not store the data needed to plot the interaction. I believe so because the rxGlm object is a lot smaller than the glm object, and hence likely contains less data (80 MB vs. several GB).

I then tried to convert the rxGlm object to a glm object via as.glm(). Still, the effect() call does not yield a result and produces the following error messages:

Error in dnorm(eta) : 
  Non-numerical argument for mathematical function
In addition: Warning messages:
1: In model.matrix.default(mod, data = list(dv = c(1L, 2L,  :
  variable 'x for y' is absent, its contrast will be ignored

If I compare an original glm object to the "converted glm", I find that the converted glm contains far fewer items. E.g., it does not contain effects, and for contrasts it states only contr.treatment for each variable.

I am now primarily looking for a way to convert the rxGlm output object into a format that I can use with the effect() function. If there is no way to do so, how can I get an interaction plot using functions within the RevoScaleR package, e.g., rxLinePlot()? rxLinePlot() also plots reasonably quickly; however, I have not yet found a way to get typical interaction effect plots out of it. I want to avoid first fitting the full glm model and then plotting, because this takes very long.

Answer

If you can get the coefficients, can't you just roll your own? This would not be a dataset size issue.

# ex. data
n = 2000
dat <- data.frame( dv = sample(0:1, size = n, rep = TRUE), 
                   iv1 = sample(1:10, size = n, rep = TRUE),
                   iv2 = sample(1:10, size = n, rep = TRUE),
                   iv3 = sample(1:10, size = n, rep = TRUE),
                   iv4 = sample(0:10, size = n, rep = TRUE),
                   iv5 = as.factor(sample(0:1, size = n, rep = TRUE)),
                   x = sample(1:100, size = n, rep = TRUE),
                   y = as.factor(sample(0:1, size = n, rep = TRUE)),
                   ff1  = as.factor(sample(1:15, size = n, rep = TRUE)),
                   ff2  = as.factor(sample(1:100, size = n, rep = TRUE))
                   )

mod1 <- glm(formula = dv ~ 
                      iv1 + iv2 + iv3+ 
                      iv4 + iv5 +
                      x * y +
                      ff1  + ff2,
                    family = binomial(link = "probit"), data = dat)

# coefficients for x, y and their interaction
x1 <- coef(mod1)['x']
y1 <- coef(mod1)['y1']
xy <- coef(mod1)['x:y1']

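# Contribution of x, y and x:y to the linear predictor (probit scale);
# the intercept and the other covariates are left out
x <- 1:100
a <- x1*x                  # y = 0
b <- x1*x + y1 + xy*x      # y = 1

plot(a~x, type = 'l', col = 'red', xlim = c(0,max(x)), ylim = range(c(a, b)))
lines(b~x, col = 'blue')
legend('topright', c('y = 0', 'y = 1'), col = c('red', 'blue'), lty = 1)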
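
The lines above are on the probit (linear predictor) scale. If predicted probabilities are preferred, the same idea can be extended with pnorm() and, where a coefficient covariance matrix is available, a pointwise band. The following is only a rough sketch under several assumptions: it uses a small simulated data set and a reduced model (dv ~ x * y) fitted with base glm, `interaction_curve()` is a hypothetical helper and not part of any package, and the coefficient names in `nm` follow glm's naming. An rxGlm fit may report different names, and how to extract its covariance matrix is not covered here; for the full model, the contributions of the other covariates and fixed effects would have to be added at chosen values.

```r
# Simulated data (hypothetical; reduced version of the example above)
set.seed(1)
n <- 2000
dat <- data.frame(
  dv = rbinom(n, 1, 0.5),
  x  = sample(1:100, n, replace = TRUE),
  y  = factor(sample(0:1, n, replace = TRUE))
)
fit <- glm(dv ~ x * y, family = binomial(link = "probit"), data = dat)

# From here on, only the coefficient vector and its covariance matrix are used
beta <- coef(fit)
V    <- vcov(fit)

# Hypothetical helper: predicted probability over x_grid for y_level (0 or 1)
interaction_curve <- function(beta, V, x_grid, y_level,
                              nm = c("(Intercept)", "x", "y1", "x:y1")) {
  # Design rows: intercept, x, dummy for y, and the x:y product
  X   <- cbind(1, x_grid, y_level, x_grid * y_level)
  eta <- drop(X %*% beta[nm])                  # linear predictor (probit index)
  se  <- sqrt(rowSums((X %*% V[nm, nm]) * X))  # standard error of eta
  data.frame(x  = x_grid,
             p  = pnorm(eta),                  # probability scale
             lo = pnorm(eta - 1.96 * se),      # Wald band on the link scale,
             hi = pnorm(eta + 1.96 * se))      # mapped through pnorm()
}

x_grid <- 1:100
c0 <- interaction_curve(beta, V, x_grid, y_level = 0)
c1 <- interaction_curve(beta, V, x_grid, y_level = 1)

plot(p ~ x, data = c0, type = "l", col = "red", ylim = c(0, 1),
     ylab = "Predicted probability")
lines(p ~ x, data = c1, col = "blue")
matlines(x_grid, cbind(c0$lo, c0$hi), lty = 2, col = "red")
matlines(x_grid, cbind(c1$lo, c1$hi), lty = 2, col = "blue")
legend("topright", c("y = 0", "y = 1"), col = c("red", "blue"), lty = 1)
```

Because the helper needs nothing but a named coefficient vector and a matching covariance matrix, its cost does not depend on the number of observations.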
