Stepwise regression error in R

Problem description

I want to run a stepwise regression in R to choose the best-fitting model. My code is below:

library(MASS)  # provides stepAIC()

full.modelfixed <- glm(died_ed ~ age_1 + gender + race + insurance + injury + ais + blunt_pen +
                         comorbid + iss + min_dist + pop_dens_new + age_mdn + male_pct +
                         pop_wht_pct + pop_blk_pct + unemp_pct + pov_100x_npct +
                         urban_pct,
                       data = trauma, family = binomial(link = "logit"), na.action = na.exclude)
reduced.modelfixed <- stepAIC(full.modelfixed, direction = "backward")

This produces the following error message:

Error in stepAIC(full.modelfixed, direction = "backward") :   
number of rows in use has changed: remove missing values?

Almost every variable in the data has some missing values, so I cannot simply drop every row that contains a missing value (data = na.omit(data)).

Any ideas on how to fix this?

Thanks!

Recommended answer

This should probably be in a stats forum (stats.stackexchange), but briefly, there are a number of considerations.

The main one is that when comparing two models, they need to be fitted on the same dataset (i.e. you need to be able to nest one model within the other).

For example:

glm1 <- glm(Dependent ~ indep1 + indep2 + indep3, family = binomial, data = data)
glm2 <- glm(Dependent ~ indep1 + indep2, family = binomial, data = data)

Now imagine that we are missing some values of indep3 but none of indep1 or indep2. When we run glm1, we are running it on a smaller dataset: the rows for which we have the dependent variable and all three independent variables (i.e. any row where indep3 is missing is excluded).

When we run glm2, the rows missing a value for indep3 are included, because those rows do contain Dependent, indep1 and indep2, which are the only variables in that model.

We can no longer compare the models directly, as they are fitted on different datasets.
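
To make this concrete, here is a small sketch on simulated data (the data frame `dat` and its values are made up for illustration and are not the trauma data from the question). It assumes R's default `na.action` of `na.omit`, and uses `nobs()` to show how many rows each fit actually used:

```r
# Simulated data; every name and number here is illustrative only.
set.seed(1)
n <- 200
dat <- data.frame(
  indep1 = rnorm(n),
  indep2 = rnorm(n),
  indep3 = rnorm(n)
)
dat$Dependent <- rbinom(n, size = 1,
                        prob = plogis(0.5 * dat$indep1 - 0.3 * dat$indep3))

# Blank out indep3 in 30 distinct rows.
dat$indep3[sample(n, 30)] <- NA

glm1 <- glm(Dependent ~ indep1 + indep2 + indep3, family = binomial, data = dat)
glm2 <- glm(Dependent ~ indep1 + indep2, family = binomial, data = dat)

nobs(glm1)  # 170: rows with missing indep3 were dropped
nobs(glm2)  # 200: every row is usable once indep3 leaves the model
```

The two AIC values are computed from different numbers of rows, so they are not comparable. As far as I know, this change in row count between steps is what stepAIC detects when it stops with "number of rows in use has changed".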

Broadly, I think you can either 1) limit the analysis to rows that are complete, or 2) if appropriate, consider multiple imputation.
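
A minimal sketch of option 1, again on simulated data (the names `dat`, `x1` to `x3` and `y` are hypothetical). Instead of calling `na.omit()` on the whole data frame, it keeps only the rows that are complete for the variables in the full model, so rows are not thrown away because of missing values in unrelated columns:

```r
library(MASS)  # provides stepAIC()

# Simulated data; every name and number here is illustrative only.
set.seed(42)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(0.8 * dat$x1 - 0.5 * dat$x2))
dat$x2[sample(n, 25)] <- NA
dat$x3[sample(n, 40)] <- NA

full_formula <- y ~ x1 + x2 + x3

# Keep rows that are complete on the model variables only.
model_vars <- all.vars(full_formula)
dat_cc <- dat[complete.cases(dat[, model_vars]), ]

# Every candidate model is now fitted on the same rows.
full_model <- glm(full_formula, family = binomial, data = dat_cc)
reduced_model <- stepAIC(full_model, direction = "backward", trace = FALSE)

nobs(full_model) == nobs(reduced_model)  # same rows at every step
```

Whether this is acceptable depends on how many rows survive and on why the values are missing. If too much data is lost, multiple imputation (option 2) may be the better route.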

Hope that helps.
