How to drop NA observations of factors conditionally when doing linear regression in R?


Problem description


I'm trying to fit a simple linear regression model in R.

There are three factor variables in the model.

The model is

lm(Exercise ~ Econ + Job + Position)

where "Exercise" is numeric dependent variable, the amount of time exercising.

"Econ", "Job", "Position" are all factor variables.

"Econ" is whether a person is employed or not. (levels = employed / not employed)

"Job" is the job type a person has. There are five levels for this variable.

"Position" is the position a person has in the workplace. There are five levels for this variable also.

I tried to run the linear regression and got this error:

"contrasts can be applied only to factors with 2 or more levels"

I think this error is due to NA values in the factor levels, because if "Econ" equals 'unemployed', then "Job" and "Position" have NA values. (Obviously, unemployed people do not have a job type or a job position.)

If I regress the two models separately as below, no error occurs.

lm(Exercise ~ Econ)

lm(Exercise ~ Job + Position)

However, I want one model that can automatically use variables as needed, and one result table. So if "Econ" is 'employed', the "Job" and "Position" variables are used in the regression; if "Econ" is 'unemployed', the "Job" and "Position" variables are automatically dropped from the model.

The reason I want one model instead of two is that, by putting all the variables in one model, I can see the effect of "Econ" (employed or unemployed) as well as the effects of "Job" and "Position" among people who are 'employed'.

If I just regress

lm(Exercise ~ Job + Position)

I do not know the effect of employment.

I thought of a solution: set all NA values of "Job" and "Position" to a 0 = 'unemployed' level. But I am not sure this will solve the problem, and I thought it might lead to a multicollinearity problem.
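
For illustration, the recoding I had in mind would look roughly like this with the reproducible data below, using base R's addNA() to turn the missing values into an explicit level (I suspect "Econ" would then overlap completely with that new level):

Job2      <- addNA(Job)        # adds NA as an explicit factor level
Position2 <- addNA(Position)
levels(Job2)[is.na(levels(Job2))]           <- "Unemployed"
levels(Position2)[is.na(levels(Position2))] <- "Unemployed"

lm(Exercise ~ Econ + Job2 + Position2)  # no rows get dropped any more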

Is there any way to automatically/conditionally drop NA observations according to some other factor variable?

Below is my reproducible example.

    Exercise <- c(50, 30, 25, 44, 32, 50, 22, 14)
    Econ <- as.factor(c(1, 0, 1, 1, 0, 0, 1, 1))
    # 0 = unemployed, 1 = employed

    Job <- as.factor(c("A", NA, "B", "B", NA, NA, "A", "C"))

    Position <- as.factor(c("Owner", NA,"Employee", "Owner", 
                        NA, NA, "Employee", "Director")) 

    data <- data.frame(Exercise, Econ, Job, Position)

    str(data)

    lm(Exercise ~ Econ + Job + Position)

    lm(Exercise ~ Econ)

    lm(Exercise ~ Job + Position)

What I want here is the first model, lm(Exercise ~ Econ + Job + Position), but I get an error because for all Econ = 0 (unemployed), the Job and Position values are NA.

Solution

If you really just want the first model to run without errors (assuming the same missing-value handling you are using now), then you could do this:

lm(Exercise ~ as.integer(Econ) + Job + Position)

Note that all you have really done is reproduce the result of the third model.

lm(Exercise ~ Job + Position) # third model
lm(Exercise ~ as.integer(Econ) + Job + Position) # first model

coef(lm(Exercise ~ Job + Position))
coef(lm(Exercise ~ as.integer(Econ) + Job + Position))

Unless you change how you are handling missing values, the first model that you want, lm(Exercise ~ Econ + Job + Position), is equivalent to the third model, lm(Exercise ~ Job + Position). Here is why.

By default, na.action = na.omit within the lm function. This means that any row with a missing value in a predictor or the response will be dropped. There are multiple ways you can see this. One is to apply model.matrix, which is what lm does under the hood.

model.matrix(Exercise ~ Econ + Job + Position)
  (Intercept) Econ1 JobB JobC PositionEmployee PositionOwner
1           1     1    0    0                0             1
3           1     1    1    0                1             0
4           1     1    1    0                0             1
7           1     1    0    0                1             0
8           1     1    0    1                0             0
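
Another way to see the same thing is to check which rows na.omit() keeps when applied to the data frame from your reproducible example:

# Only the complete cases survive: rows 1, 3, 4, 7, 8,
# the same rows shown by model.matrix() above.
na.omit(data)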

As you already correctly pointed out, Econ = 0 is perfectly aligned with Position = NA. Thus, lm drops those observations and you end up with Econ taking a single value, and lm does not know how to handle a factor with a single level. I bypassed that error by using as.integer(); however, you still end up with a predictor that has only a single value.
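
You can check this directly: among the complete cases, Econ only ever takes the value 1 ("employed"), so it carries no information for the fit.

# After the incomplete rows are dropped, a single Econ value remains.
table(droplevels(na.omit(data))$Econ)
# 1 
# 5 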

Next, lm will silently drop such predictors, which is why you are getting an NA for the coefficient on as.integer(Econ). This is because the default is singular.ok = TRUE.

If you were to set singular.ok = FALSE, you would get an error that basically says you are trying to fit a model with a predictor that has only a single value.

lm(Exercise ~ as.integer(Econ) + Job + Position, singular.ok = FALSE)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  singular fit encountered
