Factor levels default to 1 and 2 in R | Dummy variable


Question

I am transitioning from Stata to R. In Stata, if I label the levels of a factor (say, 0 and 1) as M and F, the underlying 0 and 1 remain as they are. Moreover, this 0/1 coding is required for dummy-variable linear regression in most software, including Excel and SPSS.

However, I've noticed that R defaults factor levels to 1 and 2 instead of 0 and 1. I don't know why R does this, even though regression internally (and correctly) uses 0 and 1 for the factor variable. I would appreciate any help.

Here is what I did:

Try #1:

sex<-c(0,1,0,1,1)
sex<-factor(sex,levels = c(1,0),labels = c("F","M"))
str(sex)
Factor w/ 2 levels "F","M": 2 1 2 1 1

It seems that the factor levels are now reset to 1 and 2. I believe the 1s and 2s are references to the factor levels here. However, I have lost the original values, i.e. the 0s and 1s.

Try #2:

sex<-c(0,1,0,1,1)
sex<-factor(sex,levels = c(0,1),labels = c("F","M"))
str(sex)
Factor w/ 2 levels "F","M": 1 2 1 2 2

Ditto. My 0s and 1s are now 1s and 2s. Quite surprising. Why is this happening?

Try #3:

Now, I wanted to see whether the 1s and 2s have any adverse effect on regression. So, here is what I did.

Here is what my data looks like:

> head(data.frame(sassign$total_,sassign$gender))
  sassign.total_ sassign.gender
1            357              M
2            138              M
3            172              F
4            272              F
5            149              F
6            113              F

myfit<-lm(sassign$total_ ~ sassign$gender)

myfit$coefficients
    (Intercept) sassign$genderM 
      200.63522        23.00606  

So, it turns out that the means are correct. While running the regression, R did use 0 and 1 values as dummies.

I did check other threads on SO, but they mostly talk about how R codes factor variables without telling me why. Stata and SPSS generally require the base level to be coded "0". So, I thought of asking about this.

I would appreciate any thoughts.

Recommended answer

In short, you are mixing up two different concepts. I will clarify them one by one below.

What you see in str()

What you see from str() is the internal representation of a factor variable. A factor is stored internally as an integer vector, where each number gives the position of the corresponding level within the vector of levels. For example:

x <- gl(3, 2, labels = letters[1:3])
#[1] a a b b c c
#Levels: a b c

storage.mode(x)  ## or `typeof(x)`
#[1] "integer"

str(x)
# Factor w/ 3 levels "a","b","c": 1 1 2 2 3 3

as.integer(x)
#[1] 1 1 2 2 3 3

levels(x)
#[1] "a" "b" "c"

A common use of such positions is to perform as.character(x) in the most efficient way:

levels(x)[x]
#[1] "a" "a" "b" "b" "c" "c"
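The same indexing idiom also addresses the worry about "losing" the original 0/1 values. The following is only a minimal sketch that reuses the sex vector from the question; the names f, g, orig_from_levels and orig_from_codes are made up for illustration and are not part of the original answer.

```r
# Illustration only: recover the original numeric values from a factor.
sex <- c(0, 1, 0, 1, 1)

# Case 1: no labels given, so the levels are the strings "0" and "1".
f <- factor(sex)
# Convert the levels to numbers once, then index them by the integer codes.
orig_from_levels <- as.numeric(levels(f))[f]

# Case 2: levels relabelled as "F" / "M". Because 0 was listed first in
# `levels = c(0, 1)`, the original value is simply the integer code minus 1.
g <- factor(sex, levels = c(0, 1), labels = c("F", "M"))
orig_from_codes <- as.integer(g) - 1L

print(orig_from_levels)
print(orig_from_codes)
```

In both cases nothing is lost: the integer codes together with the levels determine the original values, provided you know the order in which the levels were declared.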


Your misconception about the model matrix

It seems to me that you think a model matrix is obtained by

cbind(1L, as.integer(x))
#     [,1] [,2]
#[1,]    1    1
#[2,]    1    1
#[3,]    1    2
#[4,]    1    2
#[5,]    1    3
#[6,]    1    3

This is not true. Built this way, the matrix simply treats the factor variable as a numerical variable.
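To make the difference concrete, here is a small sketch that fits both versions with lm(). The response values in resp are made up purely for illustration; only x comes from the example above.

```r
# Illustration only: numeric coding vs. factor coding of the same variable.
x <- gl(3, 2, labels = letters[1:3])
resp <- c(1, 2, 4, 5, 9, 10)          # made-up response values

# Treating the integer codes as a number fits a single slope.
fit_num <- lm(resp ~ as.integer(x))

# Treating x as a factor fits one dummy coefficient per non-baseline level.
fit_fac <- lm(resp ~ x)

print(names(coef(fit_num)))
print(names(coef(fit_fac)))
# The intercept is the mean of the baseline level "a"; the other two
# coefficients are the differences of the "b" and "c" means from it.
print(coef(fit_fac))
```

The numeric fit has two coefficients (intercept and slope), whereas the factor fit has three, one for the baseline and one for each remaining level.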

The model matrix is instead constructed this way:

xlevels <- levels(x)
cbind(1L, match(x, xlevels[2], nomatch=0), match(x, xlevels[3], nomatch=0))
#     [,1] [,2] [,3]
#[1,]    1    0    0
#[2,]    1    0    0
#[3,]    1    1    0
#[4,]    1    1    0
#[5,]    1    0    1
#[6,]    1    0    1

The 1 and 0 indicate "match" / "occurrence" and "no match" / "no occurrence", respectively.

The R routine model.matrix will do this for you efficiently, with easy-to-read column names and row names:

model.matrix(~x)
#  (Intercept) xb xc
#1           1  0  0
#2           1  0  0
#3           1  1  0
#4           1  1  0
#5           1  0  1
#6           1  0  1
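Applied to the two-level sex factor from the question, the same call shows where the 0/1 dummy comes from. This is only a sketch; the names X, sex2 and X2 are introduced here for illustration.

```r
# Illustration only: the dummy column R builds for the question's factor.
sex <- factor(c(0, 1, 0, 1, 1), levels = c(0, 1), labels = c("F", "M"))

X <- model.matrix(~ sex)
print(colnames(X))          # "(Intercept)" "sexM"
print(unname(X[, "sexM"]))  # 0 1 0 1 1: 1 where sex is "M", 0 otherwise

# The baseline is the first level. relevel() changes which level that is,
# and therefore which level gets the dummy column.
sex2 <- relevel(sex, ref = "M")
X2 <- model.matrix(~ sex2)
print(colnames(X2))           # "(Intercept)" "sex2F"
print(unname(X2[, "sex2F"]))  # 1 0 1 0 0
```

So the internal codes 1 and 2 never enter the regression directly; the first level acts as the baseline coded 0, and choosing the baseline is a matter of level order.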


Writing an R function yourself to produce a model matrix

We could write a toy routine mm to generate a model matrix. Though it is much less efficient than model.matrix, it may help one digest the concept better.

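mm <- function (x, contrast = TRUE) {
  xlevels <- levels(x)
  ## one 0/1 indicator vector per level: 1 where x equals that level, else 0
  lst <- lapply(xlevels, function (z) match(x, z, nomatch = 0L))
  ## contrast = TRUE:  intercept column plus indicators for all but the first level
  ## contrast = FALSE: indicators for every level, no intercept column
  if (contrast) do.call("cbind", c(list(1L), lst[-1]))
  else do.call("cbind", lst)
}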

For example, if we have a factor y with 5 levels:

set.seed(1); y <- factor(sample(1:5, 10, replace=TRUE), labels = letters[1:5])
y
# [1] b b c e b e e d d a
#Levels: a b c d e
str(y)
#Factor w/ 5 levels "a","b","c","d",..: 2 2 3 5 2 5 5 4 4 1

Its model matrices with and without contrast treatment are, respectively:

mm(y, TRUE)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    1    0    0    0
# [2,]    1    1    0    0    0
# [3,]    1    0    1    0    0
# [4,]    1    0    0    0    1
# [5,]    1    1    0    0    0
# [6,]    1    0    0    0    1
# [7,]    1    0    0    0    1
# [8,]    1    0    0    1    0
# [9,]    1    0    0    1    0
#[10,]    1    0    0    0    0

mm(y, FALSE)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    1    0    0    0
# [2,]    0    1    0    0    0
# [3,]    0    0    1    0    0
# [4,]    0    0    0    0    1
# [5,]    0    1    0    0    0
# [6,]    0    0    0    0    1
# [7,]    0    0    0    0    1
# [8,]    0    0    0    1    0
# [9,]    0    0    0    1    0
#[10,]    1    0    0    0    0

The corresponding model.matrix calls would be, respectively:

model.matrix(~ y)
model.matrix(~ y - 1)
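As a sanity check, one can compare the toy routine against model.matrix entry by entry. The sketch below repeats the definition of mm so that it runs on its own, and it writes out the values of y explicitly rather than using sample(), since the random draw can differ between R versions; the helper same_entries is hypothetical and exists only for this comparison.

```r
# Illustration only: check the toy routine against model.matrix().
mm <- function (x, contrast = TRUE) {
  xlevels <- levels(x)
  lst <- lapply(xlevels, function (z) match(x, z, nomatch = 0L))
  if (contrast) do.call("cbind", c(list(1L), lst[-1]))
  else do.call("cbind", lst)
}

# Same values as the y shown above, written out explicitly.
y <- factor(c("b", "b", "c", "e", "b", "e", "e", "d", "d", "a"),
            levels = letters[1:5])

# Hypothetical helper: compare dimensions and entries, ignoring
# dimnames and other attributes.
same_entries <- function (a, b) {
  identical(dim(a), dim(b)) && all(as.vector(a) == as.vector(b))
}

with_contrast    <- same_entries(mm(y, TRUE),  model.matrix(~ y))
without_contrast <- same_entries(mm(y, FALSE), model.matrix(~ y - 1))

print(with_contrast)
print(without_contrast)
```

Both comparisons should come out TRUE; the only differences between the two routines are the row and column names and the extra attributes that model.matrix attaches.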
