Specify different types of missing values (NAs)
Question
I would like to specify types of missing values. My data contain different kinds of missing values, and I am trying to code them as missing in R, but I am looking for a solution where I can still distinguish between them.
Say I have some data that looks like this:
set.seed(667)
df <- data.frame(
  a = sample(c("Don't know/Not sure", "Unknown", "Refused", "Blue", "Red", "Green"), 20, replace = TRUE),
  b = sample(c(1, 2, 3, 77, 88, 99), 10, replace = TRUE),
  f = round(rnorm(n = 10, mean = .90, sd = .08), digits = 2),
  g = sample(c("C", "M", "Y", "K"), 10, replace = TRUE)
)
df
#                      a  b    f g
# 1              Unknown  2 0.78 M
# 2              Refused  2 0.87 M
# 3                  Red 77 0.82 Y
# 4                  Red 99 0.78 Y
# 5                Green 77 0.97 M
# 6                Green  3 0.99 K
# 7                  Red  3 0.99 Y
# 8                Green 88 0.84 C
# 9              Unknown 99 1.08 M
# 10             Refused 99 0.81 C
# 11                Blue  2 0.78 M
# 12               Green  2 0.87 M
# 13                Blue 77 0.82 Y
# 14 Don't know/Not sure 99 0.78 Y
# 15             Unknown 77 0.97 M
# 16             Refused  3 0.99 K
# 17                Blue  3 0.99 Y
# 18               Green 88 0.84 C
# 19             Refused 99 1.08 M
# 20                 Red 99 0.81 C
If I now make two tables, my missing values ("Don't know/Not sure", "Unknown", "Refused" and 77, 88, 99) are included as regular data:
table(df$a,df$g)
#                       C K M Y
#   Blue                0 0 1 2
#   Don't know/Not sure 0 0 0 1
#   Green               2 1 2 0
#   Red                 1 0 0 3
#   Refused             1 1 2 0
#   Unknown             0 0 3 0
and
table(df$b,df$g)
#      C K M Y
#   2  0 0 4 0
#   3  0 2 0 2
#   77 0 0 2 2
#   88 2 0 0 0
#   99 2 0 2 2
I now recode the three factor levels "Don't know/Not sure", "Unknown", and "Refused" into <NA>:
is.na(df[,c("a")]) <- df[,c("a")]=="Don't know/Not sure"|df[,c("a")]=="Unknown"|df[,c("a")]=="Refused"
and drop the now-empty levels,
df$a <- factor(df$a)
and do the same for the numeric values 77, 88, and 99:
is.na(df) <- df=="77"|df=="88"|df=="99"
table(df$a, df$g, useNA = "always")
#         C K M Y <NA>
#   Blue  0 0 1 2    0
#   Green 2 1 2 0    0
#   Red   1 0 0 3    0
#   <NA>  1 1 5 1    0
table(df$b,df$g, useNA = "always")
#        C K M Y <NA>
#   2    0 0 4 0    0
#   3    0 2 0 2    0
#   <NA> 4 0 4 4    0
Now the missing categories are recoded into NA, but they are all lumped together. Is there a way in R to recode something as missing but retain the original values? I want R to treat "Don't know/Not sure", "Unknown", "Refused" and 77, 88, 99 as missing, but I still want to keep that information in the variable.
Recommended answer
To my knowledge, base R doesn't have a built-in way to handle different NA types. (Editor: It does: NA_integer_, NA_real_, NA_complex_, and NA_character_. See ?base::NA.) Note that those constants differ only in storage type; they do not record why a value is missing.
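To make concrete what the typed NA constants from the editor's note do and do not provide, here is a small base R sketch. The variable names (`na_types`, `all_missing`) are only illustrative and are not part of the original answer.

```r
# Each typed NA constant has its own storage type ...
na_types <- vapply(
  list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_),
  typeof,
  character(1)
)
print(na_types)

# ... but is.na() treats all of them the same way, so the type alone
# cannot encode a reason such as "Refused" versus "Unknown".
all_missing <- all(is.na(c(NA_integer_, NA_real_)), is.na(NA_character_))
print(all_missing)
```

In other words, the typed constants say how a missing value is stored, which is why a separate mechanism is still needed to keep track of different kinds of missingness.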
One option is to use a package that does, for instance "memisc". It takes a little extra work, but it seems to do what you're looking for.
Here is an example.
First, your data. I've made a copy, since we will be making some fairly significant changes to the dataset and it's always nice to have a backup.
set.seed(667)
df <- data.frame(a = sample(c("Don't know/Not sure", "Unknown",
"Refused", "Blue", "Red", "Green"),
20, replace = TRUE),
b = sample(c(1, 2, 3, 77, 88, 99), 10,
replace = TRUE),
f = round(rnorm(n = 10, mean = .90, sd = .08),
digits = 2),
g = sample(c("C", "M", "Y", "K"), 10,
replace = TRUE))
df2 <- df
Let's convert variable "a" to a factor:
df2$a <- factor(df2$a,
levels = c("Blue", "Red", "Green",
"Don't know/Not sure",
"Refused", "Unknown"),
labels = c(1, 2, 3, 77, 88, 99))
Load the "memisc" library:
library(memisc)
Now, convert variables "a" and "b" to items in "memisc":
df2$a <- as.item(as.character(df2$a),
labels = structure(c(1, 2, 3, 77, 88, 99),
names = c("Blue", "Red", "Green",
"Don't know/Not sure",
"Refused", "Unknown")),
missing.values = c(77, 88, 99))
df2$b <- as.item(df2$b,
labels = c(1, 2, 3, 77, 88, 99),
missing.values = c(77, 88, 99))
By doing this, we have a new data type. Compare the following:
as.factor(df2$a)
# [1] <NA> <NA> Red Red Green Green Red Green <NA> <NA> Blue
# [12] Green Blue <NA> <NA> <NA> Blue Green <NA> Red
# Levels: Blue Red Green
as.factor(include.missings(df2$a))
# [1] *Unknown *Refused Red
# [4] Red Green Green
# [7] Red Green *Unknown
# [10] *Refused Blue Green
# [13] Blue *Don't know/Not sure *Unknown
# [16] *Refused Blue Green
# [19] *Refused Red
# Levels: Blue Red Green *Don't know/Not sure *Refused *Unknown
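The same distinction can also be checked element by element. The sketch below is a hedged illustration on a made-up vector, not the data above: it assumes the memisc package is installed and that `is.missing()` behaves as described in memisc's value-filter documentation. The toy vector `x` and its labels are hypothetical.

```r
# Assumes memisc is installed; the guard skips the example otherwise.
if (requireNamespace("memisc", quietly = TRUE)) {
  # Hypothetical toy responses: 77 and 99 are declared as missing codes.
  x <- memisc::as.item(c(1, 2, 77, 99, 3),
                       labels = c(Blue = 1, Red = 2, Green = 3,
                                  "Don't know/Not sure" = 77, Unknown = 99),
                       missing.values = c(77, 99))
  # TRUE where the stored code is one of the declared missing values;
  # the codes themselves stay in the item and are not overwritten with NA.
  flagged <- as.vector(memisc::is.missing(x))
  print(flagged)
}
```

The point of the design is that missingness is a property declared on top of the stored codes, so the original values remain available whenever they are needed.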
We can use this information to create tables that behave the way you describe, while retaining all the original information.
table(as.factor(include.missings(df2$a)), df2$g)
#
#                        C K M Y
#   Blue                 0 0 1 2
#   Red                  1 0 0 3
#   Green                2 1 2 0
#   *Don't know/Not sure 0 0 0 1
#   *Refused             1 1 2 0
#   *Unknown             0 0 3 0
table(as.factor(df2$a), df2$g)
#
#         C K M Y
#   Blue  0 0 1 2
#   Red   1 0 0 3
#   Green 2 1 2 0
table(as.factor(df2$a), df2$g, useNA="always")
#
#         C K M Y <NA>
#   Blue  0 0 1 2    0
#   Red   1 0 0 3    0
#   Green 2 1 2 0    0
#   <NA>  1 1 5 1    0
The tables for the numeric column with missing data behave the same way.
table(as.factor(include.missings(df2$b)), df2$g)
#
#       C K M Y
#   1   0 0 0 0
#   2   0 0 4 0
#   3   0 2 0 2
#   *77 0 0 2 2
#   *88 2 0 0 0
#   *99 2 0 2 2
table(as.factor(df2$b), df2$g, useNA="always")
#
#        C K M Y <NA>
#   1    0 0 0 0    0
#   2    0 0 4 0    0
#   3    0 2 0 2    0
#   <NA> 4 0 4 4    0
As a bonus, you can easily generate nice codebooks:
codebook(df2$a)
# ========================================================================
#    df2$a
# ------------------------------------------------------------------------
#    Storage mode: character
#    Measurement: nominal
#    Missing values: 77, 88, 99
#
#    Values and labels               N     Percent
#     1     'Blue'                   3   25.0  15.0
#     2     'Red'                    4   33.3  20.0
#     3     'Green'                  5   41.7  25.0
#    77 M   'Don't know/Not sure'    1          5.0
#    88 M   'Refused'                4         20.0
#    99 M   'Unknown'                3         15.0
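If adding a package is not an option, one possible base R workaround is to keep the reason for missingness in a companion column before recoding to NA. This is only a sketch under that assumption and is not the method used above; `split_missing()` is a hypothetical helper, and the vector `a` is made-up example data.

```r
# Hypothetical helper: recode the listed codes to NA and return the
# reason for missingness alongside the cleaned values.
split_missing <- function(x, missing_codes) {
  is_miss <- x %in% missing_codes
  # Keep the original code as text where the value counts as missing.
  reason <- ifelse(is_miss, as.character(x), NA_character_)
  x[is_miss] <- NA
  if (is.factor(x)) x <- droplevels(x)  # drop levels that are now unused
  data.frame(value = x, reason = reason, stringsAsFactors = FALSE)
}

a <- c("Blue", "Refused", "Red", "Unknown", "Green", "Refused")
res <- split_missing(a, c("Don't know/Not sure", "Unknown", "Refused"))

print(table(res$value, useNA = "always"))  # missing codes lumped into <NA>
print(table(res$reason))                   # reasons still distinguishable
```

The trade-off is that the two columns have to be kept in sync by hand, which is the bookkeeping that memisc's items handle for you.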
However, I also suggest you read the comment from @Maxim.K about what really constitutes a missing value.