How do I handle multiple kinds of missingness in R?


Question

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:

0-99 Data

-1 Question not asked

-5 Do not know

-7 Refused to respond

-9 Module not asked

Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing entries however specified, but you can sort out the various kinds of missingness later on as well. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does question not asked.

I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.

Solution

I know what you are looking for, and it is not implemented in R. I don't know of a package that implements it, but it's not too difficult to code it yourself.

A workable way is to attach a data frame containing the codes as an attribute. To avoid doubling the whole data frame and to save space, I'd store the indices of the recoded cells in that data frame rather than reconstructing a complete data frame.

For example:

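NACode <- function(x,code){
    # Replace every value listed in 'code' by NA, column by column.
    # Note: sapply() returns a matrix, so all columns share one type.
    Df <- sapply(x,function(i){
        i[i %in% unlist(code)] <- NA
        i
    })

    # Linear (column-major) indices of all NA cells, including any
    # NAs that were already present in x.
    id <- which(is.na(Df))
    # Convert to row/column positions; subtracting 1 first keeps cells
    # in the last row from getting row 0 and the wrong column.
    rowid <- (id - 1L) %% nrow(x) + 1L
    colid <- (id - 1L) %/% nrow(x) + 1
    NAdf <- data.frame(
        id,rowid,colid,
        value = as.matrix(x)[id]
    )
    Df <- as.data.frame(Df)
    attr(Df,"NAcode") <- NAdf
    Df
}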

This allows you to do:

> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame':   10 obs. of  2 variables:
 $ A: num  1 2 3 4 5 6 7 8 9 10
 $ B: num  1 2 3 4 5 NA NA NA 9 10
 - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
  ..$ id   : int  16 17 18
  ..$ rowid: int  6 7 8
  ..$ colid: num  2 2 2
  ..$ value: num  -1 -2 -3

The function can also be adjusted to add an extra attribute that gives the label for each value (see also this question). You can transform back with:

ChangeNAToCode <- function(x,code){
    NAval <- attr(x,"NAcode")
    # Put the original value back into each stored cell whose code was requested.
    for(i in which(NAval$value %in% code))
        x[NAval$rowid[i],NAval$colid[i]] <- NAval$value[i]

    x
}

> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame':   10 obs. of  2 variables:
 $ A: num  1 2 3 4 5 6 7 8 9 10
 $ B: num  1 2 3 4 5 NA -2 -3 9 10
 - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
  ..$ id   : int  16 17 18
  ..$ rowid: int  6 7 8
  ..$ colid: num  2 2 2
  ..$ value: num  -1 -2 -3

This lets you restore only the codes you want, should that ever be necessary. The function can be adapted to restore all codes when no argument is given. Similar functions can be constructed to extract data based on the code; I guess you can figure that one out yourself.
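As a rough sketch of those two adaptations (not part of the original answer): the hypothetical RestoreCodes below restores every stored code when code is omitted, and the hypothetical WhichCode reports the cells recorded for a given code. Both assume a data frame carrying an "NAcode" attribute with rowid, colid and value columns, as NACode produces; the attribute is built by hand here so the block runs on its own.

```r
# Build a small data frame with an "NAcode" attribute by hand,
# mimicking the structure produced by NACode() above.
Df <- data.frame(A = as.numeric(1:10), B = c(1:5, NA, NA, NA, 9, 10))
attr(Df, "NAcode") <- data.frame(
    id    = 16:18,
    rowid = 6:8,
    colid = c(2, 2, 2),
    value = c(-1, -2, -3)
)

# Hypothetical helper: restore the given codes,
# or all stored codes if 'code' is omitted.
RestoreCodes <- function(x, code = NULL) {
    NAval <- attr(x, "NAcode")
    if (is.null(code)) code <- unique(NAval$value)
    for (i in which(NAval$value %in% code))
        x[NAval$rowid[i], NAval$colid[i]] <- NAval$value[i]
    x
}

# Hypothetical helper: row, column name and original value
# of every cell that held one of the given codes.
WhichCode <- function(x, code) {
    NAval <- attr(x, "NAcode")
    hit <- NAval[NAval$value %in% code, , drop = FALSE]
    data.frame(row = hit$rowid, column = names(x)[hit$colid], value = hit$value)
}

all_back <- RestoreCodes(Df)    # every stored code put back
refused  <- WhichCode(Df, -2)   # cells that were coded -2
```

Keeping the lookup in a separate function means the data frame itself can stay fully NA-coded for ordinary analysis, while the kind of missingness remains available when an imputation step needs it.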

In short: using attributes and indices might be a nice way of doing it.
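The label attribute mentioned earlier could look something like the following. This is only a minimal sketch, assuming the named list code from the example and an "NAcode" data frame with a value column; the AddLabels name and the label column are illustrative, not part of the original answer.

```r
# The codebook as a named list, as in the example above.
code <- list("Missing" = -1, "Not Answered" = -2, "Don't know" = -3)

# Stand-in for the "NAcode" attribute created by NACode().
NAdf <- data.frame(id = 16:18, rowid = 6:8, colid = c(2, 2, 2),
                   value = c(-1, -2, -3))

# Hypothetical helper: look up each stored value in the codebook
# and attach the matching name as a 'label' column.
AddLabels <- function(NAdf, code) {
    lookup <- unlist(code)    # named numeric vector of codes
    NAdf$label <- names(lookup)[match(NAdf$value, lookup)]
    NAdf
}

labelled <- AddLabels(NAdf, code)
table(labelled$label)    # counts per kind of missingness
```

With labels stored next to the indices, tabulating the kinds of missingness is a single table() call, which is close in spirit to what the question describes for Stata.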
