Median imputation using sapply


Question

I want to replace the missing values in the numeric columns of a data frame with the column median. I have written the following code:

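MedianImpute <- function(data) {
  # Loop over the columns; only numeric/integer columns are imputed.
  for (i in seq_len(ncol(data))) {
    if (is.numeric(data[, i])) {
      # Replace the NAs with the column median, if the column has any.
      if (anyNA(data[, i])) {
        data[is.na(data[, i]), i] <- median(data[, i], na.rm = TRUE)
      }
    }
  }
  return(data)
}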

This returns the data frame with the NAs replaced by the column median. I do not want to use a for loop. How can I get the same result using one of the apply functions in R?

Recommended answer

This is actually a subtle problem, so worth a bit of discussion (IMO). You have a data frame and want to impute medians for numeric columns only, with the result being, of course, a data frame.

The apply(...) function will coerce its argument to a matrix first. Since all elements in a matrix must by definition be the same data type, if there are any character or factor columns in the original df, the whole matrix will be coerced to character when it is passed to apply(...). As a result, is.numeric(x) is never TRUE inside the function and nothing gets imputed:

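# 1st column of df is a factor
# (stringsAsFactors=TRUE is needed for that in R >= 4.0, where the default is FALSE)
df <- data.frame(a=letters[1:5], x=sample(1:5,5), y=runif(5), stringsAsFactors=TRUE)
df$x[3] <- NA
df$y[5] <- NA
df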
#   a  x         y
# 1 a  5 0.5235779
# 2 b  3 0.2142011
# 3 c NA 0.8886608
# 4 d  4 0.4952574
# 5 e  1        NA

apply(df,2,function(x) {
  if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x})
#      a   x    y          
# [1,] "a" " 5" "0.5235779"
# [2,] "b" " 3" "0.2142011"
# [3,] "c" NA   "0.8886608"
# [4,] "d" " 4" "0.4952574"
# [5,] "e" " 1" NA         
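
To see why the is.numeric(x) branch never fires under apply(...), it can help to look at the matrix that apply(...) actually works on. A minimal sketch with made-up data (toy and m are hypothetical names, not from the original post):

```r
# Illustrative data, not from the original post.
toy <- data.frame(a = letters[1:3], x = c(1, NA, 3))

# apply() starts by converting the data frame to a matrix, roughly like this:
m <- as.matrix(toy)

typeof(m)             # "character": one non-numeric column forces the whole matrix
is.numeric(m[, "x"])  # FALSE, so the imputation branch is skipped
is.na(m[2, "x"])      # TRUE: the missing value is still missing
```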

sapply(df,FUN=f) will pass the columns of df individually to a function f(...), but the result will be a matrix. So, for example, any factors in df will be coerced to their integer codes.

sapply(df,function(x) {
  if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x})
#      a   x         y
# [1,] 1 5.0 0.5235779
# [2,] 2 3.0 0.2142011
# [3,] 3 3.5 0.8886608
# [4,] 4 4.0 0.4952574
# [5,] 5 1.0 0.5094176

So here, df$x and df$y are correct, but look at what happened to df$a: the factor was coerced to numeric, with each value replaced by its underlying integer code rather than its label. Not what you want!
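
The difference between a factor's labels and its integer codes can be checked directly. A small sketch with made-up data (a, toy and m are hypothetical names), using an identity function so that only sapply's simplification step changes anything:

```r
# Illustrative data, not from the original post.
a <- factor(c("b", "a", "c"))
levels(a)      # the labels, sorted: "a" "b" "c"
as.integer(a)  # the underlying codes: 2 1 3

toy <- data.frame(a = a, x = c(1.5, NA, 2.5))
m <- sapply(toy, function(col) col)  # identity: columns come back unchanged

is.numeric(m)  # TRUE: simplifying to a matrix dropped the factor class
m[, "a"]       # 2 1 3, the codes, not the labels
```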

lapply(df,FUN=f) will return a list, which can then be converted back to a data frame. This approach gives you the desired result:

# (the output below comes from a fresh random df, so its values differ from those above)
data.frame(lapply(df,function(x) {
    if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x}))
#   a   x         y
# 1 a 1.0 0.3093707
# 2 b 3.0 0.3486391
# 3 c 3.5 0.8292446
# 4 d 5.0 0.7882574
# 5 e 4.0 0.5684483
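
For reuse, the lapply idea can be wrapped in a function that replaces the loop from the question. This is only a sketch under a few assumptions: impute_median is a hypothetical name, the input is a plain data.frame, and the example values are fixed by hand rather than sampled. Assigning into data[] keeps the data frame's class and column names, so no data.frame(...) call is needed afterwards.

```r
# Hypothetical helper, not part of the original answer.
impute_median <- function(data) {
  stopifnot(is.data.frame(data))
  # data[] <- ... replaces the columns but keeps the data frame structure.
  data[] <- lapply(data, function(col) {
    if (is.numeric(col) && anyNA(col)) {
      # An integer column becomes double if its median is not a whole number.
      col[is.na(col)] <- median(col, na.rm = TRUE)
    }
    col  # non-numeric columns (factor, character, ...) pass through untouched
  })
  data
}

# Illustrative data with fixed values instead of sample()/runif().
df <- data.frame(a = factor(letters[1:5]),
                 x = c(5L, 3L, NA, 4L, 1L),
                 y = c(0.5, 0.2, 0.9, 0.4, NA))
imputed <- impute_median(df)
str(imputed)
```

Indexing with col[is.na(col)] changes only the missing entries, whereas ifelse(...) rebuilds the whole vector; both should give the same values for plain numeric columns.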

I suppose it's debatable whether this is any better than using a loop...
