How to replace NAs with row means if proportion of row-wise NAs is below a certain threshold?

Problem description

Apologies for the somewhat cumbersome question, but I am currently working on a mental health study. One of the mental health screening tools has 15 variables, each of which can take a value of 0-3. The total score for each row/participant is the sum of these 15 variables. The documentation for this tool states that if more than 20% of the values for a particular row/participant are missing, the total score should also be treated as missing; however, if fewer than 20% of the values for a row are missing, each missing value should be assigned the mean of the remaining values for that row.

I decided that to do this I would have to calculate the proportion of NAs for each participant, calculate the mean of all 15 variables excluding NAs for each participant, and then use a conditional mutate statement (or something similar) that checks whether the proportion of NAs is less than 20% and, if so, replaces the NAs in the relevant columns with the mean value for that row, before finding the sum of all 15 variables for each row. The dataset also contains other columns besides these 15, so applying a function to all of the columns would not be useful.

To calculate the mean score without NAs I did the following:

mental$somatic_mean <- rowMeans(
  mental[, c("var1", "var2", "var3", "var4", "var5",
             "var6", "var7", "var8", "var9", "var10",
             "var11", "var12", "var13", "var14", "var15")],
  na.rm = TRUE
)

And to calculate the proportion of NAs for each participant:

mental$somatic_na <- rowMeans(is.na(
  mental[, c("var1", "var2", "var3", "var4", "var5",
             "var6", "var7", "var8", "var9", "var10",
             "var11", "var12", "var13", "var14", "var15")]
))

However, when I attempted the mutate() statement to alter the rows where fewer than 20% of values were NA, I could not find any code that works. I have tried a lot of permutations by this point, including the following for each variable:

mental_recode <- mental %>%
  rowwise() %>%
  mutate(var1 = if(somatic_na<0.2) 
  replace_na(list(var1= somatic_mean)))

which returns:

"no applicable method for 'replace_na' applied to an object of class "list""

and attempting to do them all together without using mutate():

mental %>%
  rowwise() %>%
  if(somatic_na<0.2)
                     replace_na(list(var1 = somatic_mean,   var2= 
somatic_mean,   var3 = somatic_mean,   var4 = somatic_mean,  var5 = 
somatic_mean,  var6 = somatic_mean,  var7 = somatic_mean, var8 = 
somatic_mean,  var9 = somatic_mean,  var10 = somatic_mean,   var11 = 
somatic_mean,  var12 = somatic_mean,   var13 = somatic_mean,  var14 = 
somatic_mean,  var15 = somatic_mean )) 

which returns:

Error in if (.) somatic_na < 0.2 else replace_na(mental, list(var1 = somatic_mean,  : 
  argument is not interpretable as logical
In addition: Warning message:
In if (.) somatic_na < 0.2 else replace_na(mental, list(var1 = somatic_mean,  :
  the condition has length > 1 and only the first element will be used

I also tried using if_else() in conjunction with mutate(), setting the value to NA if the condition was not met, but could not get that to work either, despite various permutations and error messages.

Dummy data can be generated with the following:

mental <- structure(list(id = 1:21, var1 = c(0L, 0L, 1L, 1L, 1L, 0L, 0L, 
                               NA, 0L, 0L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 
0L, 0L, 0L), var2 = c(0L, 
 0L, 1L, 1L, 1L, 0L, 0L, 2L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 
2L, 0L, 1L, 1L), var3 = c(0L, 0L, 0L, 1L, 1L, 0L, 1L, 2L, 1L, 
1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 2L, 0L, 1L, 1L), var4 = c(1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 
0L, 1L, 0L, 0L), var5 = c(0L, 0L, 0L, 1L, NA, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), var6 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), var7 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, NA, 0L), var8 = c(0L, 
0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), var9 = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), var10 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 
0L, 0L, NA, 0L), var11 = c(1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, NA, 0L), var12 = c(1L, 
0L, 1L, 1L, NA, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 1L, 1L), var13 = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 
0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, NA, 0L), var14 = c(1L, 
0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 
2L, 0L, 1L, 0L), var15 = c(1L, 0L, 2L, NA, NA, 0L, NA, 0L, 0L, 
0L, 0L, 0L, NA, NA, 0L, NA, NA, NA, NA, NA, 0L)), .Names = c("id", 
"var1", "var2", "var3", "var4", "var5", "var6", "var7", "var8", 
"var9", "var10", "var11", "var12", "var13", "var14", "var15"), class =                                 
"data.frame", row.names = c(NA, 
-21L))

Does anyone know of code that would work for this sort of situation?

Thanks in advance!

Recommended answer

Here is a way to do it all in one chain with dplyr, using your supplied data frame.

First create a vector of all column names of interest:

name_col <- colnames(mental)[2:16]
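Selecting by position works for the supplied data, but it silently picks the wrong columns if the column order ever changes. A minimal sketch of selecting by name pattern instead, assuming the item columns are the only ones named "var" followed by digits (the toy data frame below is hypothetical and only stands in for the real one):

```r
# Toy stand-in for the real data frame (hypothetical columns)
toy <- data.frame(id = 1:2, var1 = c(0, 1), var2 = c(1, NA), age = c(30, 41))

# Keep only the names that match "var" + digits, wherever they sit
name_col <- grep("^var[0-9]+$", names(toy), value = TRUE)
name_col
```

On the real data the same grep() call would be applied to names(mental).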

Now use dplyr:

library(dplyr)

mental %>%
  # First create the column of row means (NAs excluded)
  mutate(somatic_mean = rowMeans(.[name_col], na.rm = TRUE)) %>%
  # Now calculate the proportion of NAs in each row
  mutate(somatic_na = rowMeans(is.na(.[name_col]))) %>%
  # Create this column for filtering out later
  mutate(somatic_usable = ifelse(somatic_na < 0.2,
                                 "yes", "no")) %>%
  # Make the following replacement on a row basis
  rowwise() %>%
  # across() requires dplyr >= 1.0.0; it supersedes mutate_at() with funs()
  mutate(across(all_of(name_col), # Designate eligible columns to check for NAs
                ~ replace(.x,
                          is.na(.x) & somatic_na < 0.2, # Both conditions need to be met
                          somatic_mean))) %>% # What we are subbing the NAs with
  ungroup() # Now ungroup the 'rowwise' in case you need to modify further
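The chain above fills in the NAs but stops short of the total score the question asks for. As a rough sketch of the whole procedure in base R, the helper below imputes row means and then sums the items, leaving the total as NA for rows over the threshold. The function name score_items, its arguments, and the toy data are all hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): impute row means
# for rows under the NA threshold, then compute a total score.
score_items <- function(df, cols, max_na_prop = 0.2) {
  items    <- as.matrix(df[cols])
  na_prop  <- rowMeans(is.na(items))        # share of NAs per row
  row_mean <- rowMeans(items, na.rm = TRUE) # mean of the observed items
  usable   <- na_prop < max_na_prop

  # Cells to fill: NA cells in usable rows. `usable` has one value per row
  # and recycles down each column, so it lines up with the matrix rows.
  fill <- is.na(items) & usable
  items[fill] <- row_mean[row(items)[fill]]

  df[cols] <- as.data.frame(items)
  # Rows over the threshold get a missing total
  df$total <- ifelse(usable, rowSums(items), NA_real_)
  df
}

# Toy data with 6 items: row 2 has 1 of 6 missing, row 3 has 2 of 6 missing
toy <- data.frame(
  id = 1:3,
  q1 = c(1, 2, NA), q2 = c(1, NA, NA), q3 = c(1, 2, 3),
  q4 = c(1, 2, 3),  q5 = c(1, 2, 3),   q6 = c(1, 2, 3)
)
scored <- score_items(toy, paste0("q", 1:6))
scored$total
```

For the data in the question, the equivalent call would be score_items(mental, name_col). Working on a matrix avoids rowwise(), which can be slow on large data frames.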

Now, if you only want to keep the entries that have fewer than 20% NAs, you can pipe the above into the following:

filter(somatic_usable == "yes")

Also of note: if you instead want the condition to be less than or equal to 20%, you need to replace both occurrences of somatic_na < 0.2 with somatic_na <= 0.2.
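With 15 items the two conditions differ only for rows with exactly 3 NAs, since 3/15 is exactly 20%. If you would rather not rely on comparing a floating-point proportion with 0.2, one option is to compare NA counts using integer arithmetic. The sketch below is only an illustration; the helper name usable_row and its arguments are hypothetical.

```r
# Hypothetical helper: TRUE when the share of NAs in a row is below 20%
# (or at most 20% when inclusive = TRUE). Uses the equivalence
#   n_na / n_items < 1/5  <=>  5 * n_na < n_items
usable_row <- function(n_na, n_items, inclusive = FALSE) {
  if (inclusive) 5 * n_na <= n_items else 5 * n_na < n_items
}

strict    <- usable_row(0:4, 15)                   # TRUE TRUE TRUE FALSE FALSE
inclusive <- usable_row(0:4, 15, inclusive = TRUE) # TRUE TRUE TRUE TRUE FALSE
```

The per-row NA counts themselves can be obtained with rowSums(is.na(mental[name_col])).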

Hope this helps!
