Conditionally selecting columns in dplyr where certain proportion of values is NA


Question

I'm working with a data set resembling the data.frame generated below:

set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5,3] <- NA
dta[2:10,4] <- NA
dta[7:20,5] <- NA

Some of the columns contain NA values, and in the last column more than 60% of the observations are NA.

> sapply(dta, function(x) {table(is.na(x))})
$observation

FALSE 
   20 

$valueA

FALSE 
   20 

$valueB

FALSE  TRUE 
   16     4 

$valueC

FALSE  TRUE 
   11     9 

$valueD

FALSE  TRUE 
    6    14 
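
As a more compact check of the same counts, the share of NAs per column can be computed directly. This is only a sketch that rebuilds the example data above; the name `na_share` is made up for illustration and does not appear in the original post.

```r
# Rebuild the example data from the question
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# is.na() returns a logical matrix; its column means are the NA proportions.
# `na_share` is a hypothetical name used only in this sketch.
na_share <- colMeans(is.na(dta))
na_share
# valueB: 0.20, valueC: 0.45, valueD: 0.70 (the other two columns are 0)
```

The proportions follow from the counts above: 4/20, 9/20 and 14/20 missing values out of 20 rows.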



Problem

I would like to be able to remove this column inside a dplyr pipeline, by somehow passing the condition to select.

This is easy to do in base R. For example, to select the columns with less than 50% NAs I can do:

dta[, colSums(is.na(dta)) < nrow(dta) / 2]

which produces:

> head(dta[, colSums(is.na(dta)) < nrow(dta) / 2], 2)
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA






Task

I'm interested in achieving the same flexibility in a dplyr pipeline:

library(dplyr)      # Data manipulation
library(magrittr)   # Compound assignment pipe (%<>%)

dta %<>%
  # Some transformations I'm doing on the data
  mutate_each(funs(as.numeric)) %>% 
  # I want my select to take place here


Recommended answer

Like this, perhaps?

dta %>% select(which(colMeans(is.na(.)) < 0.5)) %>% head
#  observation    valueA    valueB    valueC
#1           1 0.2655087 0.9347052 0.8209463
#2           2 0.3721239        NA        NA
#3           3 0.5728534        NA        NA
#4           4 0.9082078        NA        NA
#5           5 0.2016819        NA        NA
#6           6 0.8983897 0.3861141        NA

Updated to use colMeans instead of colSums, which means you no longer need to divide by the number of rows.

And, just for the record, in base R you could also use colMeans:

dta[,colMeans(is.na(dta)) < 0.5]
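
For readers on a newer dplyr, the same per-column test can also be written with the `where()` selection helper. This is a sketch and not part of the original answer: it assumes dplyr 1.0.0 or later is installed, and `max_na_share` and `kept` are hypothetical names chosen for the example.

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# Hypothetical threshold name: keep columns whose NA share is below this value
max_na_share <- 0.5

# where() applies the predicate to each column and keeps those returning TRUE
kept <- dta %>%
  select(where(~ mean(is.na(.x)) < max_na_share))

names(kept)
# "observation" "valueA" "valueB" "valueC"
```

Because the predicate is evaluated column by column, this form does not depend on the `.` placeholder and can sit anywhere in a longer pipeline.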
