Conditionally selecting columns in dplyr where a certain proportion of values is NA

Views: 19 | Published: 2021/12/23 16:04:26 | Tags: r, filter, dataframe, dplyr, na

Problem description

I'm working with a data set resembling the data.frame generated below:

set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5,3] <- NA
dta[2:10,4] <- NA
dta[7:20,5] <- NA

Some columns contain NA values; in the last column, valueD, more than 60% of the observations are NA (14 of 20).

> sapply(dta, function(x) {table(is.na(x))})
$observation

FALSE 
   20 

$valueA

FALSE 
   20 

$valueB

FALSE  TRUE 
   16     4 

$valueC

FALSE  TRUE 
   11     9 

$valueD

FALSE  TRUE 
    6    14

Problem

I would like to drop this column inside a dplyr pipeline, ideally by passing a condition to select().

This is easy to do in base R. For example, to select the columns with fewer than 50% NAs, I can do:

dta[, colSums(is.na(dta)) < nrow(dta) / 2]

which yields:

> head(dta[, colSums(is.na(dta)) < nrow(dta) / 2], 2)
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA
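
As a rough sketch, the base R one-liner above can be wrapped in a small helper so that the threshold becomes a parameter. The name keep_cols_below and its max_na argument are hypothetical, introduced here only for illustration; they are not part of base R or dplyr.

```r
# Hypothetical helper: keep the columns whose share of NA values is below `max_na`.
keep_cols_below <- function(df, max_na = 0.5) {
  # drop = FALSE keeps the result a data.frame even if a single column survives
  df[, colSums(is.na(df)) < nrow(df) * max_na, drop = FALSE]
}

# Same example data as in the question
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

names(keep_cols_below(dta))        # observation, valueA, valueB, valueC
names(keep_cols_below(dta, 0.25))  # observation, valueA, valueB
```

The comparison is strict, so a column whose NA share equals the threshold exactly is dropped.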

Task

I'd like to achieve the same flexibility in a dplyr pipeline:

Vectorize(require)(package = c("dplyr",         # Data manipulation
                               "magrittr"),     # Compound assignment pipe (%<>%)
                   character.only = TRUE)

dta %<>%
  # Some transformations I'm doing on the data
  mutate_each(funs(as.numeric)) %>% 
  # I want my select to take place here

Recommended answer

Maybe something like this?

dta %>% select(which(colMeans(is.na(.)) < 0.5)) %>% head
#  observation    valueA    valueB    valueC
#1           1 0.2655087 0.9347052 0.8209463
#2           2 0.3721239        NA        NA
#3           3 0.5728534        NA        NA
#4           4 0.9082078        NA        NA
#5           5 0.2016819        NA        NA
#6           6 0.8983897 0.3861141        NA

Update: this now uses colMeans instead of colSums, so you no longer need to divide by the number of rows.
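
To see why the two forms agree, here is a small check on the question's data. It relies only on base R; the variable name na_share is made up for this sketch.

```r
# Same example data as in the question
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# is.na() gives a logical matrix; the column mean of TRUE/FALSE is the NA share
na_share <- colMeans(is.na(dta))
round(na_share, 2)
# valueB 0.20, valueC 0.45, valueD 0.70; the other columns are 0

# Same numbers as counting NAs and dividing by the number of rows
all.equal(na_share, colSums(is.na(dta)) / nrow(dta))  # TRUE
```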

And, just for the record, in base R you could also use colMeans:

dta[,colMeans(is.na(dta)) < 0.5]
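
For readers on a newer dplyr, a possible alternative (not part of the original answer) is the where() selection helper, which tests each column with a predicate function. This sketch assumes dplyr 1.0.0 or later, where where() is available; the variable name kept is made up for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, which provides where()

# Same example data as in the question
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# Keep every column whose proportion of NA values is below 0.5
kept <- dta %>%
  select(where(function(x) mean(is.na(x)) < 0.5))

names(kept)  # observation, valueA, valueB, valueC
```

Because the predicate is evaluated column by column, this form does not depend on the `.` placeholder of the magrittr pipe.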
