Conditionally selecting columns in dplyr where certain proportion of values is NA
Question
I'm working with a data set resembling the data.frame generated below:
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA
The columns contain NA values; in the last column, more than 60% of the observations are NA.
> sapply(dta, function(x) {table(is.na(x))})
$observation
FALSE
20
$valueA
FALSE
20
$valueB
FALSE TRUE
16 4
$valueC
FALSE TRUE
11 9
$valueD
FALSE TRUE
6 14
Problem
I would like to be able to remove this column within a dplyr pipeline, somehow passing the condition to select.
This is easy to do in base R. For example, to select the columns with less than 50% NAs, I can do:
dta[, colSums(is.na(dta)) < nrow(dta) / 2]
which produces:
> head(dta[, colSums(is.na(dta)) < nrow(dta) / 2], 2)
observation valueA valueB valueC
1 1 0.2655087 0.9347052 0.8209463
2 2 0.3721239 NA NA
Task
I'm interested in achieving the same flexibility in a dplyr pipeline:
Vectorize(require)(package = c("dplyr", # Data manipulation
"magrittr"), # Reverse pipe
char = TRUE)
dta %<>%
# Some transformations I'm doing on the data
mutate_each(funs(as.numeric)) %>%
# I want my select to take place here
Recommended answer
Like this, perhaps?
dta %>% select(which(colMeans(is.na(.)) < 0.5)) %>% head
# observation valueA valueB valueC
#1 1 0.2655087 0.9347052 0.8209463
#2 2 0.3721239 NA NA
#3 3 0.5728534 NA NA
#4 4 0.9082078 NA NA
#5 5 0.2016819 NA NA
#6 6 0.8983897 0.3861141 NA
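As a possible alternative to indexing with which(), more recent dplyr releases (1.0.0 or later is assumed here) accept a predicate through the selection helper where(), so the data frame does not have to be referenced as `.` inside select. This is only a sketch on the question's data; the name kept is introduced for illustration.

```r
library(dplyr)

# Rebuild the question's data
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# Keep every column whose share of NAs is below 50%
# (assumes dplyr >= 1.0.0, where where() is usable inside select())
kept <- dta %>%
  select(where(function(x) mean(is.na(x)) < 0.5))

names(kept)
```

On this data valueD is the only column at or above the threshold (14 of 20 values are NA), so it should be the one dropped.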
Updated to use colMeans instead of colSums, which means you no longer need to divide by the number of rows.
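The reason colMeans works here is that is.na() returns a logical matrix, and the mean of logical values is the proportion of TRUE. A quick check on a small made-up data frame (df and both result names are illustrative, not from the question):

```r
# Small illustrative data frame (not the question's data)
df <- data.frame(x = c(1, NA, 3, NA),
                 y = c(NA, NA, NA, 4),
                 z = 1:4)

# Share of NAs per column, computed two ways
na_share_sums  <- colSums(is.na(df)) / nrow(df)
na_share_means <- colMeans(is.na(df))

# TRUE when both approaches agree column by column
isTRUE(all.equal(na_share_sums, na_share_means))
```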
And, just for the record, in base R you could also use colMeans:
dta[,colMeans(is.na(dta)) < 0.5]
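One thing to watch with the base R form: if only a single column passes the threshold, `[` collapses the result to a plain vector unless drop = FALSE is supplied. A minimal sketch on a made-up data frame (df, keep, as_vector and as_frame are illustrative names):

```r
# Illustrative data: only z has fewer than 50% NAs
df <- data.frame(x = c(1, NA, NA, NA),
                 y = c(NA, NA, NA, 4),
                 z = 1:4)

keep <- colMeans(is.na(df)) < 0.5

as_vector <- df[, keep]                # collapses to an integer vector
as_frame  <- df[, keep, drop = FALSE]  # stays a one-column data.frame

is.data.frame(as_vector)
is.data.frame(as_frame)
```

The select-based version in dplyr does not have this issue, since select always returns a data frame.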