Group by and then count missing variables?
Question
My data looks something like this:
df1 <- data.frame(
  Z  = sample(LETTERS[1:5], size = 10000, replace = TRUE),
  X1 = sample(c(1:10, NA), 10000, replace = TRUE),
  X2 = sample(c(1:25, NA), 10000, replace = TRUE),
  X3 = sample(c(1:5, NA), 10000, replace = TRUE)
)
I can count the missing values in each column with:
data.frame("Total Missing" = colSums(is.na(df1)))
But I would like to do this by Z, that is, count the number of missing X1 to X3 values for each value of Z.
I tried the following, but it does not work:
df1 %>% group_by(Z) %>% summarise('Total Missing' = colSums(is.na(df1)))
Recommended answer
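You can use summarise_each (deprecated in current dplyr releases, but it is the call that produced the output shown here):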
library(dplyr)

df1 %>%
  group_by(Z) %>%
  summarise_each(funs(sum(is.na(.))))
#Source: local data frame [5 x 4]
#
# Z X1 X2 X3
# (fctr) (int) (int) (int)
#1 A 169 77 334
#2 B 170 77 316
#3 C 159 78 348
#4 D 181 79 326
#5 E 174 69 341
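With dplyr 1.0.0 or later, the same per-group counts can be sketched with across() in place of summarise_each() and funs(). The seed and the name missing_by_group are assumptions added so the example is reproducible; they are not part of the original answer, and the counts will differ from the output above.

```r
library(dplyr)

set.seed(42)  # assumed seed, only to make the sample reproducible
df1 <- data.frame(
  Z  = sample(LETTERS[1:5], size = 10000, replace = TRUE),
  X1 = sample(c(1:10, NA), 10000, replace = TRUE),
  X2 = sample(c(1:25, NA), 10000, replace = TRUE),
  X3 = sample(c(1:5, NA), 10000, replace = TRUE)
)

# Inside a grouped summarise(), across(everything()) skips the grouping
# column Z, so only X1, X2 and X3 are counted.
missing_by_group <- df1 %>%
  group_by(Z) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

missing_by_group
```

The original attempt failed because colSums(is.na(df1)) looks at the whole data frame rather than the current group; the lambda above only ever sees one column of one group at a time.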
Note that inside summarise_each you can specify which columns the function is applied to (by default, all columns except the grouping columns) or which columns it should not be applied to. Just as summarise_each complements summarise, mutate_each complements mutate: use it to apply a function to all columns without collapsing the result to one row per group.
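As a rough illustration of column selection and of the mutate variant, here is a sketch that uses across() from dplyr 1.0.0 or later rather than the *_each functions. The seed and the names subset_counts and per_row_counts are assumptions made for this example.

```r
library(dplyr)

set.seed(42)  # assumed seed, only to make the sample reproducible
df1 <- data.frame(
  Z  = sample(LETTERS[1:5], size = 10000, replace = TRUE),
  X1 = sample(c(1:10, NA), 10000, replace = TRUE),
  X2 = sample(c(1:25, NA), 10000, replace = TRUE),
  X3 = sample(c(1:5, NA), 10000, replace = TRUE)
)

# Count missing values only in X1 and X2; X3 is left out of the result.
subset_counts <- df1 %>%
  group_by(Z) %>%
  summarise(across(c(X1, X2), ~ sum(is.na(.x))))

# mutate() keeps all 10000 rows and overwrites each column with its
# per-group missing count instead of collapsing to one row per group.
per_row_counts <- df1 %>%
  group_by(Z) %>%
  mutate(across(everything(), ~ sum(is.na(.x)))) %>%
  ungroup()
```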
The obligatory data.table equivalent is:
library(data.table)
setDT(df1)[, lapply(.SD, function(x) sum(is.na(x))), by = Z]
# Z X1 X2 X3
#1: D 181 79 326
#2: C 159 78 348
#3: B 170 77 316
#4: A 169 77 334
#5: E 174 69 341
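One thing to keep in mind is that setDT() converts df1 to a data.table in place, and on a data.table a single index such as dd[-1] selects rows rather than columns. The sketch below works on a copy made with as.data.table() so that df1 stays a plain data.frame, and it uses .SDcols to restrict the count to chosen columns. The seed and the names dt1 and na_counts are assumptions made for this example.

```r
library(data.table)

set.seed(42)  # assumed seed, only to make the sample reproducible
df1 <- data.frame(
  Z  = sample(LETTERS[1:5], size = 10000, replace = TRUE),
  X1 = sample(c(1:10, NA), 10000, replace = TRUE),
  X2 = sample(c(1:25, NA), 10000, replace = TRUE),
  X3 = sample(c(1:5, NA), 10000, replace = TRUE)
)

# as.data.table() makes a copy, so df1 itself remains a data.frame.
dt1 <- as.data.table(df1)

# .SDcols limits .SD to the listed columns; here X3 is excluded.
na_counts <- dt1[, lapply(.SD, function(x) sum(is.na(x))),
                 by = Z, .SDcols = c("X1", "X2")]

na_counts
```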
And in base R you could use a split/apply/combine approach like the following:
do.call(rbind,
        lapply(split(df1, df1$Z), function(dd) {
          colSums(is.na(dd[-1]))  # drop Z, then count NAs per column
        }))
# X1 X2 X3
#A 169 77 334
#B 170 77 316
#C 159 78 348
#D 181 79 326
#E 174 69 341
Or, also in base R, you can use aggregate:
aggregate(df1[-1], list(df1$Z), FUN = function(y) sum(is.na(y)))
aggregate(. ~ Z, df1, FUN = function(y) sum(is.na(y)), na.action = "na.pass") # formula interface
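Another compact base R option, not mentioned in the original answer, is rowsum(), which adds up the rows of a numeric matrix within each group. The sketch below compares it with the split/apply/combine result as a sanity check; the seed and the names by_rowsum and by_split are assumptions made for this example, and df1 is assumed to be a plain data.frame.

```r
set.seed(42)  # assumed seed, only to make the sample reproducible
df1 <- data.frame(
  Z  = sample(LETTERS[1:5], size = 10000, replace = TRUE),
  X1 = sample(c(1:10, NA), 10000, replace = TRUE),
  X2 = sample(c(1:25, NA), 10000, replace = TRUE),
  X3 = sample(c(1:5, NA), 10000, replace = TRUE)
)

# is.na() yields a logical matrix; unary + turns it into 0/1 integers
# that rowsum() can add up within each value of Z.
by_rowsum <- rowsum(+is.na(df1[-1]), group = df1$Z)

# Split/apply/combine result for comparison.
by_split <- do.call(rbind, lapply(split(df1, df1$Z), function(dd) {
  colSums(is.na(dd[-1]))
}))

# Both are 5 x 3 matrices with one row per value of Z.
all(by_rowsum == by_split)
```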