Group by and then count missing variables?


Question

My data looks something like this:

df1 <- data.frame(
  Z = sample(LETTERS[1:5], size = 10000, replace = TRUE),
  X1 = sample(c(1:10, NA), 10000, replace = TRUE),
  X2 = sample(c(1:25, NA), 10000, replace = TRUE),
  X3 = sample(c(1:5, NA), 10000, replace = TRUE)
)

I can count the missing values in each column with:

data.frame("Total Missing" = colSums(is.na(df1))) 

But I would like to do this by Z, that is, count the missing values in X1 to X3 for each value of Z.

I tried this:

df1 %>% group_by(Z) %>% summarise('Total Missing' = colSums(is.na(df1)))

but it does not work.

Recommended answer

You can use summarise_each:

library(dplyr)

# For each value of Z, count the NA values in every other column
df1 %>% 
  group_by(Z) %>% 
  summarise_each(funs(sum(is.na(.))))
#Source: local data frame [5 x 4]
#
#       Z    X1    X2    X3
#  (fctr) (int) (int) (int)
#1      A   169    77   334
#2      B   170    77   316
#3      C   159    78   348
#4      D   181    79   326
#5      E   174    69   341
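
In more recent dplyr releases, summarise_each() and funs() are deprecated, and across() is the suggested replacement. The following is a minimal sketch of the same grouped count written with across(); the seed and the result name na_by_group are assumptions added for illustration and are not part of the original answer.

```r
library(dplyr)

set.seed(42)  # assumption: seed added only to make the sketch reproducible
df1 <- data.frame(
  Z  = sample(LETTERS[1:5], size = 10000, replace = TRUE),
  X1 = sample(c(1:10, NA), 10000, replace = TRUE),
  X2 = sample(c(1:25, NA), 10000, replace = TRUE),
  X3 = sample(c(1:5, NA), 10000, replace = TRUE)
)

# Count NA values in every non-grouping column, one row per value of Z
na_by_group <- df1 %>%
  group_by(Z) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

print(na_by_group)
```

Because across() never selects grouping columns, everything() here covers X1, X2 and X3 only.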

Note that inside summarise_each you can specify which columns to apply the function to (the default is all columns except the grouping columns) or which columns it should not be applied to. It may also be useful to know that, just as summarise_each complements summarise, mutate_each complements mutate if you want to apply functions to all columns without summarising the result.
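
As a hedged illustration of column selection and of the mutate variant, the sketch below uses across(), the current dplyr replacement for the *_each functions. The data frame toy and the result names are hypothetical and exist only for this example.

```r
library(dplyr)

# Small hypothetical data frame, used only for illustration
toy <- data.frame(
  Z  = c("A", "A", "B", "B"),
  X1 = c(1, NA, 3, NA),
  X2 = c(NA, NA, 5, 6),
  X3 = c(7, 8, 9, NA)
)

# Apply the NA count only to X1 and X2
only_x1_x2 <- toy %>%
  group_by(Z) %>%
  summarise(across(c(X1, X2), ~ sum(is.na(.x))))

# Apply the NA count to every non-grouping column except X3
all_but_x3 <- toy %>%
  group_by(Z) %>%
  summarise(across(-X3, ~ sum(is.na(.x))))

# mutate() keeps all four rows and replaces each value
# with the NA count of its column within the group
flagged <- toy %>%
  group_by(Z) %>%
  mutate(across(everything(), ~ sum(is.na(.x)))) %>%
  ungroup()
```

The summarise() calls return one row per group, while the mutate() call returns the same number of rows as the input.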

The obligatory data.table equivalent is:

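library(data.table)
# Note: setDT() converts df1 to a data.table by reference; use as.data.table(df1)
# instead if df1 should stay a plain data.frame for the base R snippets.
setDT(df1)[, lapply(.SD, function(x) sum(is.na(x))), by = Z]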
#   Z  X1 X2  X3
#1: D 181 79 326
#2: C 159 78 348
#3: B 170 77 316
#4: A 169 77 334
#5: E 174 69 341

And in base R you could use a split/apply/combine approach like the following:

do.call(rbind,
        lapply(
          split(df1, df1$Z), function(dd) {
            colSums(is.na(dd[-1]))
          }))
#   X1 X2  X3
#A 169 77 334
#B 170 77 316
#C 159 78 348
#D 181 79 326
#E 174 69 341

Or, also in base R, you can use aggregate:

aggregate(df1[-1], list(df1$Z), FUN = function(y) sum(is.na(y))) 
aggregate(. ~ Z, df1, FUN = function(y) sum(is.na(y)), na.action = "na.pass") # formula interface
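
The two aggregate calls differ in one detail worth spelling out: the formula interface drops rows containing NA by default (na.action = na.omit), which is why na.pass is needed when the goal is to count them. The following small sketch shows both forms on a hypothetical data frame toy; the helper count_na and the result names are assumptions introduced for illustration.

```r
# Small hypothetical data frame, used only for illustration
toy <- data.frame(
  Z  = c("A", "A", "B", "B"),
  X1 = c(1, NA, 3, NA),
  X2 = c(NA, NA, 5, 6)
)

# Hypothetical helper: number of NA values in a vector
count_na <- function(y) sum(is.na(y))

# Data frame interface: naming the list element keeps the group column called Z
# (an unnamed list would produce a column called Group.1)
by_list <- aggregate(toy[-1], list(Z = toy$Z), FUN = count_na)

# Formula interface: na.pass stops rows with NA from being dropped before counting
by_formula <- aggregate(. ~ Z, toy, FUN = count_na, na.action = na.pass)

print(by_list)
print(by_formula)
```

Both calls should report the same per-group counts; they differ only in how the grouping column is specified.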
