Sub-function in grouping function using dplyr


Question

I'm using the dplyr package to count missing values per subgroup for each of my variables.

I used a mini-function:

NAobs <- function(x) length(x[is.na(x)])  # count the missing values in a vector

to count missing values. Because I have quite a few variables and wanted to add a bit more information (sample size per group and percentage of missing data per group), I wrote the following code and plugged in one variable (task_1) to check it.

library(dplyr)
group_by(DataRT, class) %>%
  summarise(class_size=length(class), missing = NAobs(task_1), perc.= missing/class_size)

This works very well and I get a table like this:

   class class_size missing      perc.
   (dbl)      (int)   (int)      (dbl)
1      1         25       2 0.08000000
2      2         25       1 0.04000000
3      3         25       3 0.12000000
4      4         25       4 0.16000000
5      5         24       3 0.12500000
6      6         29       6 0.20689655
...

In the next step, I wanted to generalize my command by wrapping it in a function:

missing<-function(x, print=TRUE){
            group_by(DataRT, class) %>%
                    summarise(class_size=length(class), 
                        missing = NAobs(x),
                        perc.= missing/class_size)}

Ideally, I could now write missing(task_1) and get the same table, but instead NAobs(x) ignores the grouping variable (class) and I get a table like this:

   class class_size missing    perc.
   (dbl)      (int)   (int)    (dbl)
1      1         25      59 2.360000
2      2         25      59 2.360000
3      3         25      59 2.360000
4      4         25      59 2.360000
5      5         24      59 2.458333
6      6         29      59 2.034483
...

What happens is that the column "missing" shows only the total number of NA cases for task_1, ignoring the groups. Replacing NAobs(x) with NAobs(variable name) would fix this, but it defeats the purpose of writing a function in the first place. How can I calculate the number of missing cases per group without copying the code and changing the variable name each time? Thank you!

Recommended answer

New dplyr update. The newest dplyr will be able to solve this with two new tools: the function enquo() and the operator !!. The first quotes the input like substitute() would, the second unquotes it for evaluation. For more on programming with dplyr, see the "Programming with dplyr" vignette.

You will need the development version of dplyr, and I also recommend installing the latest rlang (https://github.com/hadley/rlang).
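To see the two pieces in isolation, here is a minimal sketch, assuming a dplyr version that ships enquo() and !!. The data frame df and the helper capture() are made up for illustration and are not part of the original question.

```r
library(dplyr)  # dplyr re-exports enquo() from rlang and understands !!

# Toy data, invented for this illustration
df <- data.frame(g = c("a", "a", "a", "b"), task1 = c(NA, NA, 1, 1))

# capture() is a hypothetical helper: it returns its argument unevaluated
capture <- function(x) enquo(x)

q <- capture(task1)
label <- rlang::as_label(q)  # text form of the captured expression

# Unquoting with !! puts the captured column name back into the call,
# so is.na() is applied to each group's slice of task1
per_group <- df %>%
  group_by(g) %>%
  summarise(missing = sum(is.na(!!q)))
```

Because the column name is only resolved inside summarise(), the count is computed group by group instead of once for the whole vector.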

#install developer's version until new release in May
library(dplyr) #0.5.0.9004+

#Setup
set.seed(143)
NAobs <- function(x) length(x[is.na(x)])
DataRT <- data.frame(class = sample(1:6, 25, TRUE), task1 = sample(c(NA,1), 25, TRUE),
                     task2 = sample(c(NA,1), 25, TRUE))
f <- function(x) {
  my_var <- enquo(x)
  group_by(DataRT, class) %>%
    summarise(class_size=length(class), 
    missing = NAobs(!!my_var),
    perc.= missing/class_size)
}
f(task1)
# # A tibble: 6 × 4
#   class class_size missing     perc.
#   <int>      <int>   <int>     <dbl>
# 1     1          5       0 0.0000000
# 2     2          4       2 0.5000000
# 3     3          3       0 0.0000000
# 4     4          1       0 0.0000000
# 5     5          5       3 0.6000000
# 6     6          7       3 0.4285714
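As a possible follow-up, later dplyr releases (1.0.0 and newer, an assumption about your installed version) offer the embrace operator {{ }} as shorthand for the enquo() plus !! pair, and across() for summarising several columns at once. The sketch below uses made-up data and a hypothetical function name, na_by_group(); it is one way to generalise f(), not the answer's original code.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for {{ }} and across()

# Toy data with the same column layout as the answer; values are invented
DataRT <- data.frame(
  class = c(1, 1, 1, 2, 2, 3),
  task1 = c(NA, 1, NA, 1, 1, NA),
  task2 = c(1, NA, 1, 1, NA, NA)
)

# Hypothetical generalisation of f(): the data frame and the grouping
# column are arguments too, and {{ }} replaces enquo() followed by !!
na_by_group <- function(data, group, x) {
  data %>%
    group_by({{ group }}) %>%
    summarise(
      class_size = n(),
      missing    = sum(is.na({{ x }})),
      perc.      = missing / class_size
    )
}

res <- na_by_group(DataRT, class, task1)

# Count missing values in every column whose name starts with "task"
res_all <- DataRT %>%
  group_by(class) %>%
  summarise(
    class_size = n(),
    across(starts_with("task"), ~ sum(is.na(.x)), .names = "{.col}_missing")
  )
```

The across() variant avoids calling the function once per variable, which may be convenient when there are many task columns.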
