Custom function: allow unknown number of groups for operations


Problem description

Within a custom function, how can I avoid repeating the same code for each group while allowing an unknown number of groups?

Here's a simpler example but assume the function has tons of operations, like calculating different statistics for each group and sticking them on each ggplot facet. Sorry, I find it difficult to make a simpler function to demonstrate this specific challenge.

test.function <- function(variable, group, data) {
  if(!require(dplyr)){install.packages("dplyr")}
  if(!require(ggplot2)){install.packages("ggplot2")}
  if(!require(ggrepel)){install.packages("ggrepel")}
  library(dplyr)
  library(ggplot2)
  require(ggrepel)
  data$variable <- data[,variable]
  data$group <- factor(data[,group])

  # Compute individual group stats
  data %>%
    filter(data$group==levels(data$group)[1]) %>%
    select(variable) %>%
    unlist %>%
    shapiro.test() -> shap
  shapiro.1 <- round(shap$p.value,3)
  data %>%
    filter(data$group==levels(data$group)[2]) %>%
    select(variable) %>%
    unlist %>%
    shapiro.test() -> shap
  shapiro.2 <- round(shap$p.value,3)
  data %>%
    filter(data$group==levels(data$group)[3]) %>%
    select(variable) %>%
    unlist %>%
    shapiro.test() -> shap
  shapiro.3 <- round(shap$p.value,3)

  # Make the stats dataframe for ggplot
  dat_text <- data.frame(
    group = levels(data$group),
    text = c(shapiro.1, shapiro.2, shapiro.3))

  # Make the plot
  ggplot(data, aes(x=variable, fill=group)) +
    geom_density() +
    facet_grid(group ~ .) +
    geom_text_repel(data = dat_text,
                    mapping = aes(x = Inf, 
                                  y = Inf, 
                                  label = text))
}

Works if there are three groups

test.function("mpg", "cyl", mtcars)

Doesn't work if there are two groups

test.function("mpg", "vs", mtcars)

 Error in shapiro.test(.) : sample size must be between 3 and 5000 

Doesn't work if there are more than three groups

test <- mtcars %>% mutate(new = rep(1:4, 8))
test.function("mpg", "new", test)

 Error in data.frame(group = levels(data$group), text = c(shapiro.1, shapiro.2,  : 
  arguments imply differing number of rows: 4, 3 

What is the trick programmers usually use to accommodate any number of groups in such functions?

Solution

I was asked in the comments to explain the thinking here, so I thought I would expand on the original answer, which appears at the end of this post, after the explanation.

The main question is how to do some operation on an unknown number of groups. There are lots of different ways to do that. In any of the ways, you need the function to be able to identify the number of groups and adapt to that number. For example, you could do something like the code below. There, I identify the unique groups in the data, initialize the required result and then loop over all of the groups. I didn't use this strategy because the for loop feels a bit clunky compared to the dplyr code.

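# Unique, non-missing group labels (works for any number of groups)
un_group <- na.omit(unique(data[[group]]))
dat_text <- data.frame(group = un_group, 
                     text = NA_character_)
# seq_along() is safe even if there are zero groups
for(i in seq_along(un_group)){
  tmp <- data[which(data[[group]] == un_group[i]), ]
  dat_text$text[i] <- as.character(round(shapiro.test(tmp[[variable]])$p.value, 3))
}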

The other thing to keep in mind is what's going to scale well. You mentioned that you've got lots of operations the code will ultimately do. In the function below, I just had summarise return a single number per group. However, you could write a little function that produces a data frame, and then summarise can return several results at once. For example, consider:

myfun <- function(x){
  s = shapiro.test(x)
  data.frame(p = s$p.value, stat=s$statistic, 
             mean = mean(x, na.rm=TRUE), 
             sd = sd(x, na.rm=TRUE), 
             skew = DescTools::Skew(x, na.rm=TRUE), 
             kurtosis = DescTools::Kurt(x, na.rm=TRUE))
  
}
mtcars %>% group_by(cyl) %>% summarise(myfun(mpg))
# # A tibble: 3 x 7
#     cyl     p  stat  mean    sd   skew kurtosis
# * <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
# 1     4 0.261 0.912  26.7  4.51  0.259   -1.65 
# 2     6 0.325 0.899  19.7  1.45 -0.158   -1.91 
# 3     8 0.323 0.932  15.1  2.56 -0.363   -0.566

In the function above, I had the function return a data frame with several different variables. A single call to summarise returns all of those results for the variable for each group. This would certainly have been possible using a for loop or something like sapply(), but I like how the dplyr code reads a bit better. And, depending on how many groups you have, the dplyr code scales a bit better than some of the base R stuff.
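For comparison, the base R route mentioned above might look roughly like the sketch below. It uses split() and lapply() instead of group_by() and summarise(). The helper name group_stats() and the columns it returns are assumptions made for illustration; they are not part of the original answer.

```r
# group_stats() is a hypothetical helper, not part of the original answer.
group_stats <- function(data, variable, group) {
  # split() returns one vector per group, however many groups there are
  pieces <- split(data[[variable]], data[[group]])
  rows <- lapply(pieces, function(x) {
    s <- shapiro.test(x)
    data.frame(n = length(x),
               p = round(s$p.value, 3),
               mean = mean(x, na.rm = TRUE))
  })
  out <- do.call(rbind, rows)
  # Put the group labels in a column named after the grouping variable
  out <- cbind(setNames(data.frame(names(pieces)), group), out)
  rownames(out) <- NULL
  out
}

# Three groups
res3 <- group_stats(mtcars, "mpg", "cyl")

# Four groups, mirroring the failing case from the question
mtcars4 <- transform(mtcars, new = rep(1:4, 8))
res4 <- group_stats(mtcars4, "mpg", "new")
```

Nothing in the helper depends on the number of groups: each result has one row per level of the grouping variable.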

I really like trying to reflect the inputs (i.e., input variable names) in the outputs, so I wanted to find a way to get around making variables called group and variable in the data. The aes_string() specification is one way of doing that, and building a formula from the variable names is another. I recently encountered the reformulate() function, which is a more robust way of building formulae than the combination of paste() and as.formula() I was using before.
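As a small illustration of the two formula-building approaches just described, the sketch below builds the same faceting formula both ways. The variable names f_paste, f_reform and f_two are chosen only for this example.

```r
group <- "cyl"

# Older approach: paste the pieces into a string, then convert it
f_paste <- as.formula(paste(group, "~ ."))

# reformulate(): right-hand-side terms first, response named separately
f_reform <- reformulate(".", response = group)

# Several right-hand-side terms are joined with "+" automatically
f_two <- reformulate(c("am", "gear"), response = group)

print(f_paste)
print(f_reform)
print(f_two)
```

Both f_paste and f_reform can be passed to facet_grid(); the reformulate() version avoids hand-assembling the formula string.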

Those were the things I was thinking about when I was answering the question.


test.function <- function(variable, group, data) {
  if(!require(dplyr)){install.packages("dplyr")}
  if(!require(ggplot2)){install.packages("ggplot2")}
  if(!require(ggrepel)){install.packages("ggrepel")}
  library(dplyr)
  library(ggplot2)
  require(ggrepel)

  # Compute individual group stats
  
  data[[group]] <- as.factor(data[[group]])
  
  dat_text <- data %>% group_by(.data[[group]]) %>% 
    summarise(text=shapiro.test(.data[[variable]])$p.value) %>% 
    mutate(text=as.character(round(text, 3)))
  
  gform <- reformulate(".", response=group)
  # Make the plot
  ggplot(data, aes_string(x=variable, fill=group)) +
    geom_density() +
    facet_grid(gform) +
    geom_text_repel(data = dat_text,
                    mapping = aes(x = Inf, 
                                  y = Inf, 
                                  label = text))
}
test.function("mpg", "vs", mtcars)

test.function("mpg", "cyl", mtcars)
