Custom function: allow unknown number of groups for operations
Problem description
Within a custom function, how can I avoid repeating the same code for each group while allowing an unknown number of groups?
Here's a simpler example but assume the function has tons of operations, like calculating different statistics for each group and sticking them on each ggplot facet. Sorry, I find it difficult to make a simpler function to demonstrate this specific challenge.
test.function <- function(variable, group, data) {
  if(!require(dplyr)){install.packages("dplyr")}
  if(!require(ggplot2)){install.packages("ggplot2")}
  if(!require(ggrepel)){install.packages("ggrepel")}
  library(dplyr)
  library(ggplot2)
  require(ggrepel)
  data$variable <- data[,variable]
  data$group <- factor(data[,group])
  # Compute individual group stats
  data %>%
    filter(data$group==levels(data$group)[1]) %>%
    select(variable) %>%
    unlist %>%
    shapiro.test() -> shap
  shapiro.1 <- round(shap$p.value,3)
  data %>%
    filter(data$group==levels(data$group)[2]) %>%
    select(variable) %>%
    unlist %>%
    shapiro.test() -> shap
  shapiro.2 <- round(shap$p.value,3)
  data %>%
    filter(data$group==levels(data$group)[3]) %>%
    select(variable) %>%
    unlist %>%
    shapiro.test() -> shap
  shapiro.3 <- round(shap$p.value,3)
  # Make the stats dataframe for ggplot
  dat_text <- data.frame(
    group = levels(data$group),
    text = c(shapiro.1, shapiro.2, shapiro.3))
  # Make the plot
  ggplot(data, aes(x=variable, fill=group)) +
    geom_density() +
    facet_grid(group ~ .) +
    geom_text_repel(data = dat_text,
                    mapping = aes(x = Inf,
                                  y = Inf,
                                  label = text))
}
Works if there are three groups:
test.function("mpg", "cyl", mtcars)
Doesn't work if there are two groups:
test.function("mpg", "vs", mtcars)
Error in shapiro.test(.) : sample size must be between 3 and 5000
Doesn't work if there are more than three groups:
test <- mtcars %>% mutate(new = rep(1:4, 8))
test.function("mpg", "new", test)
Error in data.frame(group = levels(data$group), text = c(shapiro.1, shapiro.2, :
arguments imply differing number of rows: 4, 3
What is the trick programmers usually use to accommodate any number of groups in such functions?
I was asked in the comments to explain the thinking here, so I have expanded on the original answer, which appears at the end of this post.
The main question is how to do some operation on an unknown number of groups. There are lots of different ways to do that. Whichever way you choose, the function needs to identify the number of groups and adapt to that number. For example, you could do something like the code below. There, I identify the unique groups in the data, initialize the required result and then loop over all of the groups. I didn't use this strategy because the for loop feels a bit clunky compared to the dplyr code.
un_group <- na.omit(unique(data[[group]]))
dat_text <- data.frame(group = un_group,
                       text = NA_character_)
for(i in seq_along(un_group)){
  # rows belonging to the i-th group
  tmp <- data[which(data[[group]] == un_group[i]), ]
  dat_text$text[i] <- as.character(round(shapiro.test(tmp[[variable]])$p.value, 3))
}
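A loop-free base R variant of the same idea is sketched below. It splits the variable by group with split() and applies the test once per group with vapply(), so the number of groups never appears in the code. The helper name shapiro_p_by_group() and the choice to return NA for groups outside the 3 to 5000 observations that shapiro.test() accepts are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper: Shapiro-Wilk p-value per group, for any number of groups.
shapiro_p_by_group <- function(variable, group, data) {
  # split() returns one vector of values per observed group
  pieces <- split(data[[variable]], data[[group]], drop = TRUE)
  p <- vapply(pieces, function(x) {
    x <- x[!is.na(x)]
    # Assumption: report NA instead of failing when the group is too small or too large
    if (length(x) < 3 || length(x) > 5000) return(NA_real_)
    shapiro.test(x)$p.value
  }, numeric(1))
  data.frame(group = names(p), text = as.character(round(p, 3)),
             row.names = NULL, stringsAsFactors = FALSE)
}

res_vs <- shapiro_p_by_group("mpg", "vs", mtcars)      # two groups
mtcars4 <- transform(mtcars, new = rep(1:4, 8))
res_new <- shapiro_p_by_group("mpg", "new", mtcars4)   # four groups
```

The result has one row per group, so it can be passed to the text layer of the plot regardless of how many groups the data contains.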
The other thing to keep in mind is what's going to scale well. You mentioned that the code will ultimately do lots of operations. In the original answer below, I just had summarise return a single number. However, you could write a little function that produces a data frame, and then summarise can return several results at once. For example, consider:
myfun <- function(x){
  s <- shapiro.test(x)
  data.frame(p = s$p.value, stat = s$statistic,
             mean = mean(x, na.rm=TRUE),
             sd = sd(x, na.rm=TRUE),
             skew = DescTools::Skew(x, na.rm=TRUE),
             kurtosis = DescTools::Kurt(x, na.rm=TRUE))
}
mtcars %>% group_by(cyl) %>% summarise(myfun(mpg))
# # A tibble: 3 x 7
#     cyl     p  stat  mean    sd   skew kurtosis
# * <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
# 1     4 0.261 0.912  26.7  4.51  0.259   -1.65
# 2     6 0.325 0.899  19.7  1.45 -0.158   -1.91
# 3     8 0.323 0.932  15.1  2.56 -0.363   -0.566
In the code above, the function returns a data frame with several different variables. A single call to summarise returns all of those results for the variable for each group. This would certainly have been possible using a for loop or something like sapply(), but I think the dplyr code reads a bit better. And, depending on how many groups you have, the dplyr code scales a bit better than some of the base R stuff.
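For comparison, here is a minimal base R sketch of the same pattern: a function that returns a one-row data frame, applied to each group and then stacked. The helper name group_stats() is hypothetical, and it leaves out the DescTools skew and kurtosis columns so that the sketch runs with base R alone.

```r
# Hypothetical helper returning several statistics as a one-row data frame.
# Base R only; the DescTools columns from the text are omitted here.
group_stats <- function(x) {
  s <- shapiro.test(x)
  data.frame(p = s$p.value, stat = unname(s$statistic),
             mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}

# One data frame per group, whatever the number of groups
by_cyl <- lapply(split(mtcars$mpg, mtcars$cyl), group_stats)

# Stack the rows and keep the group label as a regular column
res <- do.call(rbind, by_cyl)
res <- data.frame(cyl = names(by_cyl), res, row.names = NULL)
print(res)
```

The structure is the same as the summarise call: the per-group function decides which columns exist, and the grouping step decides how many rows there are.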
I really like to reflect the inputs (i.e., the input variable names) in the outputs, so I wanted to find a way to avoid creating variables called group and variable in the data. The aes_string() specification is one way of doing that, and building a formula from the variable names is another. I recently encountered the reformulate() function, which is a more robust way of building formulae than the combination of paste() and as.formula() I was using before.
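As a small illustration of the two ways of building the faceting formula from a column name held in a string, consider the sketch below. The value "cyl" is just an example input.

```r
group <- "cyl"   # example column name supplied as a string

# reformulate() assembles the formula from its parts
f_reformulate <- reformulate(".", response = group)

# The older approach: paste the text together, then convert it
f_pasted <- as.formula(paste(group, "~ ."))

print(f_reformulate)
print(f_pasted)
```

Both objects are formulas with the group on the left-hand side, which is the form facet_grid() uses to lay the groups out in rows.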
Those were the things I was thinking about when I was answering the question.
test.function <- function(variable, group, data) {
  if(!require(dplyr)){install.packages("dplyr")}
  if(!require(ggplot2)){install.packages("ggplot2")}
  if(!require(ggrepel)){install.packages("ggrepel")}
  library(dplyr)
  library(ggplot2)
  require(ggrepel)
  # Compute individual group stats
  data[[group]] <- as.factor(data[[group]])
  dat_text <- data %>% group_by(.data[[group]]) %>%
    summarise(text=shapiro.test(.data[[variable]])$p.value) %>%
    mutate(text=as.character(round(text, 3)))
  gform <- reformulate(".", response=group)
  # Make the plot
  ggplot(data, aes_string(x=variable, fill=group)) +
    geom_density() +
    facet_grid(gform) +
    geom_text_repel(data = dat_text,
                    mapping = aes(x = Inf,
                                  y = Inf,
                                  label = text))
}
test.function("mpg", "vs", mtcars)
test.function("mpg", "cyl", mtcars)
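Since aes_string() has been deprecated in more recent ggplot2 releases in favour of the .data pronoun, a possible variant of the plotting step is sketched below. This is not the original answer's code: the function name plot_by_group() is hypothetical, the per-group statistics are computed with base R to keep the sketch short, and plain geom_text() stands in for geom_text_repel() so that only ggplot2 is required.

```r
library(ggplot2)

# Hypothetical variant of the plotting step using the .data pronoun.
plot_by_group <- function(variable, group, data) {
  data[[group]] <- as.factor(data[[group]])

  # Per-group p-values; the label column is named after `group`
  # so that each label lands in the matching facet
  p <- vapply(split(data[[variable]], data[[group]]),
              function(x) shapiro.test(x)$p.value, numeric(1))
  dat_text <- data.frame(names(p), as.character(round(p, 3)))
  names(dat_text) <- c(group, "text")
  dat_text[[group]] <- factor(dat_text[[group]], levels = levels(data[[group]]))

  ggplot(data, aes(x = .data[[variable]], fill = .data[[group]])) +
    geom_density() +
    facet_grid(reformulate(".", response = group)) +
    # Assumption: fixed corner placement instead of ggrepel
    geom_text(data = dat_text,
              mapping = aes(x = Inf, y = Inf, label = text),
              hjust = 1.1, vjust = 1.5, inherit.aes = FALSE) +
    labs(x = variable, fill = group)
}

p_vs <- plot_by_group("mpg", "vs", mtcars)     # two groups
p_cyl <- plot_by_group("mpg", "cyl", mtcars)   # three groups
```

As in the answer above, nothing in the function depends on how many levels the grouping column has.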