Using dplyr summarize with different operations for multiple columns


Problem description


Well, I know that there are already tons of related questions, but none gave an answer to my particular need.

I want to use dplyr "summarize" on a table with 50 columns, and I need to apply different summary functions to these.

"summarize_all" and "summarize_at" both seem to have the disadvantage that they cannot apply different functions to different subgroups of variables.

As an example, let's assume the iris dataset had 50 columns, so we do not want to address columns by name. I want the sum over the first two columns, the mean over the third, and the first value for all remaining columns (after a group_by(Species)). How could I do this?

Solution

As others have mentioned, this is normally done by calling summarize_each / summarize_at / summarize_if once for each group of columns that should get the same summarizing function. As far as I know, you would have to write a custom function that summarizes each subset and combines the results. You can, for example, name the columns so that the select helpers (e.g. contains()) pick out just the columns you want to apply a function to. If not, you can give the specific column positions that you want to summarize.
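As a minimal sketch of one such call, assuming dplyr is attached (the scoped verbs are superseded in dplyr 1.0.0 and later but still available), summing only the Sepal columns per species could look like this. The name sepal_sums is only for illustration:

```r
library(dplyr)

# One scoped call handles one set of columns with one function.
# vars(contains("Sepal")) picks Sepal.Length and Sepal.Width.
sepal_sums <- iris %>%
  group_by(Species) %>%
  summarize_at(vars(contains("Sepal")), sum)

sepal_sums
```

Each further set of columns needs its own call like this one, which is why the pieces have to be combined afterwards.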

For the example you mentioned, you could try the following:

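library(dplyr)

# Summarize three sets of columns of a grouped table, each with its own
# function, and bind the three results side by side.
summarizer <- function(tb, colsone, colstwo, colsthree, 
                       funsone, funstwo, funsthree, group_name) {

  bind_cols(
    # The first piece keeps the grouping column
    summarize_all(select(tb, {{ colsone }}), .funs = funsone),
    # The other pieces drop it so it is not duplicated
    summarize_all(select(tb, {{ colstwo }}), .funs = funstwo) %>% 
      ungroup() %>% select(-matches(group_name)),
    summarize_all(select(tb, {{ colsthree }}), .funs = funsthree) %>% 
      ungroup() %>% select(-matches(group_name)) 
  )

}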

# With column names
iris %>% as_tibble() %>% 
  group_by(Species) %>% 
  summarizer(colsone = contains("Sepal"), 
             colstwo = matches("Petal.Length"), 
             colsthree = c(-contains("Sepal"), -matches("Petal.Length")),
             funsone = "sum", 
             funstwo = "mean",
             funsthree = "first",
             group_name = "Species")

# With column positions
iris %>% as_tibble() %>% 
  group_by(Species) %>% 
  summarizer(colsone = 1:2, 
             colstwo = 3, 
             colsthree = 4,
             funsone = "sum", 
             funstwo = "mean",
             funsthree = "first",
             group_name = "Species")
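
For readers on dplyr 1.0.0 or later, the same per-column-set logic can probably be sketched in a single summarise() call with across(), without a helper function. This is an alternative to the approach above, not part of the original answer. It assumes across() is available and relies on across() leaving the grouping column out of the selection, so the positions refer to the non-grouping columns. The name result is only for illustration:

```r
library(dplyr)

result <- iris %>%
  as_tibble() %>%
  group_by(Species) %>%
  summarise(
    across(1:2, sum),       # sum of the first two columns
    across(3, mean),        # mean of the third column
    across(-(1:3), first)   # first value of every remaining non-grouping column
  )

result
```

With a 50-column table, the last selection would cover columns 4 to 50 without naming any of them.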
