Succinct way to summarize different columns with different functions


Problem description

My question builds on a similar one by imposing an additional constraint that the name of each variable should appear only once.

Consider a data frame

library( tidyverse )
df <- tibble( potentially_long_name_i_dont_want_to_type_twice = 1:10,
              another_annoyingly_long_name = 21:30 )

I would like to apply mean to the first column and sum to the second column, without unnecessarily typing each column name twice.

As the question I linked above shows, summarize allows you to do this, but requires that the name of each column appears twice. On the other hand, summarize_at allows you to succinctly apply multiple functions to multiple columns, but it does so by calling all specified functions on all specified columns, instead of doing it in a one-to-one fashion. Is there a way to combine these distinct features of summarize and summarize_at?
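To make the contrast concrete, here is a small sketch of both behaviours on the data frame above. The object names `one_to_one` and `all_on_all` are only for illustration.

```r
library(dplyr)

df <- tibble(potentially_long_name_i_dont_want_to_type_twice = 1:10,
             another_annoyingly_long_name = 21:30)

# summarize(): one function per column, but every column name is typed twice
one_to_one <- df %>%
  summarize(potentially_long_name_i_dont_want_to_type_twice =
              mean(potentially_long_name_i_dont_want_to_type_twice),
            another_annoyingly_long_name =
              sum(another_annoyingly_long_name))

# summarize_at(): each name is typed once, but every function is applied
# to every selected column
all_on_all <- df %>%
  summarize_at(vars(potentially_long_name_i_dont_want_to_type_twice,
                    another_annoyingly_long_name),
               list(mean = mean, sum = sum))

ncol(one_to_one)  # 2 columns
ncol(all_on_all)  # 4 columns: 2 columns x 2 functions
```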

I was able to hack it with rlang, but I'm not sure if it's any cleaner than just typing each variable twice:

v <- c("potentially_long_name_i_dont_want_to_type_twice",
       "another_annoyingly_long_name")
f <- list(mean,sum)

## Desired output
smrz <- set_names(v) %>% map(sym) %>% map2( f, ~rlang::call2(.y,.x) )
df %>% summarize( !!!smrz )
# # A tibble: 1 x 2
#   potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                             <dbl>                        <int>
# 1                                             5.5                          255
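The hack above pairs names and functions by position. A possible variant, sketched here under the assumption that a single named list is acceptable, keys each function by its column name so the pairing does not depend on ordering. The names `fns`, `smrz_named` and `res` are hypothetical.

```r
library(dplyr)
library(purrr)
library(rlang)

df <- tibble(potentially_long_name_i_dont_want_to_type_twice = 1:10,
             another_annoyingly_long_name = 21:30)

# Each function is keyed by the column it should be applied to
fns <- list(potentially_long_name_i_dont_want_to_type_twice = mean,
            another_annoyingly_long_name = sum)

# imap() passes the function as .x and its name as .y;
# call2() builds the unevaluated call f(column)
smrz_named <- imap(fns, ~ call2(.x, sym(.y)))

# The list keeps the names of fns, so splicing it yields one column per entry
res <- df %>% summarize(!!!smrz_named)
```

Each column name still appears only once, and the function sits right next to it.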

EDIT to address some philosophical points

I don’t think that wanting to avoid the x=f(x) idiom is unreasonable. I probably came across a bit overzealous about typing long names, but the real issue is actually having (relatively) long names that are very similar to each other. Examples include nucleotide sequences (e.g., AGCCAGCGGAAACAGTAAGG) and TCGA barcodes. Not only is autocomplete of limited utility in such cases, but writing things like AGCCAGCGGAAACAGTAAGG = sum( AGCCAGCGGAAACAGTAAGG ) introduces unnecessary coupling and increases the risk that the two sides of the assignment might accidentally go out of sync as the code is developed and maintained.

I completely agree with @MrFlick about dplyr increasing code readability, but I don’t think that readability should come at the cost of correctness. Functions like summarize_at and mutate_at are brilliant, because they strike a perfect balance between placing operations next to their operands (clarity) and guaranteeing that the result is written to the correct column (correctness).

By the same token, I feel that the proposed solutions which remove variable mention altogether swing too far in the other direction. While inherently clever -- and I certainly appreciate the extra typing they save -- I think that, by removing the association between functions and variable names, such solutions now rely on proper ordering of variables, which creates its own risks of accidental errors.

In short, I believe that a self-mutating / self-summarizing operation should mention each variable name exactly once.

Solution

I propose two tricks to solve this issue; the code and some details for both are at the bottom:

A function .at that returns results for groups of variables (here only one variable per group), which we can then splice into the call with !!!, so we get the best of both worlds, summarize and summarize_at:

df %>% summarize(
  !!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), mean),
  !!!.at(vars(another_annoyingly_long_name), sum))

# # A tibble: 1 x 2
#     potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                               <dbl>                        <dbl>
#   1                                             5.5                          255
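As a side note that is not part of the original answer: assuming dplyr 1.0.0 or later is available, `across()` seems to give the same one-to-one pairing without any helper. This is only a sketch of that alternative, with `res` as an illustrative name.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where across() was introduced

df <- tibble(potentially_long_name_i_dont_want_to_type_twice = 1:10,
             another_annoyingly_long_name = 21:30)

# One across() call per column: each name appears once, next to its function,
# and the result is written back under the same column name
res <- df %>%
  summarize(across(potentially_long_name_i_dont_want_to_type_twice, mean),
            across(another_annoyingly_long_name, sum))
```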

An adverb to summarize, with a dollar notation shorthand.

df %>%
  ..flx$summarize(potentially_long_name_i_dont_want_to_type_twice = ~mean(.),
                  another_annoyingly_long_name = ~sum(.))

# # A tibble: 1 x 2
#     potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                               <dbl>                        <int>
#   1                                             5.5                          255


code for .at

It has to be used inside a pipe because it looks up the . (the piped data) in the parent environment. Messy, but it works.

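.at <- function(.vars, .funs, ...) {
  # the check assumes the pipe binds the data to `.` in a frame holding nothing else
  in_a_piped_fun <- exists(".", parent.frame()) &&
    length(ls(envir = parent.frame(), all.names = TRUE)) == 1
  if (!in_a_piped_fun)
    stop(".at() must be called as an argument to a piped function")
  # fetch the data frame being piped
  .tbl <- try(eval.parent(quote(.)))
  # dplyr:::manip_at is unexported, so it may change between dplyr versions
  dplyr:::manip_at(
    .tbl, .vars, .funs, rlang::enquo(.funs), rlang::caller_env(),
    .include_group_vars = TRUE, ...)
}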

I designed it to combine summarize and summarize_at, so a single .at call can also apply several named functions:

df %>% summarize(
  !!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), list(foo=min, bar = max)),
  !!!.at(vars(another_annoyingly_long_name), median))

# # A tibble: 1 x 3
#       foo   bar another_annoyingly_long_name
#     <dbl> <dbl>                        <dbl>
#   1     1    10                         25.5


code for ..flx

..flx outputs a function that, before running, replaces each formula argument such as a = ~mean(.) with the call a = purrr::as_mapper(~mean(.))(a). This is convenient with summarize and mutate: a column cannot be a formula, so there can't be any conflict.
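The rewrite described above can be tried by hand on a single argument. This is only an illustrative sketch; the names `a`, `arg`, `rewritten` and `via_mapper` are made up for the example.

```r
library(purrr)
library(rlang)

a <- 1:10

# as_mapper() turns a one-sided formula into a function of `.`
via_mapper <- as_mapper(~ mean(.))(a)
identical(via_mapper, mean(a))  # TRUE

# The same expression built programmatically, the way ..flx does for each
# formula argument: the formula and the argument name are injected with !!
arg <- quote(~ mean(.))
rewritten <- expr(purrr::as_mapper(!!arg)(!!sym("a")))
eval(rewritten)
```

Inside `summarize`, the symbol `a` would refer to a column rather than a global variable, which is why the result lands back in the column of the same name.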

I like to use the dollar notation as a shorthand and to have names starting with .. so I can name those "tags" (and give them a class "tag") and see them as different objects (still experimenting with this). ..flx(summarize)(...) will work as well though.

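..flx <- function(fun){
  function(...){
    mc <- match.call()
    # keep only the wrapped function: `..flx$summarize` or `..flx(summarize)` becomes `summarize`
    mc[[1]] <- tail(mc[[1]],1)[[1]]
    # rewrite each formula argument `name = ~f(.)` as `purrr::as_mapper(~f(.))(name)`
    mc[] <- imap(mc,~if(is.call(.) && identical(.[[1]],quote(`~`))) {
      rlang::expr(purrr::as_mapper(!!.)(!!sym(.y)))
    } else .)
    # run the rewritten call where the original call was made
    eval.parent(mc)
  }
}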

class(..flx) <- "tag"

`$.tag` <- function(e1, e2){
  # change original call so x$y, which is `$.tag`(tag=x, data=y), becomes x(y)
  mc <- match.call()
  mc[[1]] <- mc[[2]]
  mc[[2]] <- NULL
  names(mc) <- NULL
  # evaluate it in parent env
  eval.parent(mc)
}
