dplyr summarise with a list of functions and dependence on another data column


Problem description

Sorry for the terrible title, but it's hard to explain. I have the following data and functions I want to summarize the data with:

library(tidyverse)

# generate data
df <- map(1:4, ~ runif(100)) %>% 
  set_names(c(paste0('V', 1:3), 'threshold')) %>% 
  as_tibble() %>% 
  mutate(group = sample(c('a', 'b'), 100, replace = TRUE))

# generate function list
fun_factory_params <- 1:10
fun_factory <- function(param){
  function(v, threshold){
    sum((v * (threshold >= 1/2))^param)
  }
}
fun_list <- map(fun_factory_params, fun_factory)

df %>% head(n = 5)
      V1     V2     V3 threshold group
   <dbl>  <dbl>  <dbl>     <dbl> <chr>
1 0.631  0.0209 0.0360     0.713 b    
2 0.629  0.674  0.174      0.693 b    
3 0.144  0.358  0.439      0.395 a    
4 0.0695 0.760  0.657      0.810 a    
5 0.545  0.770  0.719      0.388 b    

I want to group df by the group variable and summarize V1, V2 and V3 in the following way: for each V of those variables and each value n in fun_factory_params (1 to 10), I want to compute sum((V * (threshold >= 1/2))^n). To have results computed for each n in an elegant way, I created a function list fun_list through a function factory.
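To make the target quantity concrete, here is a minimal sketch on a tiny hand-made example. The vectors v and threshold below are made-up illustration values, not taken from df, and base R's lapply stands in for purrr::map so the snippet needs no packages. The force(param) call is an addition of this sketch: it is a common safeguard in function factories against lazy evaluation.

```r
# Function factory: returns a function computing sum((v * (threshold >= 1/2))^param)
fun_factory <- function(param) {
  force(param)  # evaluate param now, so each closure keeps its own value
  function(v, threshold) {
    sum((v * (threshold >= 1/2))^param)
  }
}

# One function per exponent (only 1 to 3 here, to keep the output short)
fun_list <- lapply(1:3, fun_factory)

# Made-up illustration data: only the 2nd and 3rd entries pass the threshold
v <- c(0.2, 0.5, 0.8)
threshold <- c(0.4, 0.6, 0.9)

# Apply every function in the list to the same inputs
results <- vapply(fun_list, function(f) f(v, threshold), numeric(1))
print(results)  # 0.5^n + 0.8^n for n = 1, 2, 3
```

Each function in the list needs both the column being summarised and the matching threshold values, which is why threshold has to be supplied per group in the grouped summary.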

I tried the following and got the error:

df %>% 
  group_by(group) %>% 
  summarise_at(vars(V1,V2,V3), fun_list, threshold = threshold)

Error in list2(...) : object 'threshold' not found

My issue comes from the threshold variable. I can't find a way to use the function list I built and tell R that the threshold argument has to be taken from each data group. I tried moving the threshold variable to a parameter of the function factory and building the function list inside summarise_at through a purrr::map call, but I get the same issue. Essentially, every manipulation I try somehow makes R leave the right environment for evaluating threshold by group. Using .$threshold returns the threshold variable for the entire data, so that does not work either.

However, the fact that the following code works (but only for a given value of n at a time) makes me think that there is a way to evaluate threshold correctly.

n <- 1
df %>% 
  group_by(group) %>% 
  summarise_at(vars(V1,V2,V3), ~ sum((. * (threshold >= 1/2))^n))

Any ideas?

Solution

I found a way to have threshold evaluated in the right environment (the grouped data) when it is passed as an additional argument to the summarise_at functions: you need to quote threshold with quo.

df %>% 
  group_by(group) %>% 
  summarise_at(vars(V1,V2,V3), fun_list, threshold = quo(threshold))

I'm not 100% sure of my understanding. I think that quoting delays the evaluation of threshold: the quosure stores the name threshold unevaluated, and dplyr only evaluates it later, inside the summarise step, where the columns of each group are available (what we want). Without quoting, R tried to evaluate threshold right away, while collecting the extra arguments of summarise_at (hence the error raised in list2(...)), in an environment where no such variable exists. General information about programming with dplyr can be found in the package's programming vignette.
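The effect of quoting can be seen in isolation with rlang, outside of dplyr. The sketch below is only an illustration of the general mechanism, not of dplyr's internals: group_a and group_b are hypothetical stand-ins for the per-group slices, and eval_tidy plays the role of whatever evaluates the quosure against the data.

```r
library(rlang)

# Capture the expression `threshold` without evaluating it
q <- quo(threshold)

# Hypothetical per-group slices, standing in for what is visible within each group
group_a <- data.frame(threshold = c(0.2, 0.7))
group_b <- data.frame(threshold = 0.9)

# The same quosure yields different values depending on the data it is evaluated against
res_a <- eval_tidy(q, data = group_a)
res_b <- eval_tidy(q, data = group_b)
print(res_a)
print(res_b)

# The quosure itself only stores an expression (plus an environment)
print(quo_get_expr(q))
```

Creating the quosure never looks up threshold, so nothing fails at that point; the lookup happens only when the quosure is evaluated together with some data.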

Please let me know if there is still something wrong with this solution or if it is not robust.
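In case the quo approach turns out to be fragile (the scoped verbs such as summarise_at are superseded in recent dplyr releases), one possible alternative is to avoid extra arguments altogether. The sketch below is an assumption-laden variant, not the method above: it reshapes the data to long format with tidyr::pivot_longer so that threshold stays next to each value, then summarises once per exponent. The seed, the names long, powers and result, and the long output layout are all choices made for this illustration.

```r
library(dplyr)
library(tidyr)

# Same shape as the data in the question, with a seed added for reproducibility
set.seed(1)
df <- tibble(
  V1 = runif(100), V2 = runif(100), V3 = runif(100),
  threshold = runif(100),
  group = sample(c("a", "b"), 100, replace = TRUE)
)

powers <- 1:10

# One row per (observation, variable), so threshold stays next to each value
long <- df %>%
  pivot_longer(c(V1, V2, V3), names_to = "variable", values_to = "v")

# Summarise once per exponent, then stack the results
result <- lapply(powers, function(p) {
  long %>%
    group_by(group, variable) %>%
    summarise(power = p, value = sum((v * (threshold >= 1/2))^p), .groups = "drop")
}) %>%
  bind_rows()

print(head(result))
```

The result has one row per group, variable and exponent; tidyr::pivot_wider can turn it back into one column per variable and exponent if that layout is preferred.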
