Dplyr summarise with list of functions and dependence on other data column
Sorry for the terrible title, but it's hard to explain. I have the following data and functions I want to summarize the data with:
library(tidyverse)
# generate data
df <- map(1:4, ~ runif(100)) %>%
  set_names(c(paste0('V', 1:3), 'threshold')) %>%
  as_tibble() %>%
  mutate(group = sample(c('a', 'b'), 100, replace = TRUE))

# generate function list
fun_factory_params <- 1:10
fun_factory <- function(param) {
  function(v, threshold) {
    sum((v * (threshold >= 1/2))^param)
  }
}
fun_list <- map(fun_factory_params, fun_factory)
df %>% head(n = 5)
      V1     V2     V3 threshold group
   <dbl>  <dbl>  <dbl>     <dbl> <chr>
1 0.631  0.0209 0.0360     0.713 b
2 0.629  0.674  0.174      0.693 b
3 0.144  0.358  0.439      0.395 a
4 0.0695 0.760  0.657      0.810 a
5 0.545  0.770  0.719      0.388 b
I want to group `df` by the `group` variable and summarize `V1`, `V2` and `V3` in the following way: for each `V` of those variables and each value `n` in `fun_factory_params` (1 to 10), I want to compute `sum((V * (threshold >= 1/2))^n)`. To have results computed for each `n` in an elegant way, I created a function list `fun_list` through a function factory.
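As a sanity check on the function factory described above, here is a minimal sketch that runs it on hand-checkable toy inputs. The vectors `v` and `th` and the name `toy_result` are made up for illustration, and `force(param)` is a defensive addition that is not in the original code.

```r
# Function factory: returns a function that sums the thresholded values
# raised to the power `param`.
fun_factory <- function(param) {
  force(param)  # defensive (not in the original): evaluate `param` right away
  function(v, threshold) {
    sum((v * (threshold >= 1/2))^param)
  }
}

fun_list <- lapply(1:3, fun_factory)

# Hypothetical toy inputs, chosen so the result is easy to verify by hand
v  <- c(1, 2, 3)
th <- c(0.2, 0.6, 0.9)

# Only the 2nd and 3rd elements pass the threshold, so each result is 2^n + 3^n
toy_result <- sapply(fun_list, function(f) f(v, th))
print(toy_result)  # 5 13 35
```

Each element of `fun_list` remembers its own exponent through its enclosing environment, which is why a single list can cover every value of `n`.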
I tried the following and got the error:
df %>%
  group_by(group) %>%
  summarise_at(vars(V1, V2, V3), fun_list, threshold = threshold)
Error in list2(...) : object 'threshold' not found
My issue comes from the `threshold` variable. I can't find a way to use the function list I built and tell R that the `threshold` argument has to be taken from each data group. I tried moving `threshold` into the parameters of the function factory and building the function list inside `summarise_at` through a `purrr::map` call, but I get the same issue. Essentially, my manipulations always somehow make R leave the right environment for evaluating `threshold` by group. Using `.$threshold` returns the threshold column for the entire data set, so that does not work.
However, the fact that the following code works (but only for a given value of n at a time) makes me think that there is a way to evaluate threshold correctly.
n <- 1
df %>%
  group_by(group) %>%
  summarise_at(vars(V1, V2, V3), ~ sum((. * (threshold >= 1/2))^n))
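The single-`n` computation above can also be extended to every `n` without a function list. The sketch below is an alternative that does not come from the original post: it reshapes the data to long format and adds `n` as a grouping column. It assumes `dplyr` >= 1.0 and `tidyr` >= 1.0; the seed and the names `df_long` and `long_result` are illustrative.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # arbitrary seed, only for reproducibility of this sketch
df <- tibble(
  V1 = runif(100), V2 = runif(100), V3 = runif(100),
  threshold = runif(100),
  group = sample(c("a", "b"), 100, replace = TRUE)
)

# One row per (observation, variable); `threshold` and `group` are repeated
df_long <- df %>%
  pivot_longer(c(V1, V2, V3), names_to = "variable", values_to = "v")

# Cartesian product with the exponents (base merge with by = NULL),
# then one summary value per group, variable and exponent
long_result <- merge(df_long, data.frame(n = 1:10), by = NULL) %>%
  group_by(group, variable, n) %>%
  summarise(value = sum((v * (threshold >= 1/2))^n), .groups = "drop")

head(long_result)
```

Because `threshold` is an ordinary column here, it is sliced by group along with everything else, which avoids the question of where an extra argument gets evaluated. The trade-off is a long result (one row per group, variable and `n`) instead of one wide row per group.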
Any ideas?
I found a way to have `threshold` evaluated in the right environment (the grouped data) when it is passed as an additional argument to the `summarise_at` functions: you need to quote `threshold` with `quo`.
df %>%
  group_by(group) %>%
  summarise_at(vars(V1, V2, V3), fun_list, threshold = quo(threshold))
I'm not 100% sure of my understanding. I think quoting makes sure that `threshold` is evaluated using the environment it was found in at the time `quo` was called, which is the grouped data (what we want). Essentially, quoting a variable makes it carry not only its name but also a reference to the environment in which we want it to be evaluated. Without quoting, the evaluation of `threshold` was attempted in a different environment (not sure which one...) where the variable does not exist. General information about programming in `dplyr` can be found here.
Please let me know if this solution is still wrong in some way or not robust.
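One way to check whether any dplyr-based solution is robust is to compare it with an independent computation. The sketch below uses only base R and is not part of the original answer. The seed and the names `params` and `by_group` are illustrative. It produces, for each group, a 10 x 3 matrix (rows are `n`, columns are `V1` to `V3`) that the `summarise_at` output can be compared against.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
df <- data.frame(
  V1 = runif(100), V2 = runif(100), V3 = runif(100),
  threshold = runif(100),
  group = sample(c("a", "b"), 100, replace = TRUE)
)

params <- 1:10

# Split the rows by group, then compute sum((V * (threshold >= 1/2))^n)
# for every column V and every exponent n, using that group's threshold only
by_group <- lapply(split(df, df$group), function(g) {
  keep <- g$threshold >= 1/2
  sapply(c("V1", "V2", "V3"), function(col) {
    sapply(params, function(n) sum((g[[col]] * keep)^n))
  })
})

by_group$a[1:3, ]  # first three exponents for group "a"
```

If the grouped `summarise_at` call were silently using the whole `threshold` column instead of the per-group slice, its numbers would disagree with these matrices (or the call would fail on mismatched lengths).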