dplyr summarise when function return is vector-valued?
Question
The dplyr::summarize() function can apply arbitrary functions over the data, but it seems that the function must return a scalar value. I'm curious whether there is a reasonable way to handle functions that return a vector value without making multiple calls to the function.
Here's a somewhat silly minimal example. Consider a function that returns multiple values, such as:
f <- function(x,y){
  coef(lm(x ~ y, data.frame(x=x,y=y)))
}
and data such as:
df <- data.frame(group=c('A','A','A','A','B','B','B','B','C','C','C','C'), x=rnorm(12,1,1), y=rnorm(12,1,1))
I'd like to do something like:
library(dplyr)
df %>%
  group_by(group) %>%
  summarise(f(x,y))
and get back a table with two added columns, one for each returned value, instead of the usual one column. Instead, this fails with the error: Expecting single value
Of course we can get multiple values from dplyr::summarise() by supplying the function multiple times:
f1 <- function(x,y) coef(lm(x ~ y, data.frame(x=x,y=y)))[[1]]
f2 <- function(x,y) coef(lm(x ~ y, data.frame(x=x,y=y)))[[2]]
df %>%
  group_by(group) %>%
  summarise(a = f1(x,y), b = f2(x,y))
This gives the desired output:
group a b
1 A 1.7957245 -0.339992915
2 B 0.5283379 -0.004325209
3 C 1.0797647 -0.074393457
but coding it this way is ridiculously crude and ugly, and it fits the model twice per group.
data.table handles this case more succinctly:
library(data.table)
dt <- as.data.table(df)
dt[, f(x,y), by="group"]
but it creates output that extends the table with additional rows instead of additional columns, which is both confusing and harder to work with:
group V1
1: A 1.795724536
2: A -0.339992915
3: B 0.528337890
4: B -0.004325209
5: C 1.079764710
6: C -0.074393457
Of course there are more classic apply strategies we could use here,
# unique() also works when group is a character column (the default in R >= 4.0)
sapply(unique(as.character(df$group)), function(g) coef(lm(x ~ y, df[df$group == g, ])))
A B C
(Intercept) 1.7957245 0.528337890 1.07976471
y -0.3399929 -0.004325209 -0.07439346
but this sacrifices both the elegance and, I suspect, the speed of the grouping. In particular, note that we do not use our pre-defined function f in this case, but hard-code the grouping into the function definition.
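As a side note, the hard-coding can be avoided in base R as well. The following is a minimal sketch, assuming the f defined above; the seed and the name m are illustrative choices, not part of the original question. It splits the data frame by group and calls f once per piece:

```r
# f as defined in the question: returns a named numeric vector of length 2
f <- function(x, y) {
  coef(lm(x ~ y, data.frame(x = x, y = y)))
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  x = rnorm(12, 1, 1),
  y = rnorm(12, 1, 1)
)

# split() yields one data frame per group; sapply() binds the
# length-2 results into a 2 x 3 matrix (coefficients x groups)
m <- sapply(split(df, df$group), function(d) f(d$x, d$y))

# transpose to get one row per group
t(m)
```

This keeps f reusable, although it still returns a matrix rather than a grouped table.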
Is there a dplyr function for handling this case? If not, is there a more elegant way to evaluate vector-valued functions over a data.frame by group?
Recommended answer
You can try do, which calls f once per group and accepts a data.frame as the result:
library(dplyr)
df %>%
  group_by(group) %>%
  do(setNames(data.frame(t(f(.$x, .$y))), letters[1:2]))
# group a b
#1 A 0.8983217 -0.04108092
#2 B 0.8945354 0.44905220
#3 C 1.2244023 -1.00715248
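A possible alternative, not part of the original answer: assuming dplyr 1.0.0 or later, summarise() unpacks an unnamed data-frame result into one column per variable, so f is still called only once per group. The seed, the name res, and the column names a and b below are illustrative choices.

```r
library(dplyr)

# f as defined in the question: returns a named numeric vector of length 2
f <- function(x, y) {
  coef(lm(x ~ y, data.frame(x = x, y = y)))
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  x = rnorm(12, 1, 1),
  y = rnorm(12, 1, 1)
)

res <- df %>%
  group_by(group) %>%
  # t() turns the length-2 vector into a 1 x 2 matrix, so the
  # data frame has one row and two columns per group
  summarise(setNames(as.data.frame(t(f(x, y))), c("a", "b")))

res
```

The expression inside summarise() is deliberately left unnamed; giving it a name would store the whole data frame in a single packed column instead of spreading it into a and b.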
For comparison, the result based on f1 and f2 is the same:
df %>%
  group_by(group) %>%
  summarise(a = f1(x,y), b = f2(x,y))
# group a b
#1 A 0.8983217 -0.04108092
#2 B 0.8945354 0.44905220
#3 C 1.2244023 -1.00715248
Update
If you are using data.table, an option to get a similar result is:
library(data.table)
setnames(setDT(df)[, as.list(f(x,y)) , group], 2:3, c('a', 'b'))[]
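The key step in the data.table version is as.list(): a plain vector returned from j is stacked into rows, while a list is spread into one column per element. A minimal sketch of the contrast, assuming the f from the question (the seed and the names long and wide are illustrative):

```r
library(data.table)

# f as defined in the question: returns a named numeric vector of length 2
f <- function(x, y) {
  coef(lm(x ~ y, data.frame(x = x, y = y)))
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
dt <- data.table(
  group = rep(c("A", "B", "C"), each = 4),
  x = rnorm(12, 1, 1),
  y = rnorm(12, 1, 1)
)

# Vector result: two rows per group in a single column V1
long <- dt[, f(x, y), by = group]

# List result: one row per group, one column per coefficient
wide <- dt[, as.list(f(x, y)), by = group]

# Rename the coefficient columns by their old names
setnames(wide, c("(Intercept)", "y"), c("a", "b"))
wide[]
```

Renaming by old name rather than by position (2:3) is a design choice here; it fails loudly if the coefficient names ever change.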