R - use group_by() and mutate() in dplyr to apply a function that returns a vector the length of groups
Problem description
Take the following example data:
set.seed(1)
foo <- data.frame(x=rnorm(10, 0, 10), y=rnorm(10, 0, 10), fac = c(rep("A", 5), rep("B", 5)))
I want to split the data frame "foo" by the variable "fac" into A's and B's, apply a function (Mahalanobis distance) that returns a vector the length of each subgroup, and then mutate the output back onto the original data frame. For example:
auto.mahalanobis <- function(x) {
  temp <- x[, c("x", "y")]
  return(mahalanobis(temp, center = colMeans(temp, na.rm = TRUE),
                     cov = cov(temp, use = "pairwise.complete.obs")))
}
foo %>% group_by(fac) %>%
  mutate(mahal = auto.mahalanobis(.))
This gives an error. Obviously the procedure can be done manually by splitting the dataset, applying the function to each piece, adding the output as a column, and putting it all back together again. But there must be a more efficient way to do this (perhaps this is a misuse of dplyr?).
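The manual split-apply-combine route mentioned above can be sketched in base R as follows. This is only a minimal sketch, not the asker's own code: it reuses foo and auto.mahalanobis from the question and relies on split(), lapply() and unsplit() from base R.

```r
set.seed(1)
foo <- data.frame(x = rnorm(10, 0, 10), y = rnorm(10, 0, 10),
                  fac = c(rep("A", 5), rep("B", 5)))

auto.mahalanobis <- function(d) {
  temp <- d[, c("x", "y")]
  mahalanobis(temp, center = colMeans(temp, na.rm = TRUE),
              cov = cov(temp, use = "pairwise.complete.obs"))
}

# Split the rows by fac, compute the distances within each piece,
# then put the per-group vectors back into the original row order.
pieces <- split(foo, foo$fac)
dists  <- lapply(pieces, auto.mahalanobis)
foo$mahal <- unname(unsplit(dists, foo$fac))
```

unsplit() reverses split() using the same grouping factor, so the distances line up with the original rows even if the groups are interleaved.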
Recommended answer
How about using nest instead:
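library(dplyr)
library(tidyr)
library(purrr)

foo %>%
  group_by(fac) %>%
  nest() %>%
  mutate(mahal = map(data, ~mahalanobis(
    .x,
    center = colMeans(.x, na.rm = TRUE),
    cov = cov(.x, use = "pairwise.complete.obs")))) %>%
  unnest()  # tidyr >= 1.0 warns here; name the columns instead: unnest(cols = c(mahal, data))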
## A tibble: 10 x 4
# fac mahal x y
# <fct> <dbl> <dbl> <dbl>
# 1 A 1.02 -6.26 15.1
# 2 A 0.120 1.84 3.90
# 3 A 2.81 -8.36 -6.21
# 4 A 2.84 16.0 -22.1
# 5 A 1.21 3.30 11.2
# 6 B 2.15 -8.20 -0.449
# 7 B 2.86 4.87 -0.162
# 8 B 1.23 7.38 9.44
# 9 B 0.675 5.76 8.21
#10 B 1.08 -3.05 5.94
Here you avoid an explicit "x", "y" filter of the form temp <- x[, c("x", "y")], as you nest the relevant columns after grouping by fac. Applying mahalanobis is then straightforward.
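If your dplyr is version 1.1.0 or later (an assumption about the installed version), pick() is another way to hand only the x and y columns of the current group to a function inside mutate(), without nesting. The sketch below is not part of the original answer, and the helper name mahal_dist is hypothetical.

```r
library(dplyr)

set.seed(1)
foo <- data.frame(x = rnorm(10, 0, 10), y = rnorm(10, 0, 10),
                  fac = c(rep("A", 5), rep("B", 5)))

# Hypothetical helper: Mahalanobis distance of each row of d
# from the column means of d, using d's own covariance matrix.
mahal_dist <- function(d) {
  unname(mahalanobis(d,
                     center = colMeans(d, na.rm = TRUE),
                     cov = cov(d, use = "pairwise.complete.obs")))
}

result <- foo %>%
  group_by(fac) %>%
  mutate(mahal = mahal_dist(pick(x, y))) %>%  # pick() returns the current group's x and y
  ungroup()
```

Because mutate() evaluates the expression once per group, mahal_dist() receives five rows at a time and returns five distances, which is exactly the shape mutate() expects.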
To respond to your comment, here is a purrr option. Since it's easy to lose track of what's going on, let's go step by step:
Generate sample data with one additional column.
set.seed(1)
foo <- data.frame(
  x = rnorm(10, 0, 10),
  y = rnorm(10, 0, 10),
  z = rnorm(10, 0, 10),
  fac = c(rep("A", 5), rep("B", 5)))
We now store the columns that define each subset of data used to calculate the Mahalanobis distance in a list:
cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z"))
So we will calculate the Mahalanobis distance (per fac) for the subset of data in columns x + y, and then separately for y + z. The names of cols will be used as the column names of the two distance vectors.
Now for the actual purrr command:
imap_dfc(cols, ~nest(foo %>% group_by(fac), .x, .key = !!.y) %>% select(!!.y)) %>%
  mutate_all(function(lst) map(lst, ~mahalanobis(
    .x,
    center = colMeans(.x, na.rm = TRUE),
    cov = cov(.x, use = "pairwise.complete.obs")))) %>%
  unnest() %>%
  bind_cols(foo, .)
# x y z fac cols1 cols2
#1 -6.264538 15.1178117 9.1897737 A 1.0197542 1.3608052
#2 1.836433 3.8984324 7.8213630 A 0.1199607 1.1141352
#3 -8.356286 -6.2124058 0.7456498 A 2.8059562 1.5099574
#4 15.952808 -22.1469989 -19.8935170 A 2.8401953 3.0675228
#5 3.295078 11.2493092 6.1982575 A 1.2141337 0.9475794
#6 -8.204684 -0.4493361 -0.5612874 B 2.1517055 1.2284793
#7 4.874291 -0.1619026 -1.5579551 B 2.8626501 1.1724828
#8 7.383247 9.4383621 -14.7075238 B 1.2271316 2.5723023
#9 5.757814 8.2122120 -4.7815006 B 0.6746788 0.6939081
#10 -3.053884 5.9390132 4.1794156 B 1.0838341 2.3328276
In short, we
- loop over the entries in cols,
- nest the data in foo per fac, based on the columns defined in cols,
- apply mahalanobis to the nested and grouped data, generating as many distance columns as we have entries in cols (i.e. subsets), and
- finally unnest the distance data and column-bind it to the original foo data.
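For comparison, the steps above can also be sketched without nesting, using only base R. This is an illustrative alternative under stated assumptions, not the answer's method: the helper mahal_by_group is hypothetical, and it swaps nest()/unnest() for split()/unsplit().

```r
set.seed(1)
foo <- data.frame(
  x = rnorm(10, 0, 10),
  y = rnorm(10, 0, 10),
  z = rnorm(10, 0, 10),
  fac = c(rep("A", 5), rep("B", 5)))

cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z"))

# Hypothetical helper: per-group Mahalanobis distances for one column subset,
# returned in the original row order of `data`.
mahal_by_group <- function(data, columns, group) {
  pieces <- split(data[, columns, drop = FALSE], data[[group]])
  dists <- lapply(pieces, function(d) {
    mahalanobis(d,
                center = colMeans(d, na.rm = TRUE),
                cov = cov(d, use = "pairwise.complete.obs"))
  })
  unname(unsplit(dists, data[[group]]))
}

# Loop over the column subsets; the names of `cols` become the new column names.
distances <- as.data.frame(lapply(cols, function(cc) mahal_by_group(foo, cc, "fac")))
result <- cbind(foo, distances)
```

Each entry of cols yields one distance column, so adding a third subset to the list adds a third column without touching the rest of the code.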