R - use group_by() and mutate() in dplyr to apply a function that returns a vector the length of groups

Problem description

Take the following example data:

set.seed(1)

foo <- data.frame(x=rnorm(10, 0, 10), y=rnorm(10, 0, 10), fac = c(rep("A", 5), rep("B", 5)))

I want to split the data frame "foo" by the variable "fac" into A's and B's, apply a function (Mahalanobis distance) that returns a vector the length of each subgroup, and then mutate the output back onto the original data frame. For example:

auto.mahalanobis <- function(x) {
  temp <- x[, c("x", "y")]
  return(mahalanobis(temp, center = colMeans(temp, na.rm = TRUE),
                     cov = cov(temp, use = "pairwise.complete.obs")))
}

foo %>% group_by(fac) %>%
  mutate(mahal = auto.mahalanobis(.))

This gives an error. Obviously the procedure can be done manually by splitting the dataset, applying the function, and adding the output as a column before putting it all back together again. But there must be a more efficient way to do this (perhaps this is a misuse of dplyr?).
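
For reference, the manual route described in the question can be sketched in base R. This is only an illustration: `split()` and `unsplit()` stand in for whatever manual procedure the asker had in mind, and `pieces` and `foo_manual` are names introduced here, not part of the original question.

```r
# Sample data and helper from the question.
set.seed(1)
foo <- data.frame(x = rnorm(10, 0, 10), y = rnorm(10, 0, 10),
                  fac = c(rep("A", 5), rep("B", 5)))

auto.mahalanobis <- function(x) {
  temp <- x[, c("x", "y")]
  mahalanobis(temp, center = colMeans(temp, na.rm = TRUE),
              cov = cov(temp, use = "pairwise.complete.obs"))
}

# Split by group, add the distance column to each piece, then put the
# pieces back together in the original row order.
pieces <- split(foo, foo$fac)            # hypothetical name
pieces <- lapply(pieces, function(d) {
  d$mahal <- auto.mahalanobis(d)
  d
})
foo_manual <- unsplit(pieces, foo$fac)   # hypothetical name
```

Because the function is called once per piece, `auto.mahalanobis()` only ever sees the rows of a single group, which is what the grouped `mutate()` call was meant to achieve.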

Recommended answer

How about using nest instead:

library(tidyverse)  # provides %>%, group_by, nest, mutate, map, unnest

foo %>%
    group_by(fac) %>%
    nest() %>%
    mutate(mahal = map(data, ~mahalanobis(
        .x,
        center = colMeans(.x, na.rm = TRUE),
        cov = cov(.x, use = "pairwise.complete.obs")))) %>%
    unnest()
## A tibble: 10 x 4
#   fac   mahal      x       y
#   <fct> <dbl>  <dbl>   <dbl>
# 1 A     1.02   -6.26  15.1
# 2 A     0.120   1.84   3.90
# 3 A     2.81   -8.36  -6.21
# 4 A     2.84   16.0  -22.1
# 5 A     1.21    3.30  11.2
# 6 B     2.15   -8.20  -0.449
# 7 B     2.86    4.87  -0.162
# 8 B     1.23    7.38   9.44
# 9 B     0.675   5.76   8.21
#10 B     1.08   -3.05   5.94

Here you avoid an explicit "x", "y" filter of the form temp <- x[, c("x", "y")], because nest collects the relevant columns after you group by fac. Applying mahalanobis is then straightforward.
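
With tidyr 1.0.0 or later, calling `unnest()` without arguments is expected to warn that `cols` should be supplied. Below is a minimal sketch of the same idea that names the list-columns explicitly. It assumes dplyr, tidyr (>= 1.0.0) and purrr are installed; `res` is a name introduced here for illustration, and the column order of the result may differ from the output shown above.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
foo <- data.frame(x = rnorm(10, 0, 10), y = rnorm(10, 0, 10),
                  fac = c(rep("A", 5), rep("B", 5)))

res <- foo %>%
  group_by(fac) %>%
  nest() %>%                          # list-column `data` holds x and y per group
  mutate(mahal = map(data, ~ mahalanobis(
    .x,
    center = colMeans(.x, na.rm = TRUE),
    cov = cov(.x, use = "pairwise.complete.obs")))) %>%
  unnest(cols = c(data, mahal)) %>%   # expand both list-columns together
  ungroup()
```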

To respond to your comment, here is a purrr option. Since it's easy to lose track of what's going on, let's go step by step:

  1. Generate sample data with one additional column.

set.seed(1)
foo <- data.frame(
    x = rnorm(10, 0, 10),
    y = rnorm(10, 0, 10),
    z = rnorm(10, 0, 10),
    fac = c(rep("A", 5), rep("B", 5)))

  2. We now store the columns that define the data subsets used for the Mahalanobis distance calculation in a list:

    cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z"))
    

    So we will calculate the Mahalanobis distance (per fac) for the subset of data in columns x+y, and then separately for y+z. The names of cols will be used as the column names of the two distance vectors.

    Now for the actual purrr command:

    imap_dfc(cols, ~nest(foo %>% group_by(fac), .x, .key = !!.y) %>% select(!!.y)) %>%
        mutate_all(function(lst) map(lst, ~mahalanobis(
            .x,
            center = colMeans(.x, na.rm = TRUE),
            cov = cov(.x, use = "pairwise.complete.obs")))) %>%
        unnest() %>%
        bind_cols(foo, .)
    #           x           y           z fac     cols1     cols2
    #1  -6.264538  15.1178117   9.1897737   A 1.0197542 1.3608052
    #2   1.836433   3.8984324   7.8213630   A 0.1199607 1.1141352
    #3  -8.356286  -6.2124058   0.7456498   A 2.8059562 1.5099574
    #4  15.952808 -22.1469989 -19.8935170   A 2.8401953 3.0675228
    #5   3.295078  11.2493092   6.1982575   A 1.2141337 0.9475794
    #6  -8.204684  -0.4493361  -0.5612874   B 2.1517055 1.2284793
    #7   4.874291  -0.1619026  -1.5579551   B 2.8626501 1.1724828
    #8   7.383247   9.4383621 -14.7075238   B 1.2271316 2.5723023
    #9   5.757814   8.2122120  -4.7815006   B 0.6746788 0.6939081
    #10 -3.053884   5.9390132   4.1794156   B 1.0838341 2.3328276
    

    In short, we

    1. loop over the entries in cols,
    2. nest the data in foo per fac, based on the columns defined in cols,
    3. apply mahalanobis to the nested and grouped data, generating as many distance columns as there are entries in cols (i.e. subsets), and
    4. finally unnest the distance data and column-bind it to the original foo data.
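
The steps above can also be sketched without nesting. The following is an alternative under stated assumptions, not the answer's method: it swaps in `dplyr::group_modify()` (available from dplyr 0.8.1) for the `nest()`/`unnest()` round trip, and `mahal_subset` and `res` are hypothetical names introduced here.

```r
library(dplyr)

set.seed(1)
foo <- data.frame(
  x = rnorm(10, 0, 10),
  y = rnorm(10, 0, 10),
  z = rnorm(10, 0, 10),
  fac = c(rep("A", 5), rep("B", 5)))

cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z"))

# Hypothetical helper: Mahalanobis distance of each row of `d`,
# computed from the columns named in `vars` only.
mahal_subset <- function(d, vars) {
  m <- as.matrix(d[, vars])
  mahalanobis(m, center = colMeans(m, na.rm = TRUE),
              cov = cov(m, use = "pairwise.complete.obs"))
}

res <- foo %>%
  group_by(fac) %>%
  group_modify(function(d, key) {
    # Add one new column per entry in `cols`, named after that entry.
    for (nm in names(cols)) d[[nm]] <- mahal_subset(d, cols[[nm]])
    d
  }) %>%
  ungroup()
```

In this sketch `fac` moves to the first column and the rows come back grouped by `fac`, so the row order can differ from `foo` when `foo` is not already sorted by group.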
