dplyr :: mutate_at()依赖于具有给定前缀/后缀的多列 [英] dplyr::mutate_at() relying on multiple columns with a given prefix/suffix

查看:66
本文介绍了dplyr :: mutate_at()依赖于具有给定前缀/后缀的多列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

dplyr :: mutate_at()可用于将同一函数应用于多个列。它还允许您使用命名列表在新列中设置结果。

dplyr::mutate_at() can be used to apply the same function to multiple columns. It also allows you to set the results in new columns using a named list.

但是,如果我有成对的许多列(例如, data1_a data1_b data2_a data2_b ,...),我想将这些对相乘?可以吗?

However, what if I have many columns in pairs (say, data1_a, data1_b, data2_a, data2_b, ...) and I want to multiply those pairs together? Is that possible?

用手,看起来像是

suppressPackageStartupMessages({
  library(dplyr)
})

data.frame(data1_a = 1:3, data1_b = 2:4,
           data2_a = 3:5, data2_b = 4:6) %>%
  mutate(
    data1 = data1_a * data1_b,
    data2 = data2_a * data2_b
  )
#>   data1_a data1_b data2_a data2_b data1 data2
#> 1       1       2       3       4     2    12
#> 2       2       3       4       5     6    20
#> 3       3       4       5       6    12    30

我当前的解决方案是编写一个使用不带后缀的变量名的函数(即 data1 ),创建带后缀的名称,然后对该变量执行简单的 mutate(),使用 get()。然后,我为每个输出调用该函数:

My current solution is to write a function which takes the unsuffixed variable name (i.e. "data1"), creates the suffixed names and then performs a simple mutate() on that variable using get(). I then call that function for each output:

foo <- function(df, name) {
  a <- paste0(name, "_a")
  b <- paste0(name, "_b")

  return(
    mutate(
      df,
      !!name := get(a) * get(b)
    )
  )
}

data.frame(data1_a = 1:3, data1_b = 2:4,
           data2_a = 3:5, data2_b = 4:6) %>%
  foo("data1") %>%
  foo("data2")
#>   data1_a data1_b data2_a data2_b data1 data2
#> 1       1       2       3       4     2    12
#> 2       2       3       4       5     6    20
#> 3       3       4       5       6    12    30

(或在所有变量名上写一个循环,如果还有更多

(or write a loop over all the variable names if there were more of them)

但是如果可以使用 mutate_at 或类似的东西,那会更清洁

But if it's possible to use mutate_at or something of the sort, that'd be much cleaner.

推荐答案

在采用@akrun的优雅解决方案后,我注意到它效率很低(因为它必须重新创建两个数据帧),在具有20,000行和11个组的数据集上花了几乎一秒钟。

After adopting @akrun's elegant solution, I noticed it was unfortunately very inefficient (since it has to recreate two dataframes), taking almost a second on a dataset with 20,000 rows and 11 "groups".

所以不久前,我开发了以下函数(在 @ user12728748 ...很抱歉,我们没有在此处早些发布),该名称带有组名( data1, data2,等等)和使用前缀的公式,允许 bquote 样式引用常量名称:

So a while ago I developed the following function (with a bit of help from @user12728748... sorry for not posting here sooner), which takes the names of the groups ("data1", "data2", etc) and a formula using the prefixes, allowing for bquote-style quoting for constant names:

suppressPackageStartupMessages(library(dplyr))

mutateSet <- function(df, colNames, formula,
                      isPrefix = TRUE,
                      separator = "_") {
  vars <- all.vars(formula)

  # extracts names wrapped in `.()`
  escapedNames <- function (expr)
  {
    unquote <- function(e) {
      if (is.pairlist(e) || length(e) <= 1L) NULL
      else if (e[[1L]] == as.name("."))      deparse(e[[2L]])
      else                                   unlist(sapply(e, unquote))
    }
    unquote(substitute(expr))
  }

  escapedVars <- eval(rlang::expr(escapedNames(!!formula)))

  # remove escaped names from mapping variables
  vars <- setdiff(vars, escapedVars)

  # get output prefix/suffix as string
  lhs <- rlang::f_lhs(formula) %>%
    all.vars()

  # get operation as string
  # deparse() can have line breaks; paste0() brings it back to one line
  rhs <- rlang::f_rhs(formula) %>%
    deparse() %>%
    paste0(collapse = "")

  # dummy function to cover for bquote escaping
  . <- function(x) x

  for (i in colNames) {
    if (isPrefix) {
      aliases <- paste0(vars, separator, i)
      newCol  <- paste0(lhs,  separator, i)
    } else {
      aliases <- paste0(i, separator, vars)
      newCol  <- paste0(i, separator, lhs)
    }

    if (length(lhs) == 0) newCol <- i

    mapping <- rlang::list2(!!!aliases)
    names(mapping) <- vars
    mapping <- do.call(wrapr::qc, mapping)

    df <- rlang::expr(wrapr::let(
      mapping,
      df %>% dplyr::mutate(!!newCol := ...RHS...)
    )) %>%
      deparse() %>%
      gsub(
        pattern = "...RHS...",
        replacement = rhs
      ) %>%
      {eval(parse(text = .))}
  }

  return(df)
}

df <- data.frame(a_data1 = 1:3, b_data1 = 2:4,
                a_data2 = 3:5, b_data2 = 4:6,
                static = 5:7)

mutateSet(df, "data1", ~ a + b)
#>   a_data1 b_data1 a_data2 b_data2 static data1
#> 1       1       2       3       4      5     3
#> 2       2       3       4       5      6     5
#> 3       3       4       5       6      7     7
mutateSet(df, c("data1", "data2"), x ~ sqrt(a) + b)
#>   a_data1 b_data1 a_data2 b_data2 static  x_data1  x_data2
#> 1       1       2       3       4      5 3.000000 5.732051
#> 2       2       3       4       5      6 4.414214 7.000000
#> 3       3       4       5       6      7 5.732051 8.236068
mutateSet(df, c("data1", "data2"), ~ a + b + .(static))
#>   a_data1 b_data1 a_data2 b_data2 static data1 data2
#> 1       1       2       3       4      5     8    12
#> 2       2       3       4       5      6    11    15
#> 3       3       4       5       6      7    14    18

在2020-04-28由< a href = https://reprex.tidyverse.org rel = nofollow noreferrer> reprex软件包(v0.3.0)

Created on 2020-04-28 by the reprex package (v0.3.0)

这可能可以清除(特别是令人讨厌的for循环),但它现在可以使用。

This can probably be cleaned up (especially that heinous for-loop), but it works for now.

重复@ user12728748的性能测试,我们看到它快了约100倍。 :

Repeating @user12728748's performance test, we see this is ~100x faster:

suppressPackageStartupMessages({
    invisible(lapply(c("dplyr", "tidyr", "rlang", "wrapr", "microbenchmark"),
                     require, character.only = TRUE))
})

polymutate <- function(df, formula) {
    form <- rlang::f_rhs(formula)

    df %>% 
        mutate(rn = row_number()) %>%
        pivot_longer(cols = -rn, names_to = c('.value', 'grp'),
                     names_sep = "_") %>%
        group_by(grp) %>%
        transmute(rn, new = eval(form)) %>%
        pivot_wider(names_from = grp, values_from = new) %>%
        select(-rn) %>%
        bind_cols(df, .)
}

set.seed(1)                 
df <- setNames(data.frame(matrix(sample(1:12, 6E6, replace=TRUE), ncol=6)),
               c("a_data1", "b_data1", "a_data2", "b_data2", "a_data3", "b_data3"))

pd <- polymutate(df, ~ a + b)
pd2 <- mutateSet(df, c("data1", "data2", "data3"), ~ a + b)

all.equal(pd, pd2)
#> [1] TRUE

microbenchmark(polymutate(df, ~ a + b), 
               mutateSet(df, c("data1", "data2", "data3"), ~ a + b),
               times=10L)
#> Unit: milliseconds
#>        expr      min        lq       mean     median        uq       max neval
#>  polymutate 1612.306 1628.9776 1690.78586 1670.15600 1741.3490 1806.1412    10
#>  mutateSet     8.757    9.6302   13.27135   10.45965   19.2976   20.4657    10

这篇关于dplyr :: mutate_at()依赖于具有给定前缀/后缀的多列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆