使用带别名的公式执行多列操作 [英] Using formulas with aliases to perform multi-column operations

查看:39
本文介绍了使用带别名的公式执行多列操作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

此问题与上一个相关我问过一个,但试图变得更通用.我想使用公式对多个组"数据(即 a_data1 a_data2 b_data1 b_data2 ,然后使用 * _ data1 列进行操作.

This question is related to a previous one I asked, but trying to be more generic. I want to use formulas to perform operations on multiple "groups" of data (i.e. a_data1, a_data2, b_data1, b_data2, and then make operations using the *_data1 columns).

基于@akrun对这个问题的回答,我创建了以下函数.它采用一个单边公式并将其应用于所有数据组":

Based on @akrun's answer to that question, I created the following function. It takes a one-sided formula and applies it to all the "groups of data":

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
})

polymutate <- function(df, formula,
                       pattern = "(.)_(.*)",
                       staticCols = NULL) {
  staticCols <- rlang::enquo(staticCols)

  rhs <- rlang::f_rhs(formula)
  names <- all.vars(rhs)
  df %>%
    mutate(
      rn = row_number()
    ) %>%
    pivot_longer(
      cols = -c(rn, !!staticCols),
      names_to = c(".value", "grp"),
      names_pattern = pattern
    ) %>%
    mutate(
      new = eval(rhs)
    ) %>%
    pivot_wider(
      names_from = grp,
      values_from = c(names, "new")
    ) %>%
    select(
      -rn
    ) %>%
    rename_at(
      vars(starts_with("new")),
      gsub, pattern = "^new_", replacement = ""
    )
}

df <- data.frame(a_data1 = 1:3, b_data1 = 2:4,
                 a_data2 = 3:5, b_data2 = 4:6,
                 static = 5:7)

polymutate(df, ~ a + b, staticCols = static)
#> # A tibble: 3 x 7
#>   static a_data1 a_data2 b_data1 b_data2 data1 data2
#>    <int>   <int>   <int>   <int>   <int> <int> <int>
#> 1      5       1       3       2       4     3     7
#> 2      6       2       4       3       5     5     9
#> 3      7       3       5       4       6     7    11

reprex软件包(v0.3.0)创建于2020-03-13 sup>

Created on 2020-03-13 by the reprex package (v0.3.0)

因此,此 polymutate 将数据帧转换为更长的格式,这样我们就有一列具有组名( data1 data2 )每个前缀一个( a b ).然后,它在此更深的数据帧的上下文中评估给定的公式(显然,公式中的名称必须与前缀匹配).完成后,它会将数据框扩展回其原始形状.

So, this polymutate converts the dataframe into a longer format such that we have one column with the group name (data1 or data2) and one per prefix (a and b). It then evaluates the given formula in the context of this deeper dataframe (obviously the names in the formula must match the prefixes). Once that's done, it widens the dataframe back to its original shape.

这很好,但是有点慢.在具有20,000行和11个组"的数据帧上使用它需要0.77秒.

This works quite well, but it's a bit slow. Using it on a dataframe with 20,000 rows and 11 "groups" takes 0.77 seconds.

我认为这是由于需要对如此大的数据帧进行两次重组:深化然后扩展.

I figured that was due to the need to restructure such a large dataframe twice: deepening and then widening it.

所以我想知道是否可以在没有麻烦的情况下做到这一点.我找到了 wrapr 软件包,该软件包使我们能够为名称创建别名.因此,我应该能够执行与上面类似的操作,并传递公式和要更改的列的名称.

So I wondered if I could do this without that hassle. I found the wrapr package, which allows us to create aliases for names. I should therefore be able to perform something similar to the above, passing the formula and the names of the columns I want to change.

然后可以提取公式中使用的变量,并使用它们来重建所需的列名,创建别名映射,然后使用该映射将公式应用于数据框.我已经很接近了,但是无法获得要评估的实际公式:

It could then extract the variables used in the formula and use them to rebuild the desired column names, create the alias mapping, and then use that mapping to apply the formula to the dataframe. I got quite close, but couldn't get the actual formula to be evaluated:

suppressPackageStartupMessages({
  library(dplyr)
})

polymutate2 <- function(df, formula, name) {
  vars <- all.vars(formula)
  rhs <- rlang::f_rhs(formula)
  aliases <- paste0(vars, "_", name)
  mapping <- rlang::list2(!!!aliases)
  names(mapping) <- vars

  mapping <- do.call(wrapr::qc, mapping)
  wrapr::let(
    mapping,
    df %>% mutate(!!name := a + b)
  )
}

df <- data.frame(a_data1 = 1:3, b_data1 = 2:4,
                 a_data2 = 3:5, b_data2 = 4:6,
                 static = 5:7)

polymutate2(df, ~ a + b, "data1")
#>   a_data1 b_data1 a_data2 b_data2 static data1
#> 1       1       2       3       4      5     3
#> 2       2       3       4       5      6     5
#> 3       3       4       5       6      7     7

reprex软件包(v0.3.0)创建于2020-03-13 sup>

Created on 2020-03-13 by the reprex package (v0.3.0)

您会注意到 mutate 调用具有一个硬编码的表达式,因为我无法使其与给定的公式一起使用.如先前版本中用 eval(rhs)替换该表达式会引发 object'a'not found 错误:

You'll notice the mutate call has a hard-coded expression, since I couldn't get it to work with the given formula. Replacing that expression with eval(rhs) as in the previous version throws an object 'a' not found error:

suppressPackageStartupMessages({
  library(dplyr)
  # library(tidyr)
})

polymutate2 <- function(df, formula, name) {
  vars <- all.vars(formula)
  rhs <- rlang::f_rhs(formula)
  aliases <- paste0(vars, "_", name)
  mapping <- rlang::list2(!!!aliases)
  names(mapping) <- vars

  mapping <- do.call(wrapr::qc, mapping)
  wrapr::let(
    mapping,
    df %>% mutate(!!name := eval(rhs))
  )
}

polymutate2(df, ~ a + b, "data1")
#> Error in eval(rhs): object 'a' not found

如果我可以使用它(并且假设该解决方案不会显着损害性能),它的运行速度会更快:只需花费0.03秒即可运行一系列 polymutate2 的链(一个用于我的20,000行数据框中的11个组中的每一个).

If I can get this to work (and assuming the solution doesn't dramatically harm performance), it's much faster: it only takes 0.03 seconds to run a chain of polymutate2's (one for each of the 11 groups in my 20,000 row dataframe).

那么,如何使 polymutate2 与任何公式一起使用?我愿意接受任何建议,如果存在其他解决方案,则无需使用 wrapr .(我还担心,如果公式很复杂,调用函数之类的东西,只是还没有设法检查的话,这种解决方案可能就行不通了.)

So, how can I get polymutate2 to work with any formula? I'm open to any sort of suggestion, no need to use wrapr if some other solution exists. (I'm also concerned this solution might not work if the formula is complex, calling functions or whatnot, just haven't managed to check yet).

推荐答案

也许某个知识渊博的人可以采用一种更加整洁的方法来解决问题,但是可以通过包装整个包装器来解决问题(不是很优雅):: let调用 eval(parse(text = ..))-绝对更快:

Maybe someone more knowledgeable can chime in with a more tidyverse-y approach, but the problem can be solved (not very elegantly, admittedly) by wrapping the entire wrapr::let call into eval(parse(text=..)) - it is definitely faster:


suppressPackageStartupMessages({
    invisible(lapply(c("dplyr", "tidyr", "rlang", "wrapr", "microbenchmark"),
                     require, character.only = TRUE))
})

polymutate <- function(df, formula,
                       pattern = "(.)_(.*)",
                       staticCols = NULL) {
    staticCols <- rlang::enquo(staticCols)

    rhs <- rlang::f_rhs(formula)
    names <- all.vars(rhs)
    df %>%
        mutate(
            rn = row_number()
        ) %>%
        pivot_longer(
            cols = -c(rn, !!staticCols),
            names_to = c(".value", "grp"),
            names_pattern = pattern
        ) %>%
        mutate(
            new = eval(rhs)
        ) %>%
        pivot_wider(
            names_from = grp,
            values_from = c(names, "new")
        ) %>%
        select(
            -rn
        ) %>%
        rename_at(
            vars(starts_with("new")),
            gsub, pattern = "^new_", replacement = ""
        )
}

polymutate2 <- function(df, formula, name) {
    vars <- all.vars(formula)
    rhs <- deparse(rlang::f_rhs(formula))
    aliases <- paste0(vars, "_", name)
    mapping <- rlang::list2(!!!aliases)
    names(mapping) <- vars
    mapping <- do.call(wrapr::qc, mapping)
    eval(parse(text=paste0("wrapr::let(mapping, df %>% mutate(!!name := ", rhs, "))" ))
    )
}

set.seed(1)                 
df <- setNames(data.frame(matrix(sample(1:12, 7E6, replace=TRUE), ncol=7)),
               c("a_data1", "b_data1", "a_data2", "b_data2", "a_data3", "b_data3", "static"))

pd <- polymutate(df, ~ a + b, staticCols = static)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(names)` instead of `names` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
pd2 <- polymutate2(df, ~ a + b, "data1") %>% polymutate2(., ~ a + b, "data2") %>% polymutate2(., ~ a + b, "data3") %>% dplyr::select(static, everything()) %>% 
    as_tibble()

all.equal(pd, pd2)
#> [1] TRUE

microbenchmark(polymutate(df, ~ a + b, staticCols = static), 
               polymutate2(df, ~ a + b, "data1") %>% polymutate2(., ~ a + b, "data2") %>% polymutate2(., ~ a + b, "data3") %>% dplyr::select(static, everything()) %>% 
                   as_tibble(),
               times=10L)
#> Unit: milliseconds
#>                                                                                                                                                                        expr
#>                                                                                                                                 polymutate(df, ~a + b, staticCols = static)
#>  polymutate2(df, ~a + b, "data1") %>% polymutate2(., ~a + b, "data2") %>%      polymutate2(., ~a + b, "data3") %>% dplyr::select(static,      everything()) %>% as_tibble()
#>          min          lq       mean     median         uq        max neval cld
#>  1143.582663 1151.206750 1171.46502 1173.03649 1188.91108 1209.01984    10   b
#>     9.553352    9.619473   10.88463   10.59397   12.27675   12.52403    10  a

reprex软件包(v0.3.0)创建于2020-03-14 sup>

Created on 2020-03-14 by the reprex package (v0.3.0)

这篇关于使用带别名的公式执行多列操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆