Using pmap with a to apply different regular expressions to different variables in a tibble?

Question

This question is very similar to Using pmap to apply different regular expressions to different variables in a tibble?, but differs because I realized my examples were not sufficient to describe my problem.

I'm trying to apply different regular expressions to different variables in a tibble. For example, I've made a tibble listing 1) the variable name I want to modify, 2) the regex I want to match, and 3) the replacement string. I'd like to apply the regex/replacement to the variable in a different data frame. Note that there may be variables in the target tibble that I don't want to modify, and the row order in my "configuration" tibble may not correspond to the column/variable order in my "target" tibble.

So my "configuration" tibble could look like this:

test_config <-  dplyr::tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("","","", "")
)

I'd like to apply this to a target tibble:

test_target <- dplyr::tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

So the goal is to replace a different string with an empty string in each user-specified column of test_target.

The result should look like this:

result <- dplyr::tibble(
  col1 = c("Foo", "bar", "", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", ""),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

I can do what I want with a for loop, like this:

for (i in seq(nrow(test_config))) {
  test_target <- dplyr::mutate_at(test_target,
                   .vars = dplyr::vars(
                     tidyselect::matches(test_config$string_col[[i]])),
                   .funs = dplyr::funs(
                     stringr::str_replace_all(
                       ., test_config$pattern[[i]], 
                       test_config$replacement[[i]]))
  )
}

Instead, is there a more tidy way to do what I want? So far, thinking that purrr::pmap was the tool for the job, I've made a function that takes a data frame, variable name, regular expression, and replacement value and returns the data frame with a single variable modified. It behaves as expected:

testFun <- function(df, colName, regex, repVal){
  colName <- dplyr::enquo(colName)
  df <- dplyr::mutate_at(df,
                         .vars = dplyr::vars(
                           tidyselect::matches(!!colName)),
                         .funs = dplyr::funs(
                           stringr::str_replace_all(., regex, repVal))
  )
}

# try with example
out <- testFun(test_target, 
               test_config$string_col[[1]], 
               test_config$pattern[[1]], 
               "")

However, when I try to use that function with pmap, I run into a couple problems: 1) is there a better way to build the list for the pmap call than this?

purrr::pmap(
    list(test_target, 
         test_config$string_col, 
         test_config$pattern, 
         test_config$replacement),
    testFun
)

2) When I call pmap, I get an error:

Error: Element 2 has length 4, not 1 or 5.

So pmap isn't happy that I'm trying to pass a tibble of length 5 as an element of a list whose other elements are of length 4 (I thought it would recycle the tibble).

Note also that previously, when I called pmap with a 4-row tibble, I got a different error,

Error in UseMethod("tbl_vars") : 
  no applicable method for 'tbl_vars' applied to an object of class "character"
Called from: tbl_vars(tbl)
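
Both errors are consistent with how pmap treats its input: it walks the elements of each list entry in parallel, and a tibble is itself a list of columns, so a five-column tibble counts as an element of length 5 and each call would receive one column (a character vector) rather than the whole data frame. A minimal sketch of that behaviour, using hypothetical toy data (toy_target and toy_patterns are not from the question):

```r
library(purrr)
library(tibble)

# Hypothetical toy data: 3 rows, 2 columns, and 4 patterns to apply.
toy_target <- tibble(a = c("x", "y", "z"), b = c("1", "2", "3"))
toy_patterns <- c("^x$", "^1$", "^z$", "^3$")

# A tibble is a list of columns, so its length is the column count,
# not the row count.
n_elements <- length(toy_target)

# Wrapped in list(), the tibble becomes a single element that pmap
# recycles, so every call receives the whole data frame.
cols_seen <- pmap_int(
  list(list(toy_target), toy_patterns),
  function(df, pattern) ncol(df)
)
```

Even with the tibble wrapped this way, pmap would return one data frame per configuration row, each with only a single rule applied, so it does not accumulate the changes by itself.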

Can any of you suggest a way to use pmap to do what I want, or is there a different or better tidyverse approach to the problem?

Thanks!

Recommended answer

Here are two tidyverse ways. One is similar to the data.table answer, in that it involves reshaping the data, joining it with the configs, and reshaping back to wide. The other is purrr-based and, in my opinion, a little bit of a weird approach. I'd recommend the first, since it feels more intuitive.

Use tidyr::gather to make the data long-shaped, then dplyr::left_join to make sure that every text value from test_target has a corresponding pattern & replacement—even the cases (col5) without patterns will be retained by using a left join.

library(tidyverse)
...

test_target %>%
  gather(key = col, value = text) %>%
  left_join(test_config, by = c("col" = "string_col"))
#> # A tibble: 25 x 4
#>    col   text  pattern replacement
#>    <chr> <chr> <chr>   <chr>      
#>  1 col1  Foo   "^\\.$" ""         
#>  2 col1  bar   "^\\.$" ""         
#>  3 col1  .     "^\\.$" ""         
#>  4 col1  NA    "^\\.$" ""         
#>  5 col1  NULL  "^\\.$" ""         
#>  6 col2  Foo   ^NA$    ""         
#>  7 col2  bar   ^NA$    ""         
#>  8 col2  .     ^NA$    ""         
#>  9 col2  NA    ^NA$    ""         
#> 10 col2  NULL  ^NA$    ""         
#> # ... with 15 more rows

Using ifelse, replace the pattern where one exists, or keep the original text where it doesn't. Keep just the necessary columns, add a row number because spread needs unique IDs, and make the data wide again.

test_target %>%
  gather(key = col, value = text) %>%
  left_join(test_config, by = c("col" = "string_col")) %>% 
  mutate(new_text = ifelse(is.na(pattern), text, str_replace(text, pattern, replacement))) %>%
  select(col, new_text) %>%
  group_by(col) %>%
  mutate(row = row_number()) %>%
  spread(key = col, value = new_text) %>%
  select(-row)
#> # A tibble: 5 x 5
#>   col1  col2  col3  col4  col5    
#>   <chr> <chr> <chr> <chr> <chr>   
#> 1 Foo   Foo   Foo   NULL  I       
#> 2 bar   bar   bar   NA    am      
#> 3 ""    .     .     Foo   not     
#> 4 NA    ""    NA    .     changing
#> 5 NULL  NULL  ""    bar   .
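
If the round trip through long format feels heavy, the same replacement can be sketched without reshaping. This assumes dplyr 1.0.0 or later, which postdates this answer, for across() and cur_column(); the lookup vectors patterns and replacements are names introduced here for illustration. Unlike the question's tidyselect::matches call, it selects columns by exact name.

```r
library(dplyr)
library(stringr)

test_config <- tibble(
  string_col  = c("col1", "col2", "col4", "col3"),
  pattern     = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("", "", "", "")
)

test_target <- tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

# Named lookup vectors (illustrative names): column name -> pattern / replacement.
patterns     <- setNames(test_config$pattern, test_config$string_col)
replacements <- setNames(test_config$replacement, test_config$string_col)

# Touch only the configured columns; cur_column() gives the name of the
# column being processed, which indexes into the lookup vectors.
result <- test_target %>%
  mutate(across(
    all_of(test_config$string_col),
    ~ str_replace_all(.x, patterns[[cur_column()]], replacements[[cur_column()]])
  ))
```

Because the lookup is by name, the row order of test_config does not need to match the column order of test_target, and col5 is left alone because it is never selected.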

The second way is to make a tiny tibble of just the column names, join that with the configs, and split into a list of lists. Then purrr::map2_dfc maps over both this list you've created and the columns of test_target, and returns a data frame by cbinding. The reason this works is that data frames are technically lists of columns, so if you map over a data frame, you're treating each column like a list item. I couldn't get an ifelse to work right here: ifelse returns a result shaped like its test, and is.na(info$pattern) is a single value, so only a single string came back instead of the whole vector.

tibble(all_cols = names(test_target)) %>%
  left_join(test_config, by = c("all_cols" = "string_col")) %>%
  split(.$all_cols) %>%
  map(as.list) %>%
  map2_dfc(test_target, function(info, text) {
    if (is.na(info$pattern)) {
      text
    } else {
      str_replace(text, info$pattern, info$replacement)
    }
  })
#> # A tibble: 5 x 5
#>   col1  col2  col3  col4  col5    
#>   <chr> <chr> <chr> <chr> <chr>   
#> 1 Foo   Foo   Foo   NULL  I       
#> 2 bar   bar   bar   NA    am      
#> 3 ""    .     .     Foo   not     
#> 4 NA    ""    NA    .     changing
#> 5 NULL  NULL  ""    bar   .
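
Note that split() orders its pieces by name, so the map2_dfc pairing above relies on the columns of test_target already being in alphabetical order. A sketch that stays closer to the original pmap idea and looks columns up by name instead: pmap turns each configuration row into a rule, and purrr::reduce folds the rules over the data frame so that each change accumulates. The helper apply_rule and the variable rules are hypothetical names introduced here, not part of the original answer.

```r
library(purrr)
library(stringr)
library(tibble)

test_config <- tibble(
  string_col  = c("col1", "col2", "col4", "col3"),
  pattern     = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("", "", "", "")
)

test_target <- tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

# One named list per configuration row:
# list(string_col = ..., pattern = ..., replacement = ...)
rules <- pmap(test_config, list)

# Hypothetical helper: apply a single rule to a single column, by exact name.
apply_rule <- function(df, rule) {
  df[[rule$string_col]] <- str_replace_all(
    df[[rule$string_col]], rule$pattern, rule$replacement
  )
  df
}

# Start from test_target and apply each rule to the running result.
result <- reduce(rules, apply_rule, .init = test_target)
```

This is essentially the question's for loop expressed as a fold, so it keeps the same semantics while avoiding both the reshape and the positional pairing.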

Created on 2018-10-30 by the reprex package (v0.2.1)
