Tidyverse: Replacing entire strings based on partial matches

Question

I'm looking to replace entire string entries within data based on partial matches using functions in the stringr package.

The only method I've tried has been replacing exact matches using str_replace_all() but this becomes tedious and unwieldy when there are dozens of variations to correct for. I'm looking to replace based on partial matches. In my reprex below, I replace variants of "Spaniard" and "Colombian" by direct specification. However, I would love to perform those replacements based on something like meeting the condition that "Spa" or "Col" exists in the words.

library(tidyverse)
library(stringr)

data <- c(
  "Spanish",
  "SPANIARD",
  "Spainiard",
  "Colombian",
  "Columbian",
  "Ecuador",
  "Equador",
  "Ecuadorian",
  "VENEZUELAN"
)

str_replace_all(data,
                c(
                  "Spanish" = "Spaniard",
                  "SPANIARD" = "Spaniard",
                  "Spainiard" = "Spaniard",
                  "Columbian" = "Colombian"
                ))
#> [1] "Spaniard"   "Spaniard"   "Spaniard"   "Colombian"  "Colombian" 
#> [6] "Ecuador"    "Equador"    "Ecuadorian" "VENEZUELAN"

Created on 2019-05-21 by the reprex package (v0.2.1)

So str_replace_all() works as advertised, but I'm looking for a way to streamline this process in the tidyverse. Any help is much appreciated.

Recommended answer

I prefer to use a distance measure (e.g., Jaro-Winkler distance, or some other string distance), but those have their drawbacks. Be wary of what you could be changing with partial matching: a short prefix can catch entries you never meant to touch. If you are doing partial matching, it is wise to look at the distinct values first and see what the possibilities are. That said, you can do what you outlined in the tidyverse using case_when with startsWith or grepl:

tibble(data = data) %>%
  mutate(
    # lowercase helper column so the prefix tests are case-insensitive
    v1 = tolower(data),
    new_name = case_when(
      startsWith(v1, "spa") ~ "Spaniard",
      startsWith(v1, "col") ~ "Colombian",
      startsWith(v1, "eq") | startsWith(v1, "ec") ~ "Ecuadorian",
      startsWith(v1, "ven") ~ "Venezuelan",
      # anything unmatched keeps its original value
      TRUE ~ as.character(data)))

# A tibble: 9 x 3
  data       v1         new_name  
  <chr>      <chr>      <chr>     
1 Spanish    spanish    Spaniard  
2 SPANIARD   spaniard   Spaniard  
3 Spainiard  spainiard  Spaniard  
4 Colombian  colombian  Colombian 
5 Columbian  columbian  Colombian 
6 Ecuador    ecuador    Ecuadorian
7 Equador    equador    Ecuadorian
8 Ecuadorian ecuadorian Ecuadorian
9 VENEZUELAN venezuelan Venezuelan
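
Since the question asks specifically for stringr, the same idea can probably be written with str_detect() and a case-insensitive regex(), which avoids the lowercased helper column. This is only a minimal sketch, not the answerer's code: the patterns and the `result` name are assumptions chosen for illustration, and the prefixes should be checked against your real data before use.

```r
library(dplyr)
library(stringr)

data <- c("Spanish", "SPANIARD", "Spainiard", "Colombian", "Columbian",
          "Ecuador", "Equador", "Ecuadorian", "VENEZUELAN")

# `result` is a hypothetical name used only for this sketch.
result <- tibble(data = data) %>%
  mutate(
    new_name = case_when(
      # "^" anchors the pattern to the start of the string;
      # regex(ignore_case = TRUE) makes the match case-insensitive
      str_detect(data, regex("^spa", ignore_case = TRUE)) ~ "Spaniard",
      str_detect(data, regex("^col", ignore_case = TRUE)) ~ "Colombian",
      str_detect(data, regex("^e[cq]", ignore_case = TRUE)) ~ "Ecuadorian",
      str_detect(data, regex("^ven", ignore_case = TRUE)) ~ "Venezuelan",
      # anything unmatched keeps its original value
      TRUE ~ data
    )
  )

result
```

Dropping the "^" anchor would match the fragment anywhere in the string, which is closer to "Spa exists in the word" but also more likely to catch unintended entries.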

To see the possibilities you can do this (or several other things):

# tally the distinct lowercased values to see which variants exist
tibble(data = data) %>%
  arrange(data) %>%
  count(tolower(data))
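
The distance-measure idea mentioned at the top of this answer can be sketched roughly as follows. To stay within base R, this sketch swaps in Levenshtein edit distance via utils::adist() rather than Jaro-Winkler (the stringdist package offers method = "jw" if you want that instead). The `canonical` vector and the `max_dist` cutoff are assumptions made up for illustration; a sensible cutoff depends on your data.

```r
data <- c("Spanish", "SPANIARD", "Spainiard", "Colombian", "Columbian",
          "Ecuador", "Equador", "Ecuadorian", "VENEZUELAN")

# Hypothetical list of accepted spellings and a hypothetical cutoff.
canonical <- c("Spaniard", "Colombian", "Ecuadorian", "Venezuelan")
max_dist <- 4

# Matrix of edit distances: one row per entry, one column per canonical value
d <- adist(data, canonical, ignore.case = TRUE)

# Index and distance of the closest canonical value for each entry
nearest <- apply(d, 1, which.min)
best <- d[cbind(seq_along(data), nearest)]

# Replace only when the closest candidate is within the cutoff
cleaned <- ifelse(best <= max_dist, canonical[nearest], data)

data.frame(data, cleaned, distance = best)
```

Printing the distance next to each replacement makes it easier to spot borderline cases, which is the main drawback of distance-based matching noted above.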
