Tidyverse: Replacing entire strings based on partial matches
Question
I'm looking to replace entire string entries within data based on partial matches using functions in the stringr package.
The only method I've tried is replacing exact matches with str_replace_all(), but this becomes tedious and unwieldy when there are dozens of variations to correct for. I'd like to replace based on partial matches instead. In the reprex below, I replace variants of "Spaniard" and "Colombian" by specifying each one directly. However, I would love to perform those replacements based on a condition such as "Spa" or "Col" appearing in the word.
library(tidyverse)
library(stringr)

data <- c(
  "Spanish",
  "SPANIARD",
  "Spainiard",
  "Colombian",
  "Columbian",
  "Ecuador",
  "Equador",
  "Ecuadorian",
  "VENEZUELAN"
)

str_replace_all(
  data,
  c(
    "Spanish" = "Spaniard",
    "SPANIARD" = "Spaniard",
    "Spainiard" = "Spaniard",
    "Columbian" = "Colombian"
  )
)
#> [1] "Spaniard" "Spaniard" "Spaniard" "Colombian" "Colombian"
#> [6] "Ecuador" "Equador" "Ecuadorian" "VENEZUELAN"
Created on 2019-05-21 by the reprex package (v0.2.1)
So str_replace_all() works as advertised, but I'm looking for a way to streamline this process in the tidyverse. Any help is much appreciated.
Recommended answer
I prefer to use a distance measure (e.g., Jaro-Winkler distance), but distance measures have their drawbacks. Be wary of what partial matching could change: before matching on fragments, it is wise to look at which values are actually present. That said, you can do what you outlined in the tidyverse using case_when() with startsWith() or grepl():
tibble(data = data) %>%
  mutate(
    # lower-case once so the prefix checks are case-insensitive
    v1 = tolower(data),
    new_name = case_when(
      startsWith(v1, "spa") ~ "Spaniard",
      startsWith(v1, "col") ~ "Colombian",
      startsWith(v1, "eq") | startsWith(v1, "ec") ~ "Ecuadorian",
      startsWith(v1, "ven") ~ "Venezuelan",
      # anything unmatched keeps its original value
      TRUE ~ as.character(data)
    )
  )
# A tibble: 9 x 3
  data       v1         new_name
  <chr>      <chr>      <chr>
1 Spanish    spanish    Spaniard
2 SPANIARD   spaniard   Spaniard
3 Spainiard  spainiard  Spaniard
4 Colombian  colombian  Colombian
5 Columbian  columbian  Colombian
6 Ecuador    ecuador    Ecuadorian
7 Equador    equador    Ecuadorian
8 Ecuadorian ecuadorian Ecuadorian
9 VENEZUELAN venezuelan Venezuelan
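Since the question asks for stringr specifically, the same idea can be sketched with str_detect() in place of startsWith(). This is only an illustration, not part of the original answer: the patterns ("spa", "col", "^e[cq]u", "^ven") are assumptions chosen to fit this sample data, and an unanchored pattern such as "spa" will also match entries where the fragment appears mid-word.

```r
library(dplyr)
library(stringr)

data <- c("Spanish", "SPANIARD", "Spainiard", "Colombian", "Columbian",
          "Ecuador", "Equador", "Ecuadorian", "VENEZUELAN")

result <- tibble(data = data) %>%
  mutate(
    new_name = case_when(
      # regex(..., ignore_case = TRUE) removes the need for a lower-cased helper column;
      # the base R equivalent is grepl("spa", data, ignore.case = TRUE)
      str_detect(data, regex("spa", ignore_case = TRUE)) ~ "Spaniard",
      str_detect(data, regex("col", ignore_case = TRUE)) ~ "Colombian",
      # "^e[cq]u" covers both the "Ecu..." and "Equ..." spellings (assumed pattern)
      str_detect(data, regex("^e[cq]u", ignore_case = TRUE)) ~ "Ecuadorian",
      str_detect(data, regex("^ven", ignore_case = TRUE)) ~ "Venezuelan",
      # unmatched entries keep their original value
      TRUE ~ data
    )
  )

result
```

Conditions in case_when() are evaluated in order, so if two patterns could match the same entry, the first one listed wins.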
To see which values are present before deciding on patterns, you can tabulate them (this is one of several ways):
tibble(data = data) %>%
  arrange(data) %>%
  count(tolower(data))
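The distance-measure approach mentioned at the start of this answer could look roughly like the sketch below. It assumes the stringdist package (not used in the original answer), a hypothetical vector of canonical labels, and an arbitrary cutoff of 0.25; each of these would need tuning against real data, and the matches should be inspected before anything is overwritten.

```r
library(stringdist)

data <- c("Spanish", "SPANIARD", "Spainiard", "Colombian", "Columbian",
          "Ecuador", "Equador", "Ecuadorian", "VENEZUELAN")

# Hypothetical list of the labels we want to end up with
canonical <- c("Spaniard", "Colombian", "Ecuadorian", "Venezuelan")

# Jaro-Winkler distance (0 = identical, 1 = nothing in common) between
# every lower-cased entry (rows) and every canonical label (columns)
d <- stringdistmatrix(tolower(data), tolower(canonical), method = "jw", p = 0.1)

# Index and distance of the closest canonical label for each entry
nearest <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(data), nearest)]

# Assumed cutoff: only replace when the closest label is near enough
max_dist <- 0.25
matched <- ifelse(best_dist <= max_dist, canonical[nearest], data)

# Review the proposed replacements alongside their distances
data.frame(data, matched, best_dist)
```

Unlike the prefix rules, this needs no hand-written pattern per label, but a cutoff that is too loose can silently merge distinct values, which is the drawback noted above.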