R: Replacing foreign characters in a string
Question
I'm dealing with a large amount of data, mostly names with non-English characters. My goal is to match these names against information about the same people collected in the USA.
For example, I might want to match the name 'Sølvsten' (from some list of names) to 'Soelvsten' (the name as stored in some American database). Here is a function I wrote to do this. It is clearly clunky and somewhat arbitrary, so I wonder whether there is a simple R function that translates these foreign characters to their nearest English neighbours. I understand that there may be no standard way to do this conversion; I'm just curious whether one exists and whether it can be done through an R function.
# a function to replace foreign characters
replaceforeignchars <- function(x)
{
# gsub() is part of base R, so no extra package needs to be loaded
x <- gsub("š","s",x)
x <- gsub("œ","oe",x)
x <- gsub("ž","z",x)
x <- gsub("ß","ss",x)
x <- gsub("þ","y",x)
x <- gsub("à","a",x)
x <- gsub("á","a",x)
x <- gsub("â","a",x)
x <- gsub("ã","a",x)
x <- gsub("ä","a",x)
x <- gsub("å","a",x)
x <- gsub("æ","ae",x)
x <- gsub("ç","c",x)
x <- gsub("è","e",x)
x <- gsub("é","e",x)
x <- gsub("ê","e",x)
x <- gsub("ë","e",x)
x <- gsub("ì","i",x)
x <- gsub("í","i",x)
x <- gsub("î","i",x)
x <- gsub("ï","i",x)
x <- gsub("ð","d",x)
x <- gsub("ñ","n",x)
x <- gsub("ò","o",x)
x <- gsub("ó","o",x)
x <- gsub("ô","o",x)
x <- gsub("õ","o",x)
x <- gsub("ö","o",x)
x <- gsub("ø","oe",x)
x <- gsub("ù","u",x)
x <- gsub("ú","u",x)
x <- gsub("û","u",x)
x <- gsub("ü","u",x)
x <- gsub("ý","y",x)
x <- gsub("ÿ","y",x)
x <- gsub("ğ","g",x)
return(x)
}
Note: I know that name-matching algorithms exist, such as Jaro-Winkler distance matching, but I'd rather do exact matches.
Recommended answer
Try using the chartr R function for the one-character substitutions (which should be quite fast), then clean up with a series of gsub calls, one for each of the one-to-two-character substitutions (these will presumably be slower, but there are not many of them).
to.plain <- function(s) {
# 1 character substitutions
old1 <- "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüý"
new1 <- "szyaaaaaaceeeeiiiidnooooouuuuy"
s1 <- chartr(old1, new1, s)
# 2 character substitutions
old2 <- c("œ", "ß", "æ", "ø")
new2 <- c("oe", "ss", "ae", "oe")
s2 <- s1
for(i in seq_along(old2)) s2 <- gsub(old2[i], new2[i], s2, fixed = TRUE)
s2
}
Add to old1, new1, old2 and new2 as needed. Keep old1 and new1 the same length, because chartr maps the i-th character of old1 to the i-th character of new1.
Here is a test:
> s <- "æxš"
> to.plain(s)
[1] "aexs"
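As an illustration of adding to the tables, here is a minimal sketch that extends the single-character table with "ÿ" and "ğ", the two characters from the question's function that to.plain does not cover. The name to_plain_ext and the length check are additions for illustration, not part of the original answer, and the sketch assumes an R session running in a UTF-8 locale.

```r
# Hypothetical extended variant of to.plain(); the name is illustrative only.
to_plain_ext <- function(s) {
  # One-to-one substitutions: character i of old1 maps to character i of new1
  old1 <- "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüýÿğ"
  new1 <- "szyaaaaaaceeeeiiiidnooooouuuuyyg"
  # Guard against the two tables drifting out of step when they are edited
  stopifnot(nchar(old1) == nchar(new1))
  s1 <- chartr(old1, new1, s)

  # One-to-two substitutions, kept as explicit pairs in a named vector
  two <- c("œ" = "oe", "ß" = "ss", "æ" = "ae", "ø" = "oe")
  for (ch in names(two)) {
    s1 <- gsub(ch, two[[ch]], s1, fixed = TRUE)
  }
  s1
}

to_plain_ext(c("Sølvsten", "æxš"))
```

Storing the two-character pairs in a named vector keeps each source character next to its replacement, which makes it harder to misalign the pairs when new ones are added.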
UPDATE: corrected variable names in chartr.
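Since the question asks for exact matches against a database, the following sketch shows how normalized names could be looked up with base R's match. The sample names, the us_db data frame and the compact plain helper are all hypothetical stand-ins made up for illustration; in practice the full to.plain function would take the place of plain. A UTF-8 locale is assumed.

```r
# Hypothetical sample data, for illustration only
name_list <- c("Sølvsten", "Müller", "Smith")
us_db <- data.frame(
  name = c("Smith", "Soelvsten", "Muller"),
  id   = c(101, 102, 103),
  stringsAsFactors = FALSE
)

# Compact stand-in for to.plain(), covering only the characters used here
plain <- function(s) {
  s <- chartr("ü", "u", s)           # one-to-one substitution
  gsub("ø", "oe", s, fixed = TRUE)   # one-to-two substitution
}

# match() returns, for each normalized name, its row position in the
# database, or NA when there is no exact match
idx <- match(plain(name_list), us_db$name)
matched <- data.frame(original = name_list, id = us_db$id[idx])
matched
```

Whether a match is found depends entirely on the substitution table agreeing with how the database spelled the name: a database that stored "Mueller" would not match under the "ü" to "u" rule used here.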