Efficient String Search and Replace

Question

I am trying to clean about 2 million entries in a database of job titles. Many contain abbreviations that I want to change to a single, consistent, more easily searchable form. So far I am simply running through the column with individual mapply(gsub, ...) calls, but I have about 80 changes to make this way, so it takes almost 30 minutes to run. There has got to be a better way. I'm new to string searching; I found the *$ trick, which helped. Is there a way to do more than one search in a single mapply? I imagine that might be faster. Any help would be great. Thanks.

Here is some of the code. test is a column of 2 million individual job titles.

test <- mapply(gsub, " Admin ", " Administrator ", test)
test <- mapply(gsub, "Admin ", "Administrator ", test)
test <- mapply(gsub, " Admin*$", " Administrator", test)
test <- mapply(gsub, "Acc ", " Accounting ", test)
test <- mapply(gsub, " Admstr ", " Administrator ", test)
test <- mapply(gsub, " Anlyst ", " Analyst ", test)
test <- mapply(gsub, "Anlyst ", "Analyst ", test)
test <- mapply(gsub, " Asst ", " Assistant ", test)
test <- mapply(gsub, "Asst ", "Assistant ", test)
test <- mapply(gsub, " Assoc ", " Associate ", test)
test <- mapply(gsub, "Assoc ", "Associate ", test)


Recommended answer

Here is a base R solution that works. You can define a data frame containing all patterns and their replacements, then use apply() in row mode (MARGIN = 1) to call gsub() on your test vector for each pattern/replacement combination. Here is sample code demonstrating this:

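# One row per pattern/replacement pair
df <- data.frame(pattern=c(" Admin ", "Admin "),
                 replacement=c(" Administrator ", "Administrator "))

test <- c(" Admin ", "Admin ")

# For each row, x[1] is the pattern and x[2] the replacement.
# <<- updates test in the enclosing environment (here, the global one);
# invisible() suppresses the matrix that apply() would otherwise print.
invisible(apply(df, 1, function(x) {
                test <<- gsub(x[1], x[2], test)
             }))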

> test
[1] " Administrator " "Administrator " 
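
Two further points may help with the running time. gsub() is already vectorized over the vector being searched, so wrapping it in mapply() calls it once per title instead of once per pattern. Also, a word-boundary pattern (\\b) can stand in for the separate "start", "middle" and "end" variants of each abbreviation. The following is only a sketch under those assumptions: the lookup vector abbreviations and the helper expand_abbreviations() are hypothetical names, not part of the original question. Note that \\b also matches next to punctuation, so it is slightly broader than the space-delimited patterns in the question.

```r
# Hypothetical lookup table: names are abbreviations, values are replacements.
abbreviations <- c(Admin  = "Administrator",
                   Admstr = "Administrator",
                   Anlyst = "Analyst",
                   Asst   = "Assistant",
                   Assoc  = "Associate")

# Hypothetical helper: expand each abbreviation where it appears as a whole word.
expand_abbreviations <- function(titles, lookup) {
  for (abbr in names(lookup)) {
    # \\b matches the empty string at a word edge, so one pattern covers
    # the start, middle, and end of a title.
    pattern <- paste0("\\b", abbr, "\\b")
    # gsub() is vectorized over its third argument: one call per pattern
    # processes the whole vector of titles.
    titles <- gsub(pattern, lookup[[abbr]], titles)
  }
  titles
}

titles <- c("Admin Asst", "Senior Anlyst", "Assoc Admin", "Administrator")
expand_abbreviations(titles, abbreviations)
```

With about 80 abbreviations this makes about 80 vectorized gsub() calls in total, rather than one call per title per pattern. Because "Admin" is matched only as a whole word, titles that already contain "Administrator" are left alone.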
