Replace a set of pattern matches with corresponding replacement strings in R
Problem description
The str_replace (and preg_replace) function in PHP replaces all occurrences of the search string with the replacement string. What interests me most here is that if the search and replace arguments are arrays (in R we call them vectors), then str_replace takes a value from each array (vector) in turn and uses them to search and replace on the subject.
In other words, does R (or some R package) have a function to perform the following:
string <- "The quick brown fox jumped over the lazy dog."
patterns <- c("quick", "brown", "fox")
replacements <- c("slow", "black", "bear")
xxx_replace_xxx(string, patterns, replacements) ## ???
## [1] "The slow black bear jumped over the lazy dog."
So I am looking for something like chartr, but for search patterns and replacement strings of an arbitrary number of characters. This cannot be done with a single call to gsub(), because its replacement argument can only be a single string (see ?gsub). My current implementation is:
xxx_replace_xxx <- function(string, patterns, replacements) {
  for (i in seq_along(patterns))
    string <- gsub(patterns[i], replacements[i], string, fixed=TRUE)
  string
}
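One property of this loop worth keeping in mind when comparing alternatives: the substitutions are applied one after another, so a later pattern can also match text inserted by an earlier replacement. The sketch below (a copy of the function above plus made-up two-letter inputs, chosen only for illustration) contrasts this with chartr, which translates every character in a single pass.

```r
# Copy of the loop-based function from the question, so the example runs on its own.
xxx_replace_xxx <- function(string, patterns, replacements) {
  for (i in seq_along(patterns))
    string <- gsub(patterns[i], replacements[i], string, fixed = TRUE)
  string
}

# Illustrative inputs (not from the original data).
# Step 1 turns "ab" into "bb"; step 2 then rewrites both "b"s, including the new one.
res <- xxx_replace_xxx("ab", c("a", "b"), c("b", "c"))
print(res)                      # "cc"

# chartr translates each character once, so nothing is rewritten twice.
print(chartr("ab", "bc", "ab")) # "bc"
```

Any faster replacement has to reproduce this sequential behaviour if its results are to be identical to the loop's.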
However, I am looking for something much faster when length(patterns) is large. I have a lot of data to process, and I am dissatisfied with the performance of the current implementation.
Example toy data for benchmarking:
string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
patterns <- c("jak", "to", "do", "z", "na", "i", "w", "za", "tu", "gdy",
"po", "jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy", "czy",
"sam", "u", "tylko", "bez", "ich", "Telimena", "Wojski", "jeszcze")
replacements <- paste0(patterns, rev(patterns))
Recommended answer
Using PCRE instead of fixed matching takes about 1/3 of the time on my machine for your example.
xxx_replace_xxx_pcre <- function(string, patterns, replacements) {
  for (i in seq_along(patterns))
    string <- gsub(patterns[i], replacements[i], string, perl=TRUE)
  string
}
system.time(x <- xxx_replace_xxx(string, patterns, replacements))
# user system elapsed
# 0.491 0.000 0.491
system.time(p <- xxx_replace_xxx_pcre(string, patterns, replacements))
# user system elapsed
# 0.162 0.000 0.162
identical(x,p)
# [1] TRUE
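The two versions agree here because none of the example patterns contains a regex metacharacter. With perl=TRUE the patterns are interpreted as regular expressions, so a pattern such as "a.b" would match more than its literal text. A minimal sketch of one way to keep literal semantics under PCRE is to wrap each pattern in \Q...\E; the helper names quote_pcre and xxx_replace_xxx_pcre_literal are hypothetical, the approach assumes no pattern contains the sequence "\E", and I have not benchmarked whether the speed-up survives the quoting.

```r
# Hypothetical helper: wrap each pattern in \Q...\E so PCRE treats it literally.
# Assumption: no pattern contains "\E", which would end the quoting early.
quote_pcre <- function(patterns) paste0("\\Q", patterns, "\\E")

# Hypothetical variant of the PCRE loop that matches patterns literally.
xxx_replace_xxx_pcre_literal <- function(string, patterns, replacements) {
  quoted <- quote_pcre(patterns)
  for (i in seq_along(quoted))
    string <- gsub(quoted[i], replacements[i], string, perl = TRUE)
  string
}

s <- "a.b axb"                                          # illustrative input
fixed_res   <- gsub("a.b", "X", s, fixed = TRUE)        # literal match only
regex_res   <- gsub("a.b", "X", s, perl = TRUE)         # "." matches any character
literal_res <- xxx_replace_xxx_pcre_literal(s, "a.b", "X")

print(fixed_res)    # "X axb"
print(regex_res)    # "X X"
print(literal_res)  # "X axb"
```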