Replace a set of pattern matches with corresponding replacement strings in R


Question

The str_replace (and preg_replace) functions in PHP replace all occurrences of a search string with a replacement string. What interests me most is that if the search and replace arguments are arrays (vectors, in R terms), str_replace takes one value from each array in turn and uses each pair to search and replace on the subject.

In other words, does R (or some R package) have a function that performs the following?

string <- "The quick brown fox jumped over the lazy dog."
patterns     <- c("quick", "brown", "fox")
replacements <- c("slow",  "black", "bear")
xxx_replace_xxx(string, patterns, replacements)          ## ???
## [1] "The slow black bear jumped over the lazy dog."

So I am looking for something like chartr, but for search patterns and replacement strings of arbitrary length. This cannot be done with a single call to gsub(), because its replacement argument can only be a single string (see ?gsub). My current implementation is:

xxx_replace_xxx <- function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- gsub(patterns[i], replacements[i], string, fixed=TRUE)
   string
}

However, I am looking for something much faster when length(patterns) is large: I have a lot of data to process and I am dissatisfied with the current performance.
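One property of the loop-based approach is worth keeping in mind when comparing alternatives: the pairs are applied one after another, so a later pattern can match text produced by an earlier replacement. The sketch below restates the question's loop under a hypothetical name, replace_sequential (not a function from any package), and shows both the intended result and the cascading effect.

```r
# Hypothetical name; the body is the same loop as in the question,
# plus a length check on the two vectors.
replace_sequential <- function(string, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  for (i in seq_along(patterns)) {
    # fixed = TRUE: patterns[i] is matched literally, not as a regex
    string <- gsub(patterns[i], replacements[i], string, fixed = TRUE)
  }
  string
}

s <- "The quick brown fox jumped over the lazy dog."
replace_sequential(s, c("quick", "brown", "fox"), c("slow", "black", "bear"))
# [1] "The slow black bear jumped over the lazy dog."

# Order matters: the second pair also rewrites the output of the first.
replace_sequential("ab", c("a", "b"), c("b", "c"))
# [1] "cc"
```

Any faster replacement should either reproduce this sequential behaviour or document clearly that it does not.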

Toy data for benchmarking:

string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
patterns <- c("jak", "to", "do", "z", "na", "i", "w", "za", "tu", "gdy",
   "po", "jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy", "czy",
   "sam", "u", "tylko", "bez", "ich", "Telimena", "Wojski", "jeszcze")
replacements <- paste0(patterns, rev(patterns))

Recommended answer

For your example, using PCRE (perl=TRUE) instead of fixed matching takes about a third of the time on my machine.

xxx_replace_xxx_pcre <- function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- gsub(patterns[i], replacements[i], string, perl=TRUE)
   string
}
system.time(x <- xxx_replace_xxx(string, patterns, replacements))
#    user  system elapsed 
#   0.491   0.000   0.491 
system.time(p <- xxx_replace_xxx_pcre(string, patterns, replacements))
#    user  system elapsed 
#   0.162   0.000   0.162 
identical(x,p)
# [1] TRUE
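The two versions agree here because none of the example patterns contains a regex metacharacter. With perl=TRUE the patterns are interpreted as regular expressions and backslashes in the replacement become special, so the result could differ from fixed=TRUE on other inputs. A minimal sketch of one way to keep literal semantics, assuming no pattern contains the sequence \E, is to wrap each pattern in PCRE's \Q...\E quoting; the name replace_pcre_literal is hypothetical, and whether this variant keeps the timing advantage above has not been measured.

```r
# Hypothetical helper (not part of the original answer):
# a PCRE-based loop that still matches the patterns literally.
replace_pcre_literal <- function(string, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  # \Q...\E tells PCRE to treat the enclosed text literally.
  # Assumption: no pattern itself contains the sequence "\E".
  quoted <- paste0("\\Q", patterns, "\\E")
  # With perl = TRUE a backslash in the replacement is special,
  # so double each one to keep it literal.
  safe_repl <- gsub("\\", "\\\\", replacements, fixed = TRUE)
  for (i in seq_along(quoted)) {
    string <- gsub(quoted[i], safe_repl[i], string, perl = TRUE)
  }
  string
}

# "." is matched as a literal dot here, not as "any character".
replace_pcre_literal("a.b axb", "a.b", "X")
# [1] "X axb"
```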
