Fast partial string matching in R

Question

Given a vector of strings, texts, and a vector of patterns, patterns, I want to find any matching pattern for each text.

For small datasets, this can be easily done in R with grepl:

patterns = c("some","pattern","a","horse")
texts = c("this is a text with some pattern", "this is another text with a pattern")

# for each x in patterns
lapply( patterns, function(x){
  # match all texts against pattern x
  res = grepl( x, texts, fixed=TRUE )
  print(res)
  # do something with the matches
  # ...
})

This solution is correct, but it doesn't scale. Even with moderately bigger datasets (~500 texts and patterns), this code is embarrassingly slow, solving only about 100 cases per second on a modern machine, which is ridiculous considering that this is crude partial string matching without regular expressions (fixed=TRUE). Even making the lapply parallel does not solve the issue. Is there a way to rewrite this code efficiently?

Thanks, Mulone

Recommended answer

Use the stringi package: its stri_detect_fixed is even faster than grepl. Check the benchmarks below! I used the text from @Martin-Morgan's post.

require(stringi)
require(microbenchmark)

text = readLines("~/Desktop/pg100.txt")
pattern <-  strsplit("all the world's a stage and all the people players", " ")[[1]]

grepl_fun <- function(){
    lapply(pattern, grepl, text, fixed=TRUE)
}

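stri_fixed_fun <- function(){
    # the third positional argument (NA) is dropped: in current stringi
    # releases that position is `negate`, which must be TRUE or FALSE
    lapply(pattern, function(x) stri_detect_fixed(text, x))
}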

#        microbenchmark(grepl_fun(), stri_fixed_fun())
#    Unit: milliseconds
#                 expr      min       lq   median       uq      max neval
#          grepl_fun() 432.9336 435.9666 446.2303 453.9374 517.1509   100
#     stri_fixed_fun() 213.2911 218.1606 227.6688 232.9325 285.9913   100

# if you don't believe me that the results are equal, you can check :)
xx <- grepl_fun()
stri <- stri_fixed_fun()

for(i in seq_along(xx)){
    print(all(xx[[i]] == stri[[i]]))
}
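
The benchmark above returns one logical vector per pattern, while the question asks which patterns match each text. A minimal sketch of that last step, using the toy data from the question, might look like the following. The names detect_fixed, hits and matched are hypothetical and are not part of the original answer, and the fallback to grepl when stringi is not installed is an assumption made only so the snippet runs anywhere.

```r
# Toy data from the question.
patterns <- c("some", "pattern", "a", "horse")
texts <- c("this is a text with some pattern",
           "this is another text with a pattern")

# Hypothetical helper: use stringi when it is installed,
# otherwise fall back to base grepl with fixed = TRUE.
detect_fixed <- if (requireNamespace("stringi", quietly = TRUE)) {
  function(pattern, text) stringi::stri_detect_fixed(text, pattern)
} else {
  function(pattern, text) grepl(pattern, text, fixed = TRUE)
}

# One column of TRUE/FALSE per pattern, one row per text.
hits <- vapply(patterns, detect_fixed, logical(length(texts)), text = texts)
# Force a matrix shape even when there is only a single text.
hits <- matrix(hits, nrow = length(texts), dimnames = list(NULL, patterns))

# For each text, the patterns that occur in it.
matched <- lapply(seq_along(texts), function(i) patterns[hits[i, ]])
print(matched)
```

This still performs one pass over the texts per pattern, so it only reshapes the output; any speedup comes from the detection function itself.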
