Faster approach than gsub in R

Views: 43 · Published: 2021/7/6 19:38:16 · Tags: regex, r


Question

I'm trying to find out whether there is a faster approach than the vectorized gsub function in R. I have the following data frame with some "sentences" (sent$words), and a set of words to remove from these sentences (stored in the wordsForRemoving variable).

sent <- data.frame(words = 
                     c("just right size and i love this notebook", "benefits great laptop",
                       "wouldnt bad notebook", "very good quality", "bad orgtop but great",
                       "great improvement for that bad product but overall is not good", 
                       "notebook is not good but i love batterytop"), 
                   user = c(1,2,3,4,5,6,7),
                   stringsAsFactors = FALSE)

wordsForRemoving <- c("great","improvement","love","great improvement","very good","good",
                      "right", "very","benefits", "extra","benefit","top","extraordinarily",
                      "extraordinary", "super","benefits super","good","benefits great",
                      "wouldnt bad")

Then I create a "big data" simulation to measure the running time:

df.expanded <- as.data.frame(replicate(1000000,sent$words))
library(zoo)
sent <- coredata(sent)[rep(seq(nrow(sent)),1000000),]
rownames(sent) <- NULL
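
If the goal of this step is only to repeat each row many times, a base R sketch along these lines may be enough. It assumes zoo is not needed for anything else; `sent_small`, `n_copies`, `idx` and `sent_big` are hypothetical names used for illustration, and the replication factor is kept small so the example runs quickly.

```r
# Two of the sentences from above, kept small for illustration
sent_small <- data.frame(
  words = c("very good quality", "bad orgtop but great"),
  user = c(4, 5),
  stringsAsFactors = FALSE
)

n_copies <- 1000  # hypothetical replication factor; the question uses 1000000

# Repeat the whole block of row indices n_copies times, then index by row
idx <- rep(seq_len(nrow(sent_small)), times = n_copies)
sent_big <- sent_small[idx, ]
rownames(sent_big) <- NULL

nrow(sent_big)
```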

Removing the words (wordsForRemoving) from sent$words with the following gsub approach takes 72.87 seconds. I know this is not a great simulation, but in reality I am using a dictionary of more than 3,000 words on 300,000 sentences, and the overall processing takes over 1.5 hours.

pattern <- paste0("\\b(?:", paste(wordsForRemoving, collapse = "|"), ")\\b ?")
res <- gsub(pattern, "", sent$words)

#  user  system elapsed 
# 72.87    0.05   73.79
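
To see what this kind of pattern does, the sketch below applies a shortened version of it to three of the sentences. The names `demo_words`, `demo_sentences`, `demo_pattern`, `demo_res` and `demo_clean` are hypothetical, and the `trimws()` step is an optional addition that is not part of the original approach.

```r
demo_words <- c("great", "improvement", "love", "very good", "good", "very", "top")
demo_sentences <- c("very good quality",
                    "bad orgtop but great",
                    "notebook is not good but i love batterytop")

# \\b anchors each alternative at word boundaries, so "top" inside "orgtop"
# or "batterytop" is left alone; the trailing " ?" also removes one space.
demo_pattern <- paste0("\\b(?:", paste(demo_words, collapse = "|"), ")\\b ?")

demo_res <- gsub(demo_pattern, "", demo_sentences)
print(demo_res)

# A removed word at the end of a sentence leaves a trailing space behind;
# trimws() is one way to tidy that up.
demo_clean <- trimws(demo_res)
print(demo_clean)
```

Because the whole dictionary is collapsed into a single alternation, each sentence is scanned once rather than once per word, so the cost is dominated by how quickly the regex engine handles a large alternation.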

Could anyone help me write a faster approach for this task? Any help or advice is much appreciated. Thanks a lot in advance.

Recommended answer

As Jason mentioned, stringi is a good option for you.

Here is how stringi performs compared with plain gsub:

system.time(res <- gsub(pattern, "", sent$words))
   user  system elapsed 
 66.229   0.000  66.199 

library(stringi)
system.time(stri_replace_all_regex(sent$words, pattern, ""))
   user  system elapsed 
 21.246   0.320  21.552

Update (thanks to Arun): gsub with perl = TRUE is faster still.

system.time(res <- gsub(pattern, "", sent$words, perl = TRUE))
   user  system elapsed 
 12.290   0.000  12.281
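
The three variants timed above can be compared on a smaller sample with a sketch like the following. The names `words`, `pat`, `x` and the `t_*`/`res_*` variables are hypothetical, the input is a reduced version of the question's data, and the stringi call is skipped when the package is not installed. Timings depend on the machine and R version, so no expected numbers are shown.

```r
words <- c("great", "improvement", "love", "very good", "good", "very", "benefits")
pat <- paste0("\\b(?:", paste(words, collapse = "|"), ")\\b ?")

x <- rep(c("very good quality",
           "benefits great laptop",
           "notebook is not good but i love batterytop"), times = 10000)

# Default regex engine (TRE)
t_default <- system.time(res_default <- gsub(pat, "", x))

# PCRE engine, selected with perl = TRUE
t_perl <- system.time(res_perl <- gsub(pat, "", x, perl = TRUE))

# ICU engine via stringi, only if the package is installed
if (requireNamespace("stringi", quietly = TRUE)) {
  t_stringi <- system.time(
    res_stringi <- stringi::stri_replace_all_regex(x, pat, "")
  )
  print(t_stringi)
}

print(t_default)
print(t_perl)

# Check that both base R variants produce the same output for this input
print(identical(res_default, res_perl))
```

Since perl = TRUE only switches the regex engine used by gsub, it needs no extra package, which makes it a cheap first thing to try before adding a dependency.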
