Faster approach than gsub in R
Question
I'm trying to find out whether there is a faster approach than the vectorized gsub function in R. I have the following data frame with some "sentences" (sent$words), and a set of words to remove from those sentences (stored in the wordsForRemoving variable).
sent <- data.frame(words =
c("just right size and i love this notebook", "benefits great laptop",
"wouldnt bad notebook", "very good quality", "bad orgtop but great",
"great improvement for that bad product but overall is not good",
"notebook is not good but i love batterytop"),
user = c(1,2,3,4,5,6,7),
stringsAsFactors=F)
wordsForRemoving <- c("great","improvement","love","great improvement","very good","good",
"right", "very","benefits", "extra","benefit","top","extraordinarily",
"extraordinary", "super","benefits super","good","benefits great",
"wouldnt bad")
Then I create a "big data" simulation to measure the time consumption:
df.expanded <- as.data.frame(replicate(1000000,sent$words))
library(zoo)
sent <- coredata(sent)[rep(seq(nrow(sent)),1000000),]
rownames(sent) <- NULL
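The row replication above can also be sketched in base R alone, without zoo. This is only an illustration: the name n_copies and the much smaller copy count are assumptions chosen so the example runs quickly, not values from the question.

```r
# Seven-row sample data frame from the question.
sent <- data.frame(
  words = c("just right size and i love this notebook", "benefits great laptop",
            "wouldnt bad notebook", "very good quality", "bad orgtop but great",
            "great improvement for that bad product but overall is not good",
            "notebook is not good but i love batterytop"),
  user = c(1, 2, 3, 4, 5, 6, 7),
  stringsAsFactors = FALSE
)

# Hypothetical copy count; the question uses 1000000.
n_copies <- 1000

# Repeat the row indices 1..7 n_copies times and subset by them.
big <- sent[rep(seq_len(nrow(sent)), times = n_copies), ]
rownames(big) <- NULL

nrow(big)
```

Starting with a small n_copies and scaling up makes it easier to compare approaches before committing to a multi-minute run.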
Removing the words (wordsForRemoving) from sent$words with the following gsub approach takes 72.87 seconds. I know this is not a good simulation, but in reality I'm using a dictionary of more than 3,000 words on 300,000 sentences, and the overall processing takes over 1.5 hours.
pattern <- paste0("\\b(?:", paste(wordsForRemoving, collapse = "|"), ")\\b ?")
res <- gsub(pattern, "", sent$words)
# user system elapsed
# 72.87 0.05 73.79
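To see what the pattern does before timing it at scale, it may help to run it on a couple of the sample sentences. The sketch below reuses wordsForRemoving from the question; the two-element vector small is an assumption made for brevity.

```r
wordsForRemoving <- c("great", "improvement", "love", "great improvement", "very good", "good",
                      "right", "very", "benefits", "extra", "benefit", "top", "extraordinarily",
                      "extraordinary", "super", "benefits super", "good", "benefits great",
                      "wouldnt bad")

# One alternation of all words, wrapped in word boundaries, followed by an
# optional space so that the separator after a removed word is dropped too.
pattern <- paste0("\\b(?:", paste(wordsForRemoving, collapse = "|"), ")\\b ?")

# Two of the sample sentences (a reduced subset, for illustration only).
small <- c("benefits great laptop", "bad orgtop but great")
res_small <- gsub(pattern, "", small)
print(res_small)
```

Because of the \\b anchors, "top" inside "orgtop" is left alone. A word removed at the end of a sentence leaves the preceding space behind, since the pattern only consumes a space that follows the word.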
Could anyone help me write a faster approach for this task? Any help or advice is much appreciated. Thanks a lot in advance.
Recommended answer
As mentioned by Jason, stringi is a good option for you.
The following timings compare the original gsub call with stringi:
system.time(res <- gsub(pattern, "", sent$words))
user system elapsed
66.229 0.000 66.199
library(stringi)
system.time(stri_replace_all_regex(sent$words, pattern, ""))
user system elapsed
21.246 0.320 21.552
Update (thanks to Arun): passing perl = TRUE to gsub is faster still.
system.time(res <- gsub(pattern, "", sent$words, perl = TRUE))
user system elapsed
12.290 0.000 12.281
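Before swapping one engine for another, it seems worth checking that they return the same strings. The following is a minimal sketch under assumed inputs: the shortened word list, the three sentences, and the repeat count are illustrative choices, and the stringi comparison only runs if that package is installed.

```r
# Reduced inputs, for illustration only.
words <- c("great", "very good", "good", "benefits", "top")
sentences <- c("benefits great laptop", "very good quality", "bad orgtop but great")
pattern <- paste0("\\b(?:", paste(words, collapse = "|"), ")\\b ?")

# Hypothetical repeat count, far smaller than in the question.
x <- rep(sentences, times = 10000)

# Default regex engine versus PCRE (perl = TRUE).
t_default <- system.time(res_default <- gsub(pattern, "", x))
t_perl <- system.time(res_perl <- gsub(pattern, "", x, perl = TRUE))
stopifnot(identical(res_default, res_perl))

# Optional comparison with stringi (ICU regex engine), if available.
if (requireNamespace("stringi", quietly = TRUE)) {
  t_stringi <- system.time(
    res_stringi <- stringi::stri_replace_all_regex(x, pattern, "")
  )
  stopifnot(identical(res_perl, res_stringi))
}

# Elapsed seconds for the two gsub variants; values depend on the machine.
print(c(default = t_default[["elapsed"]], perl = t_perl[["elapsed"]]))
```

The three engines can differ on edge cases such as alternation order or Unicode word boundaries, so an equality check on real data is a cheap safeguard before relying on the faster call.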