Memory and Performance using grepl on large data.table
Question
I'm performing a simple command in R over a large dataset, and the result is slow and uses too much memory. Here's an example using two rows, although my real dataset has 154 million rows:
library(data.table)
Dt <- data.table(title1 = c("The coolest song ever",
                            "The greatest music in the world"),
                 title2 = c("coolest song", "greatest music"))
Dt$Match <- sapply(seq_len(nrow(Dt)), function(x) grepl(Dt$title2[x], Dt$title1[x]))
The result of Dt$Match should be TRUE, TRUE. Before running this script I have about 12 GB of RAM free, but as this slow code runs, that memory is used up.
Is there a more efficient way to get the same results? Perhaps leveraging the data.table package?
Answer

Use the stringi library; it's more performant.

stri_detect_fixed(Dt$title1, Dt$title2)

should be what you're looking for.
(Thanks to Frank, who actually found the exact data.table answer:

Dt[, stri_detect_fixed(title1, title2)]

The functions with the suffix ..._fixed are faster than the _regex ones.)
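Putting the pieces of the answer together, here is a minimal end-to-end sketch. It only uses functions already named above (data.table and stringi's stri_detect_fixed); the key point is that stri_detect_fixed is vectorized over both arguments, so the whole column pair is matched in one call instead of one grepl call per row:

```r
library(data.table)
library(stringi)

Dt <- data.table(title1 = c("The coolest song ever",
                            "The greatest music in the world"),
                 title2 = c("coolest song", "greatest music"))

# Vectorized fixed-string matching: no per-row sapply() loop and no
# regex compilation, which is where grepl() spends time and memory here.
Dt[, Match := stri_detect_fixed(title1, title2)]

print(Dt$Match)  # TRUE TRUE
```

Because title2 values like "coolest song" could contain regex metacharacters in a real dataset, treating them as fixed strings is also safer, not just faster.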