Memory and Performance using grepl on large data.table


Problem description

I'm running a simple command in R over a large dataset, and it is slow and uses too much memory. Here's an example using two rows, although my real dataset has 154 million rows:

library(data.table)
Dt <- data.table(title1 = c("The coolest song ever",
                            "The greatest music in the world"),
                 title2 = c("coolest song", "greatest music"))

# Row by row: is title2[x] found inside title1[x]?
Dt$Match <- sapply(seq_len(nrow(Dt)), function(x) grepl(Dt$title2[x], Dt$title1[x]))

The result of Dt$Match should be TRUE, TRUE. Before running this script I have about 12 GB of RAM free, but while this slow code runs, the memory gets used up.

Is there a more efficient way to get the same result, perhaps by leveraging the data.table package?

Solution

Use the stringi package; it is more performant.

stri_detect_fixed(Dt$title1, Dt$title2) should be what you're looking for.

Thanks to Frank, who found the exact data.table form of the answer:

Dt[, stri_detect_fixed(title1, title2)]
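To keep the result as a column, as the original Dt$Match assignment did, the same call can be combined with data.table's := operator, which adds the column by reference. This is a minimal sketch on the two-row example from the question; the column name Match is taken from the question, and how it behaves on 154 million rows has not been measured here.

```r
library(data.table)
library(stringi)

Dt <- data.table(
  title1 = c("The coolest song ever", "The greatest music in the world"),
  title2 = c("coolest song", "greatest music")
)

# stri_detect_fixed is vectorized over both arguments, so row i of title1
# is searched for the literal string in row i of title2, with no R-level loop.
# := adds the Match column in place instead of copying the table.
Dt[, Match := stri_detect_fixed(title1, title2)]

print(Dt$Match)
```

Because both arguments are whole columns, there is a single call into stringi rather than one grepl call per row.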

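The stringi functions with the suffix _fixed are faster than their _regex counterparts, because they treat the pattern as a literal string rather than as a regular expression.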
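One thing worth checking before switching: grepl as used in the question interprets title2 as a regular expression, while the _fixed functions match it literally, so results can differ when a pattern contains regex metacharacters. The sketch below uses made-up titles (not from the original data) to show the difference.

```r
library(stringi)

# Hypothetical values chosen so that the "." in the pattern matters.
titles   <- c("Mr. Brightside", "Mrs Brightside")
patterns <- c("Mr.", "Mr.")

# Literal match: the second title does not contain the text "Mr."
fixed_hits <- stri_detect_fixed(titles, patterns)

# Regex match: "." matches any character, so "Mrs" also matches
regex_hits <- stri_detect_regex(titles, patterns)

print(fixed_hits)  # TRUE FALSE
print(regex_hits)  # TRUE TRUE
```

If the titles are plain text and a literal substring test is what is wanted, the _fixed variant is the appropriate choice.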
