Optimization of an R loop taking 18 hours to run
Question
I have R code that works and does what I want, but it takes a very long time to run. Below is an explanation of what the code does, followed by the code itself.
I have a vector of 200000 rows containing street addresses (strings): data. Example:
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
I also have a 131x2 matrix of strings, list_ngrams, which holds 5-grams (parts of words) and the ids of the bags of n-grams they belong to (example of a bag of 5-grams: ["stack", "tacko", "ackov", "ckove", "overf", ...]).
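To make the notion of a 5-gram concrete, here is a minimal sketch of how the character 5-grams of a single word can be produced in base R. The helper name char_ngrams is hypothetical and is not part of the code in this question.

```r
# Hypothetical helper (not from the original code):
# returns all character n-grams of a single string.
char_ngrams <- function(word, n = 5) {
  len <- nchar(word)
  if (len < n) return(character(0))   # word too short: no n-grams
  starts <- seq_len(len - n + 1)      # start position of each n-gram
  substring(word, starts, starts + n - 1)
}

char_ngrams("stackoverflow")
```

A bag of n-grams is then just a set of such strings sharing one id.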
Example of list_ngrams:
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I also have a 200000x31 numeric matrix initialized with 0: idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop over the address strings and check whether each one contains any of the n-grams in my list. If it does, I put a 1 in the column corresponding to the id of the bag that contains that 5-gram. Example:
In this address: "15 rue andre lalande residence marguerite yourcenar 91000 evry france", the word "residence" matches the bag ["resid","eside","dence",...], whose id is 5. So I put a 1 in the column named 5, and the corresponding row of the idv_x_bags matrix looks like this:
> idv_x_bags[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does this:
library(sqldf)
# one row per address, one column per bag id, initialized to 0
idv_x_bags <- matrix(rep(0,nrow(data)*31),nrow=nrow(data),ncol=31)
colnames(idv_x_bags) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for(i in 1:nrow(idv_x_bags))
{
  for(ngram in list_ngrams$ngram)
  {
    if(grepl(ngram,data[i,])==TRUE)
    {
      # look up the bag id of the matching n-gram
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'",ngram))[[1]]
      idv_x_bags[i,as.character(idSac)] <- 1
    }
  }
}
The code does exactly what I want, but it takes about 18 hours, which is far too long. I tried to rewrite it in C++ using the Rcpp library but ran into many problems. I also tried to rewrite it using apply, but I couldn't get it to work. Here is what I tried:
apply(cbind(data,1:nrow(data)),1,function(x){
  apply(list_ngrams,1,function(y){
    if(grepl(y[2],x[1])==TRUE){idv_x_bags[x[2],str_trim(as.character(y[1]))]<-1}
  })
})
I need help rewriting my loop with apply or some other method that runs faster than the current one. Thank you very much.
Recommended answer
Check this one and run the simple example step by step to see how it works. My n-grams don't make much sense, but it will work with actual n-grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
  data.frame() %>%
  rename(sen = Var1,
         id_ngr = Var2) %>%
  left_join(dt_ngr, by = "id_ngr") %>%
  group_by(sen, id_ngr, ngr) %>%
  do(data.frame(match = grepl(.$ngr, .$sen))) %>%
  group_by(sen, id_ngr) %>%
  summarise(sum_success = sum(match)) %>%
  mutate(match = ifelse(sum_success > 0, 1, 0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1
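As a further option not taken from the answer above, the original loop can be turned around: grepl is vectorized over its second argument, so looping over the 131 n-grams and testing all addresses at once needs 131 calls instead of roughly 200000 x 131, and it avoids the sqldf query issued on every match. The sketch below assumes the addresses are available as a character vector and that the n-grams should be matched literally (fixed = TRUE); the function name match_bags and the toy data are made up for illustration.

```r
# Hypothetical helper (not from the original post):
# builds a 0/1 matrix with one row per address and one column per bag id.
match_bags <- function(addresses, list_ngrams) {
  bag_ids <- sort(unique(list_ngrams$idSac))
  out <- matrix(0L, nrow = length(addresses), ncol = length(bag_ids),
                dimnames = list(NULL, as.character(bag_ids)))
  for (k in seq_len(nrow(list_ngrams))) {
    # grepl tests every address in one call;
    # fixed = TRUE treats the n-gram as a literal string (an assumption)
    hit <- grepl(list_ngrams$ngram[k], addresses, fixed = TRUE)
    out[hit, as.character(list_ngrams$idSac[k])] <- 1L
  }
  out
}

# Toy data, made up for illustration
addresses <- c("15 rue andre lalande residence marguerite yourcenar 91000 evry france",
               "10 downing street london",
               "3 avenue foch paris")
list_ngrams <- data.frame(idSac = c(4, 4, 5, 5),
                          ngram = c("stree", "treet", "avenu", "venue"),
                          stringsAsFactors = FALSE)

idv_x_bags <- match_bags(addresses, list_ngrams)
idv_x_bags
```

On the full data this would be called as something like match_bags(data$address, list_ngrams), assuming data is a data frame with a column named address, as the printed example in the question suggests.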