Optimization of an R loop taking 18 hours to run

Question

I have R code that works and does what I want, but it takes a very long time to run. Below is an explanation of what the code does, followed by the code itself.

I have 200000 rows of street addresses (strings), stored in data. Example:

> data[150000,]
                              address 
"15 rue andre lalande residence marguerite yourcenar 91000 evry france" 

I also have a 131x2 matrix of strings, list_ngrams, holding 5-grams (fragments of words) and the ids of the bags of n-grams they belong to (example of a bag of 5-grams: ["stack", "tacko", "ackov", "ckove", "overf", ...]).

Example of list_ngrams:

  idSac ngram
1     4 stree
2     4 tree_ 
3     4 _stre
4     4 treet
5     5 avenu
6     5 _aven
7     5 venue
8     5 enue_

I also have a 200000x31 numeric matrix initialized with 0, idv_x_bags.

In total I have 131 5-grams and 31 bags of 5-grams.

I want to loop over the addresses and check whether each one contains any of the n-grams in my list. If it does, I put a 1 in the column whose name is the id of the bag containing that 5-gram. Example:

In the address "15 rue andre lalande residence marguerite yourcenar 91000 evry france", the word "residence" matches the bag ["resid","eside","dence",...], whose id is 5. So I put a 1 in the column named 5, and the corresponding row of the idv_x_bags matrix looks like this:

> idv_x_bags[150000,]
  4   5   6   8  10  12  13  15  17  18  22  26  29  34  35  36  42  43  45  46  47  48  52  55  81  82 108 114 119 122 123 
  0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 

Here is the code that does this:

library(sqldf)

# One row per address, one column per bag id, all zeros to start with
idv_x_bags <- matrix(0, nrow = nrow(data), ncol = 31)
colnames(idv_x_bags) <- sqldf("select distinct idSac from list_ngrams order by idSac")$idSac

for (i in 1:nrow(idv_x_bags))
{
  for (ngram in list_ngrams$ngram)
  {
    if (grepl(ngram, data[i, ]))
    {
      # Look up the bag id of the matching n-gram and flag that column
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'", ngram))[[1]]
      idv_x_bags[i, as.character(idSac)] <- 1
    }
  }
}

The code does exactly what I want, but it takes about 18 hours, which is far too long. I tried to rewrite it in C++ using the Rcpp library, but I ran into many problems. I also tried to rewrite it using apply, but I couldn't get it to work. Here is what I tried:

apply(cbind(data,1:nrow(data)),1,function(x){
  apply(list_ngrams,1,function(y){
   if(grepl(y[2],x[1])==TRUE){idv_x_bags[x[2],str_trim(as.character(y[1]))]<-1} 
  })
}) 

I need some help rewriting my loop with apply, or with some other method that runs faster than the current one. Thank you very much.

Recommended answer

Try this, running the simple example step by step to see how it works. My n-grams don't make much sense, but it will work with real n-grams as well.

 library(dplyr)
 library(reshape2)

 # your example dataset
 dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
 dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
                     ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = F)

 # sentence dataset
 dt_sen

                   sen
1 this is a good thing
2          this is bad


 #ngrams dataset
 dt_ngr

  id_ngr ngr
1      2  th
2      2  go
3      2  tt
4      3 drf
5      3 ytu
6      3 bad



 # create table of matches
 expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
   data.frame() %>%
   rename(sen = Var1,
          id_ngr = Var2) %>%
   left_join(dt_ngr, by = "id_ngr") %>%
   group_by(sen, id_ngr,ngr) %>%
   do(data.frame(match = grepl(.$ngr,.$sen))) %>%
   group_by(sen,id_ngr) %>%
   summarise(sum_success = sum(match)) %>%
   mutate(match = ifelse(sum_success > 0,1,0)) -> dt_full

 dt_full
Source: local data frame [4 x 4]
Groups: sen

                   sen id_ngr sum_success match
1 this is a good thing      2           2     1
2 this is a good thing      3           0     0
3          this is bad      2           1     1
4          this is bad      3           1     1


 # reshape table
 dt_full %>% dcast(., sen~id_ngr, value.var = "match")
                   sen 2 3
1 this is a good thing 1 0
2          this is bad 1 1
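
For the full 200000-address problem, a lighter alternative to the pipeline above might be to loop over the 131 n-grams rather than over the addresses, because grepl() is vectorized over the strings it searches. The following is only a minimal base R sketch of that idea, not the answer's method. It assumes the addresses are available as a plain character vector (called addresses here, e.g. data$address), that the n-grams are literal substrings (hence fixed = TRUE), and it uses made-up toy data, including a hypothetical bag id 6 for "resid".

```r
# Toy stand-ins for the real objects; the values are invented for illustration.
addresses <- c(
  "15 rue andre lalande residence marguerite yourcenar 91000 evry france",
  "12 avenue de la gare 75000 paris france"
)
list_ngrams <- data.frame(
  idSac = c(4, 4, 5, 5, 6),
  ngram = c("stree", "treet", "avenu", "venue", "resid"),
  stringsAsFactors = FALSE
)

# One row per address, one column per bag id, all zeros to start with.
bag_ids <- sort(unique(list_ngrams$idSac))
idv_x_bags <- matrix(0, nrow = length(addresses), ncol = length(bag_ids),
                     dimnames = list(NULL, as.character(bag_ids)))

# Loop over the n-grams, not the addresses: each grepl() call tests one
# n-gram against the whole address vector at once.
for (k in seq_len(nrow(list_ngrams))) {
  hit <- grepl(list_ngrams$ngram[k], addresses, fixed = TRUE)
  idv_x_bags[hit, as.character(list_ngrams$idSac[k])] <- 1
}

idv_x_bags
```

With this layout the loop body runs once per n-gram (131 times in the question) instead of once per address and n-gram pair, and no SQL lookup is needed inside the loop because the bag id is read from the same row of list_ngrams.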
