R编程:如何加速一个需要2个小时的循环,以及为什么它需要很多 [英] R programming :How to speed up a loop that takes 2 hours and the reasons why it takes a lot

查看:212
本文介绍了R编程:如何加速一个需要2个小时的循环,以及为什么它需要很多的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在做一个for循环来填充一个向量,但这个循环需要2小时。我不知道是否是因为也许我做的事情没有效率,或者如果它只是因为R是慢循环。我必须为这部分使用一个循环,因为我需要以前的值,所以我不能向量化操作。

I'm doing a for loop to fill a vector, but this loop is taking 2 hours. I don't know if it is because maybe I'm doing something ineffeciently or if it's just because R is slow with loops. I have to use a loop for this part because I need the previous value so I cannot vectorize the operation.

我使用的是包data.table。

我的笔记本电脑有8gb的RAM,英特尔核心i5 pro 2.3GHz。

R版本64位3.2.3

I'm using the package data.table.
My laptop has 8gb of ram, Intel core i5 pro 2.3GHz.
R version 64 bits 3.2.3

以下结构(按NUMDCRED和FDES升序排序):

The table has the following structure (It is sorted by NUMDCRED and FDES ascending):

NUMDCRED         FDES       Flag_Entrada_Mora  Flag_Salida_Mora   
 0001        "2012-01-01"         0                   0
 0001        "2012-03-01"         1                   0
 0001        "2012-04-01"         0                   0
 0002        "2011-01-01"         0                   0
 0002        "2011-02-01"         0                   0
 0002        "2011-03-01"         0                   0
 0003        "2012-05-01"         0                   0
 0003        "2012-06-01"         0                   1
 0003        "2012-07-01"         0                   0

变量FDES,FLAG_Entrada_Mora和FLAG_Salida_Mora来创建两个新变量Ult_Entrada_Mora和Ult_Salida_Mora。 Ult_Entrada_Mora注册NUMDCRED输入mora的最后日期,Ult_Salida_Mora注册NUMDCRED超出mora的最后日期。当每个NUMDCRED是第一个(这意味着出现NUMDCRED的第一个日期)时,Ult_Entrada_Mora必须是FDES值,并且该日期必须重复,直到每次Flag_Entrada为1时更新,每次NUM_DCRED为Ult_Salida_Mora首先它必须注册一个NA值,直到由Flag_Salida_Mora更新,并且这一个必须重复,直到更新等等。

The code uses the Variable FDES, FLAG_Entrada_Mora and FLAG_Salida_Mora to create two new variables Ult_Entrada_Mora and Ult_Salida_Mora. Ult_Entrada_Mora register the last date at which the NUMDCRED entered mora, Ult_Salida_Mora register the last date at which the NUMDCRED went out of mora. When each NUMDCRED is the first (with this I mean the first date at which a NUMDCRED appears) the Ult_Entrada_Mora must be the FDES value and that date must be repeated until is updated each time the Flag_Entrada is 1, with Ult_Salida_Mora each time the NUMDCRED is the first it must register an NA value until is updated by the Flag_Salida_Mora and this one must be repeated until is updated and so on.

在我的代码First_Numdcred_Index给我的行其中出现一个新的NUMDCRED,为什么我需要检查%in%如果i值属于这些索引之一。 aux_entrada和aux_salida只在每次我描述的事件之一更新时更新。

In my code First_Numdcred_Index give me the rows where a new NUMDCRED appears, that why I need to check with the %in% if the i value belongs to one of this indexes. aux_entrada and aux_salida are only updated each time one of the events I described before happen.

上面示例的表输出将是

NUMDCRED         FDES       Flag_Entrada_Mora  Flag_Salida_Mora Ult_En_Mo
 0001        "2012-01-01"         0                   0         "2012-01-01"
 0001        "2012-03-01"         1                   0         "2012-03-01"
 0001        "2012-04-01"         0                   0         "2012-03-01"
 0002        "2011-01-01"         0                   0         "2011-01-01"
 0002        "2011-02-01"         0                   0         "2011-01-01"
 0002        "2011-03-01"         0                   0         "2011-01-01"
 0003        "2012-05-01"         0                   0         "2012-05-01"
 0003        "2012-06-01"         0                   1         "2012-05-01"
 0003        "2012-07-01"         0                   0         "2012-05-01"

Ult_Salida_Mora
     NA
     NA
     NA
     NA 
     NA
     NA
     NA
   "2012-06-01"
   "2012-06-01"


$ b b

这是我用来运行循环的代码( n2 = 648,385

code> First_NumCred_index 是包含表的一系列行索引的向量。其长度为 148,982 和类数字

FDES 是类 IDate Flag_Entrada Flag_Salida 数字

First_NumCred_index is a vector containing a series of row index of the table. Its length is 148,982 and class numeric.
FDES is class IDate and Flag_Entrada and Flag_Salida are numeric.

n2 <- length(Poblacion_Morosa3$NUMDCRED)
Ult_Entrada_Mora <- seq(as.IDate("2020-01-01"),by = "month",length.out = n2)
#vector(mode = "character",length=n2)
Ult_Salida_Mora <- seq(as.IDate("2020-01-01"),by = "month",length.out = n2)

aux_entrada <- as.IDate("2005-01-01")
aux_salida <- as.IDate("2005-01-01")

for(i in 1:n2){ 

 if(i %in% First_NumdCred_index){

    aux_entrada <- Poblacion_Morosa3[i,FDES]
    aux_salida <- NA
   } else if(Poblacion_Morosa3[i,Flag_Entrada_Mora] == 1){

     aux_entrada <- Poblacion_Morosa3[i,FDES]
   } else if(Poblacion_Morosa3[i,Flag_Salida_Mora] == 1){

    aux_salida <- Poblacion_Morosa3[i,FDES]
   }

  Ult_Entrada_Mora[i] <- aux_entrada
  Ult_Salida_Mora[i] <- aux_salida
}

我想知道如果它是正常的, 。

I would like to know if it is normal that it takes more tan 2 hours to run or if I'm doing something inefficiently.

推荐答案

在我看来, findInterval() 最适合这个问题的功能。除了在行序列中的已知标记,它们改变为已知的值,固定( NA )或在输入框架内查找 FDES 列)。我们可以使用 findInterval()根据所需的逻辑查找最接近的上一个标记,并用获胜标记索引索引一个目标值的向量。

In my opinion, findInterval() is the most appropriate function for this problem. Your intermediate variables basically retain their previous values except at known marks within the row sequence, where they change to known values, either fixed (NA) or looked up within the input frame (FDES column). We can use findInterval() to find the closest-previous mark based on the required logic, and index a vector of target values with the winning mark index.

## libs
library(data.table);

## generate test data
set.seed(4L);
n2 <- 648385L;
Poblacion_Morosa3 <- data.table(
    NUMDCRED=sprintf('%04d',cumsum(c(T,sample(c(rep(F,3L),T),n2-1L,replace=T)))), ## avg 4 rows per num
    FDES=seq(as.IDate('2011-01-01'),by=1,len=n2),
    Flag_Entrada_Mora=sample(c(rep(0L,5L),1L),n2,replace=T), ## avg 6 rows per flag
    Flag_Salida_Mora=sample(c(rep(0L,5L),1L),n2,replace=T) ## ditto
);

## solution
system.time({
    findLastIndex <- function(iall,imark) c(0L,imark)[findInterval(iall,imark)+1L];
    n2 <- nrow(Poblacion_Morosa3);
    row.seq <- seq_len(n2);
    num.start <- c(T,Poblacion_Morosa3[,NUMDCRED[-.N]!=NUMDCRED[-1L]]);
    entrada.fdes <- findLastIndex(row.seq,which(num.start | Poblacion_Morosa3[,Flag_Entrada_Mora==1]));
    Ult_Entrada_Mora <- Poblacion_Morosa3[entrada.fdes,FDES];
    salida.na <- findLastIndex(row.seq,which(num.start));
    salida.fdes <- findLastIndex(row.seq,which(Poblacion_Morosa3[,Flag_Salida_Mora==1]));
    Ult_Salida_Mora <- c(as.IDate(NA),Poblacion_Morosa3[,FDES])[ifelse(salida.fdes>=salida.na,salida.fdes+1L,1L)];
});
##   user  system elapsed
##  0.328   0.047   0.374







## show result
head(cbind(Poblacion_Morosa3,Ult_Entrada_Mora,Ult_Salida_Mora),50L);
##     NUMDCRED       FDES Flag_Entrada_Mora Flag_Salida_Mora Ult_Entrada_Mora Ult_Salida_Mora
##  1:     0001 2011-01-01                 0                0       2011-01-01            <NA>
##  2:     0001 2011-01-02                 0                0       2011-01-01            <NA>
##  3:     0001 2011-01-03                 1                0       2011-01-03            <NA>
##  4:     0001 2011-01-04                 0                0       2011-01-03            <NA>
##  5:     0001 2011-01-05                 0                0       2011-01-03            <NA>
##  6:     0002 2011-01-06                 0                0       2011-01-06            <NA>
##  7:     0002 2011-01-07                 0                0       2011-01-06            <NA>
##  8:     0002 2011-01-08                 0                0       2011-01-06            <NA>
##  9:     0003 2011-01-09                 1                0       2011-01-09            <NA>
## 10:     0004 2011-01-10                 1                0       2011-01-10            <NA>
## 11:     0004 2011-01-11                 0                0       2011-01-10            <NA>
## 12:     0005 2011-01-12                 0                0       2011-01-12            <NA>
## 13:     0005 2011-01-13                 1                0       2011-01-13            <NA>
## 14:     0005 2011-01-14                 0                0       2011-01-13            <NA>
## 15:     0006 2011-01-15                 0                1       2011-01-15      2011-01-15
## 16:     0006 2011-01-16                 0                0       2011-01-15      2011-01-15
## 17:     0006 2011-01-17                 0                1       2011-01-15      2011-01-17
## 18:     0007 2011-01-18                 1                0       2011-01-18            <NA>
## 19:     0007 2011-01-19                 0                0       2011-01-18            <NA>
## 20:     0008 2011-01-20                 0                0       2011-01-20            <NA>
## 21:     0009 2011-01-21                 0                0       2011-01-21            <NA>
## 22:     0009 2011-01-22                 1                0       2011-01-22            <NA>
## 23:     0010 2011-01-23                 0                1       2011-01-23      2011-01-23
## 24:     0010 2011-01-24                 0                1       2011-01-23      2011-01-24
## 25:     0010 2011-01-25                 1                0       2011-01-25      2011-01-24
## 26:     0010 2011-01-26                 0                0       2011-01-25      2011-01-24
## 27:     0011 2011-01-27                 0                0       2011-01-27            <NA>
## 28:     0011 2011-01-28                 0                0       2011-01-27            <NA>
## 29:     0012 2011-01-29                 0                1       2011-01-29      2011-01-29
## 30:     0012 2011-01-30                 0                0       2011-01-29      2011-01-29
## 31:     0012 2011-01-31                 1                0       2011-01-31      2011-01-29
## 32:     0012 2011-02-01                 0                0       2011-01-31      2011-01-29
## 33:     0012 2011-02-02                 0                0       2011-01-31      2011-01-29
## 34:     0013 2011-02-03                 0                0       2011-02-03            <NA>
## 35:     0013 2011-02-04                 1                0       2011-02-04            <NA>
## 36:     0013 2011-02-05                 1                0       2011-02-05            <NA>
## 37:     0014 2011-02-06                 0                1       2011-02-06      2011-02-06
## 38:     0014 2011-02-07                 0                0       2011-02-06      2011-02-06
## 39:     0014 2011-02-08                 0                0       2011-02-06      2011-02-06
## 40:     0014 2011-02-09                 0                1       2011-02-06      2011-02-09
## 41:     0014 2011-02-10                 1                0       2011-02-10      2011-02-09
## 42:     0015 2011-02-11                 0                0       2011-02-11            <NA>
## 43:     0015 2011-02-12                 0                0       2011-02-11            <NA>
## 44:     0015 2011-02-13                 0                0       2011-02-11            <NA>
## 45:     0015 2011-02-14                 0                1       2011-02-11      2011-02-14
## 46:     0016 2011-02-15                 1                0       2011-02-15            <NA>
## 47:     0016 2011-02-16                 0                0       2011-02-15            <NA>
## 48:     0017 2011-02-17                 0                0       2011-02-17            <NA>
## 49:     0018 2011-02-18                 0                0       2011-02-18            <NA>
## 50:     0018 2011-02-19                 0                0       2011-02-18            <NA>
##     NUMDCRED       FDES Flag_Entrada_Mora Flag_Salida_Mora Ult_Entrada_Mora Ult_Salida_Mora






以下是您的新测试数据的演示:


Here's a demo on your new test data:

## libs
library(data.table);

## generate test data
Poblacion_Morosa3 <- data.table(
    NUMDCRED=c('0001','0001','0001','0002','0002','0002','0003','0003','0003'),
    FDES=c('2012-01-01','2012-03-01','2012-04-01','2011-01-01','2011-02-01','2011-03-01','2012-05-01','2012-06-01','2012-07-01'),
    Flag_Entrada_Mora=c(0,1,0,0,0,0,0,0,0),
    Flag_Salida_Mora=c(0,0,0,0,0,0,0,1,0)
);
Poblacion_Morosa3[,FDES:=as.IDate(FDES)]; ## require correct type for FDES

## solution
system.time({
    findLastIndex <- function(iall,imark) c(0L,imark)[findInterval(iall,imark)+1L];
    n2 <- nrow(Poblacion_Morosa3);
    row.seq <- seq_len(n2);
    num.start <- c(T,Poblacion_Morosa3[,NUMDCRED[-.N]!=NUMDCRED[-1L]]);
    entrada.fdes <- findLastIndex(row.seq,which(num.start | Poblacion_Morosa3[,Flag_Entrada_Mora==1]));
    Ult_Entrada_Mora <- Poblacion_Morosa3[entrada.fdes,FDES];
    salida.na <- findLastIndex(row.seq,which(num.start));
    salida.fdes <- findLastIndex(row.seq,which(Poblacion_Morosa3[,Flag_Salida_Mora==1]));
    Ult_Salida_Mora <- c(as.IDate(NA),Poblacion_Morosa3[,FDES])[ifelse(salida.fdes>=salida.na,salida.fdes+1L,1L)];
});
##   user  system elapsed
##  0.000   0.000   0.003







## show result
cbind(Poblacion_Morosa3,Ult_Entrada_Mora,Ult_Salida_Mora);
##    NUMDCRED       FDES Flag_Entrada_Mora Flag_Salida_Mora Ult_Entrada_Mora Ult_Salida_Mora
## 1:     0001 2012-01-01                 0                0       2012-01-01            <NA>
## 2:     0001 2012-03-01                 1                0       2012-03-01            <NA>
## 3:     0001 2012-04-01                 0                0       2012-03-01            <NA>
## 4:     0002 2011-01-01                 0                0       2011-01-01            <NA>
## 5:     0002 2011-02-01                 0                0       2011-01-01            <NA>
## 6:     0002 2011-03-01                 0                0       2011-01-01            <NA>
## 7:     0003 2012-05-01                 0                0       2012-05-01            <NA>
## 8:     0003 2012-06-01                 0                1       2012-05-01      2012-06-01
## 9:     0003 2012-07-01                 0                0       2012-05-01      2012-06-01

这篇关于R编程:如何加速一个需要2个小时的循环,以及为什么它需要很多的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆