重复子集化可以使此过程更快 [英] Repeated subsetting can you make this process faster

查看:81
本文介绍了重复子集化可以使此过程更快的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要循环遍历数据表2 DT2,并为每一行检查数据表1 DT1中的某些条件是否匹配。

I need to loop through data table2 DT2 and for each row check if there is a match on some conditions in data table1 DT1.

我现在这样做。这看起来DT1的数据我在DT2。在循环中你将看到:

I currently do this like this. This looks in DT1 for the data I have in DT2. In the loop you will see:

  subset = DT1[DT1$time == DT2$time[i] & DT1$a == DT2$a[i] & DT1$b == DT2$b[i] & (DT1$cat == DT2$cat1[i] | DT1$cat == DT2$cat2[i] )  & DT1$Flag ==0]

如果有匹配,我需要

1 - 标记在DT1中匹配的行,因此我不再匹配。在使用 DT1 $ Flag [match $ RowNumber] = 1完成的代码中

1 - flag the row that matched in DT1 so I do not match on it again. In the code that is done with DT1$Flag[match$RowNumber] = 1

2 - 填充列在DT2中与来自DT1中对应列的数据
在使用

2 - populate a column in DT2 with data from the corresponding column in DT1 In the code that is done with

  DT2$x[i]  = match$x 

这实际上是这样工作的,但是DT1可能有10,000行,DT2可能是100,000行,因此对于每个10,000行我正在子集。子集设置10,000次!

That's pretty much it and this works BUT DT1 might be 10,000 rows and DT2 might be 100,000 rows so for each 10,000 rows I am subsetting. Subsetting 10,000 times!

还有第二个要求,即循环具有匹配EXACTLY的匹配条件。在代码中,您看到

There is also a second requirement which is the loop has a match condition of the matching EXACTLY on time. In the code you see

DT1$time == DT2$time[i]

完成匹配后。那么时间条件被放宽,并且运行具有+/- 7秒的时间窗口的第二循环,因此在第二循环中,可以在DT1中找到距离DT2中的时间+/- 7秒的匹配。 p>

After that matching is done. that time condition is relaxed and a second loop is run that has a time window of +/- 7 seconds so in the second loop you may find a match in DT1 that is +/- 7 seconds away from the time in DT2.

DT1[  DT1$time >= DT2$time[i] -7 & DT1$time <= DT2$time[i]+7

数据表有一些行,子集化需要很长的时间。这两个循环可以以某种方式做得更快吗?

This works as well but again because the data tables have some many rows the subsetting takes a very long time. Can these 2 loops be made faster somehow?

注意子集中的OR条件会使这个棘手。

Notice the OR condition in the subset which makes this tricky.

 (DT1$cat == DT2$cat1[i] | DT1$cat == DT2$cat2[i] ) 

示例代码在下面可以运行。谢谢。

Sample code is below which you can run. Thank you.

############# Here is the setup of the datatables
times= rep(as.POSIXct("2016-01-01",tz="GMT")+seq(1,10,by = 1),2)
times= times[order(times)]
DT1 = data.table(time = times, a = c(1,seq(1,19,1)) , b = c(11,seq(11,29,1)) , cat= c("a","a",  rep(c("a","b"),each=9) ) ,Flag =rep(0,20)  ,x = seq(201,220,1) )
DT1$RowNumber = seq(1,dim(DT1)[1],1)
DT2 = data.table(time = as.POSIXct(c("2016-01-01 00:00:01","2016-01-01 00:00:10","2016-01-01 00:00:10"),tz = "GMT"), a = c(1,19,10),b=c(11,29,20), cat1 = c("a","x","b"), x = c(0,0,0),MatchType = c("none","none","none"), cat2=c("a","b","a"))


 ######### This is the for loop that does the matching
#If there is a match i.e. dim(subset)[1]>0 two things happen
# 2 - flag the row used in DT1 so it is not used again...notice DT1$Flag ==0 is used in the subset
# 1 - populate column x in dt2 from column x in dt1
for(i in 1:dim(DT2)[1])#loop over rows of dt2
{
  #i =1
  subset = DT1[DT1$time == DT2$time[i] & DT1$a == DT2$a[i] & DT1$b == DT2$b[i] & (DT1$cat == DT2$cat1[i] | DT1$cat == DT2$cat2[i] )  & DT1$Flag ==0] #lookin dt2 for the dt1 data
  if(dim(subset)[1]>0)
  {
    match = head(subset,1) # if there are multiple matches only use the 1st one
    DT1$Flag[match$RowNumber] = 1 #flag the row used in DT1 so it is not used again
    DT2$MatchType[i]  = "First Loop"#populate column x in dt2 from column x in dt1
    DT2$x[i]  = match$x #populate column x in dt2 from column x in dt1
  }
}
##### after that loop some rows in DT2 will not have a match. In this case the last row has HasAMatch = 0
DT2 # NOTE HERE that  the last row has Match Type = none because a match could not be found
DT1 # NOTE the flag column has a 1 in the first and last rows which was set in the loop when the match occured

##### Now a second loop is done this time trying to match within a time window +-7 seconds instead of a matching EXACTLEY on time
firstloop = DT2[DT2$MatchType != "none",] ### this removes any of the rows ALREADY MATCHED IN THE FIRST LOOP
DT2 =DT2[DT2$MatchType == "none",] ### this is used in the loop below and has the rows that have NOT been matched yet
DT1 = DT1[DT1$Flag == 0,] ## this again removes rows from DT1 that have already been matched
DT1$RowNumber = seq(1,dim(DT1)[1],1)
for(i in 1:dim(DT2)[1])#loop over rows of dt2
{
  i=1
  subset = DT1[  DT1$time >= DT2$time[i] -7 & DT1$time <= DT2$time[i]+7  & DT1$a == DT2$a[i] & DT1$b == DT2$b[i] & (DT1$cat == DT2$cat1[i] | DT1$cat == DT2$cat2[i] )  & DT1$Flag ==0] #lookin dt2 for the dt1 data
  if(dim(subset)[1]>0)
  {
    match = head(subset,1) # if there are multiple matches only use the 1st one
    DT1$Flag[match$RowNumber] = 1 #flag the row used in DT1 so it is not used again
    DT2$MatchType[i]  = "Second Loop" #populate column x in dt2 from column x in dt1
    DT2$x[i]  = match$x #populate column x in dt2 from column x in dt1
  }
}

# now the process is finished
rbind(firstloop, DT2) # NOTE now you can see the match type of "second loop" for the last row
DT1 # NOTE the flag in row 10 because that was the row used in the match


推荐答案

我应该提到,你必须尝试防止在您的数据集中使用函数名称af变量名称: cat subset 是R中的函数,因此我使用 cat0 而不是 cat subs 而不是 subset 。您的代码有几个可能的改进:

First I should mention that you have to try to prevent from using function names af variable names in your datasets: cat and subset are a functions in R, therefore I used cat0 instead of cat and subs instead of subset in this answer. There are several possible improvements to your code:

创建示例数据

特别是创建 RowNumber 变量可以使用 .I 更高效。此外,我还给了 DT2 一个rownumber变量,因为这在后续步骤中是有用的:

Especially create the RowNumber variable can be done more efficient with .I. Furthermore, I've also given DT2 a rownumber variable as this is usefull in the next steps:

times <- rep(as.POSIXct("2016-01-01",tz="GMT") + seq(1,10,by = 1), 2)
times <- times[order(times)]

DT1 <- data.table(time = times, 
                  a = c(1,1:19), 
                  b = c(11,11:29), 
                  cat0 = c("a","a", rep(c("a","b"), each=9)),
                  Flag = rep(0,20),
                  x = seq(201,220,1))[, rn := .I]

DT2 <- data.table(time = as.POSIXct(c("2016-01-01 00:00:01","2016-01-01 00:00:10","2016-01-01 00:00:10"), tz="GMT"), 
                  a = c(1,19,10),
                  b = c(11,29,20), 
                  cat1 = c("a","x","b"), 
                  x = c(0,0,0),
                  MatchType = c("none","none","none"), 
                  cat2 = c("a","b","a"))[, rn := .I]

第一个循环:

这可以通过取出 DT1 的更新并使用更新:= (这可能会比在for循环中更有效率)的 data.table 包:

This can be simplified by taking out the updating of DT1 and making use of the update by reference possibility of the data.table package with := (which will probably be much more efficient than doing it inside the for-loop):

for(i in 1:nrow(DT2))
{
  subs <- DT1[time == DT2$time[i] & 
                a == DT2$a[i] & 
                b == DT2$b[i] & 
                (cat0 == DT2$cat1[i] | cat0 == DT2$cat2[i])
              & Flag == 0] 
  if(nrow(subs) > 0)
  {
    DT2[i, `:=` (MatchType = 'First Loop', x = subs$x[1])]
  }
}

DT1 :可以通过创建索引并再次使用:= 更新:

idx1 <- DT1[(time %in% DT2$time) & (a %in% DT2$a) & (b %in% DT2$b) & 
              (cat0 %in% DT2$cat1 | cat0 %in% DT2$cat2) & (Flag == 0), 
            .SD[1], 
            .(time,a,b,cat0,Flag)]$rn
DT1[idx1, Flag := 1]

第二个循环

for(i in DT2[MatchType == "none"]$rn)   # here we need the rownumber variable for DT2
{
  subs <- DT1[time >= DT2$time[i]-7 & 
                time <= DT2$time[i]+7 & 
                a == DT2$a[i] & 
                b == DT2$b[i] & 
                (cat0 == DT2$cat1[i] | cat0 == DT2$cat2[i] )  
              & Flag == 0]
  if(nrow(subs) > 0)
  {
    DT1[subs$rn[1], Flag := 2]
    DT2[i, `:=` (MatchType = 'Second Loop', x = subs$x[1])]
  }
}

这些改进不需要创建中间子集和 rbind 步骤。最终结果:

These improvements remove the need for creating the intermediate subset and rbind steps. The final result:

> DT2
                  time  a  b cat1   x   MatchType cat2 rn
1: 2016-01-01 00:00:01  1 11    a 201  First Loop    a  1
2: 2016-01-01 00:00:10 19 29    x 220  First Loop    b  2
3: 2016-01-01 00:00:10 10 20    b 211 Second Loop    a  3

> DT1
                   time  a  b cat0 Flag   x rn
 1: 2016-01-01 00:00:01  1 11    a    1 201  1
 2: 2016-01-01 00:00:01  1 11    a    0 202  2
 3: 2016-01-01 00:00:02  2 12    a    0 203  3
 4: 2016-01-01 00:00:02  3 13    a    0 204  4
 5: 2016-01-01 00:00:03  4 14    a    0 205  5
 6: 2016-01-01 00:00:03  5 15    a    0 206  6
 7: 2016-01-01 00:00:04  6 16    a    0 207  7
 8: 2016-01-01 00:00:04  7 17    a    0 208  8
 9: 2016-01-01 00:00:05  8 18    a    0 209  9
10: 2016-01-01 00:00:05  9 19    a    0 210 10
11: 2016-01-01 00:00:06 10 20    a    2 211 11
12: 2016-01-01 00:00:06 11 21    b    0 212 12
13: 2016-01-01 00:00:07 12 22    b    0 213 13
14: 2016-01-01 00:00:07 13 23    b    0 214 14
15: 2016-01-01 00:00:08 14 24    b    0 215 15
16: 2016-01-01 00:00:08 15 25    b    0 216 16
17: 2016-01-01 00:00:09 16 26    b    0 217 17
18: 2016-01-01 00:00:09 17 27    b    0 218 18
19: 2016-01-01 00:00:10 18 28    b    0 219 19
20: 2016-01-01 00:00:10 19 29    b    1 220 20

这篇关于重复子集化可以使此过程更快的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆