Combine set of conditions in data.table to extract value using binary search


Problem description

Because I explained my previous question so poorly, I'll start over and try to formulate the question as briefly and generally as possible.

I have two data frames (see the examples below). Both datasets contain the same columns.

tc <- textConnection('
ID  Track1  Track2  Track3  Track4  Time    Loc
4   15      ""      ""      50      40      1   
5   17      115     109     55      50      1   
6   17      115     109     55      60      1   
7   13      195     150     60      70      1
8   13      195     150     60      80      1
9   ""      ""      181     70      90      2 #From this row, example data added
10  ""      ""      182     70      92      2
11  429     31      ""      80      95      3
12  480     31      12      80      96      3 
13  118     ""      ""      90      100     4
14  120     16      213     90      101     4   
')

MATCHINGS <- read.table(tc, header=TRUE)

tc <- textConnection('
ID  Track1  Track2  Track3  Track4  Time    Loc
""  15      ""      ""      50      40      1   
""  17      ""     109      55      50      1
""  17      432    109      55      65      1   
""  17      115     109     55      59      1       
""  13      195     150     60      68      1
""  13      195     150     60      62      1
""  10      5       1       10      61      3
""  13      195     150     60      72      1
""  40      ""      181     70      82      2 #From this row, example data added
""  ""      ""      182     70      85      2
""  429     ""      ""      80      90      3
""  ""      31      12      80      92      3
""  ""      ""      ""      90      95      4
""  118     16      213     90      96      4
')

INVOLVED <- read.table(tc, header=TRUE)

The goal is to copy the least recent (earliest) ID from MATCHINGS into INVOLVED by matching on Track1 to Track4 and on Loc. An extra condition is that the Time of the INVOLVED entry may not be higher than the Time of the matched entry in MATCHINGS. Furthermore, a match on Track1 is most preferred and a match on Track4 is least preferred. However, only Track4 is always available (all other Track columns can be empty). Thus the expected results are:

ID Track1 Track2 Track3 Track4 Time Loc
4     15     ""     ""     50   40   1
5     17     ""    109     55   50   1
""    17    432    109     55   65   1
6     17    115    109     55   59   1
7     13    195    150     60   68   1
7     13    195    150     60   62   1
""    10      5      1     10   61   3
8     13    195    150     60   72   1
9     40     ""    181     70   82   2 #From this row, example data added
10    ""     ""    182     70   85   2
11    429    ""     ""     80   90   3
12    ""     31     12     80   92   3
13    ""     ""     ""     90   95   4 
13    118    16    213     90   96   4

I tried to do this with the data.table package, but failed to do it efficiently. Is it possible to get rid of the vector scans and go through the data efficiently without looping?

library(data.table)
dat <- data.table(MATCHINGS)
for(i in 1:nrow(INVOLVED)){
    row <- INVOLVED[i,]
    match <- dat[Time>=row$Time][Loc==row$Loc][Track4==row$Track4][Track4!=""][order(Time)][1]
    if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
    match <- dat[Time>=row$Time][Loc==row$Loc][Track3==row$Track3][Track3!=""][order(Time)][1]
    if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
    match <- dat[Time>=row$Time][Loc==row$Loc][Track2==row$Track2][Track2!=""][order(Time)][1]
    if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
    match <- dat[Time>=row$Time][Loc==row$Loc][Track1==row$Track1][Track1!=""][order(Time)][1]
    if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
}
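To illustrate what "getting rid of the vector scans" could look like for a single lookup, here is a minimal sketch. It assumes a toy table `dat` holding a few rows copied from the MATCHINGS example; the names `scan_hit` and `key_hit` are made up for illustration. Note that recent data.table versions may optimise simple `==` subsets automatically, so the difference is mostly visible on large tables.

```r
library(data.table)

# Toy stand-in for MATCHINGS (values copied from the example above)
dat <- data.table(
  ID     = c(5L, 6L, 7L, 8L),
  Track4 = c(55L, 55L, 60L, 60L),
  Time   = c(50L, 60L, 70L, 80L),
  Loc    = c(1L, 1L, 1L, 1L)
)

# Vector scan style: the condition is evaluated against the whole column
scan_hit <- dat[Loc == 1L & Track4 == 60L]

# Binary search style: sort once by key, then look rows up by key value
setkey(dat, Loc, Track4)
key_hit <- dat[.(1L, 60L)]

print(key_hit)
```

Both subsets return the rows with ID 7 and 8; the keyed form pays the sorting cost once and then avoids scanning the table on every lookup.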

Update

I updated the example data to show the need for Track1 to Track3. As shown, Track1 is most important and Track4 least important. Even if Track1 to Track3 match MATCHINGS row x and Track4 matches MATCHINGS row y, the ID of x should be assigned to that INVOLVED row. So: a Track3 match overrides a Track4 match, a Track2 match overrides a Track3 match, and a Track1 match overrides a Track2 match.

Recommended answer

Now that the roll argument can also roll the next observation backward, and with the new (v1.9.6+) on= argument, we can do this in a much more straightforward way:

require(data.table)
setDT(MATCHINGS)
setDT(INVOLVED)
# Rolling join on Time: roll=-Inf carries the next MATCHINGS observation backward
INVOLVED[ , ID := MATCHINGS[INVOLVED, ID, roll=-Inf, 
                    mult="first", on=c("Loc", "Track4", "Time")]]
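The join above matches on Track4 only. One possible way to honour the Track1 to Track4 priority from the question is to repeat the rolling join once per track, from least to most preferred, and let each later match overwrite the earlier one. The following is a rough sketch, not a definitive implementation: `roll_match` is a hypothetical helper (not part of data.table), `MAT` and `INV` are toy tables holding a subset of the example rows, and it assumes empty track values are read as NA and should never match each other.

```r
library(data.table)

# Toy subsets of MATCHINGS / INVOLVED (rows taken from the example data)
MAT <- data.table(
  ID     = c(9L, 10L, 11L),
  Track1 = c(NA, NA, 429L),
  Track2 = c(NA, NA, 31L),
  Track3 = c(181L, 182L, NA),
  Track4 = c(70L, 70L, 80L),
  Time   = c(90L, 92L, 95L),
  Loc    = c(2L, 2L, 3L)
)
INV <- data.table(
  ID     = rep(NA_integer_, 3),
  Track1 = c(40L, NA, 429L),
  Track2 = rep(NA_integer_, 3),
  Track3 = c(181L, 182L, NA),
  Track4 = c(70L, 70L, 80L),
  Time   = c(82L, 85L, 90L),
  Loc    = c(2L, 2L, 3L)
)

# Hypothetical helper: for each row of `inv`, return the ID of the earliest
# row of `mat` with the same Loc, the same value in column `track`, and
# mat$Time >= inv$Time (NA when there is no such row).
roll_match <- function(inv, mat, track) {
  # Drop rows where this track is empty so that NA never matches NA
  m <- mat[!is.na(mat[[track]])]
  m[inv, ID, roll = -Inf, mult = "first", on = c("Loc", track, "Time")]
}

# Least preferred first; each later (more preferred) match overwrites the ID
for (tr in c("Track4", "Track3", "Track2", "Track1")) {
  hit <- roll_match(INV, MAT, tr)
  INV[, ID := ifelse(is.na(hit), ID, hit)]
}

print(INV)
```

On these rows the second INVOLVED entry first receives ID 9 from the Track4 pass and is then overwritten with ID 10 by the Track3 pass, which is the override behaviour described in the question's update.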


Here's a data.table-ish start. This only uses Track4 (not Track1 to Track3), but it still appears to produce the requested output.

M = as.data.table(MATCHINGS)
I = as.data.table(INVOLVED)
M[,Time:=-Time]
I[,Time:=-Time]
setkey(M,Loc,Track4,Time)
I[,ID:={i=list(Loc,Track4,Time);M[i,ID,roll=TRUE,mult="first"]}][,Time:=-Time]

    ID Track1 Track2 Track3 Track4 Time Loc
 1:  1     NA    105     NA     35    1   1
 2:  1     NA     NA     NA     35    2   1
 3:  1     26    105     NA     35    3   1
 4:  2     NA     NA     NA     40   20   1
 5:  2    134      1      6     40   20   1
 6:  3     13    109     NA     45   30   1
 7:  4     15     NA     NA     50   40   1
 8:  5     17     NA    109     55   50   1
 9: NA     17    432    109     55   65   1
10:  6     17    115    109     55   59   1
11:  7     13    195    150     60   68   1
12:  7     13    195    150     60   62   1
13: NA     10      5      1     10   61   3
14:  8     13    195    150     60   72   1
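The negated-Time trick used above and the roll=-Inf form are two ways of expressing the same idea: pick the next observation at or after a given time. A small sketch on made-up data (the tables `mat` and `inv` and their values are purely illustrative) suggests they give the same result:

```r
library(data.table)

mat <- data.table(ID = 1:3, Time = c(10L, 20L, 30L))
inv <- data.table(Time = c(5L, 20L, 25L, 35L))

# Newer style: roll = -Inf carries the next observation backward
nocb <- mat[inv, ID, roll = -Inf, on = "Time"]

# Older style: negate Time on both sides, key the table, then roll forward
mat_neg <- copy(mat)[, Time := -Time]
inv_neg <- copy(inv)[, Time := -Time]
setkey(mat_neg, Time)
locf <- mat_neg[inv_neg, ID, roll = TRUE]

print(nocb)
print(locf)
```

In both cases a query time later than every observation (35 here) finds nothing to roll from and yields NA.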

Interesting question! If this seems OK, please change the example data so that Track1 to Track3 are needed. Or perhaps you can take it from here.
