Combine set of conditions in data.table to extract value using binary search
Question
Because I executed and explained my previous question poorly, I'll start over and try to formulate the question as briefly and generally as possible.
I have two data frames (see the examples below). Each dataset contains the same number of columns.
tc <- textConnection('
ID Track1 Track2 Track3 Track4 Time Loc
4 15 "" "" 50 40 1
5 17 115 109 55 50 1
6 17 115 109 55 60 1
7 13 195 150 60 70 1
8 13 195 150 60 80 1
9 "" "" 181 70 90 2 #From this row, example data added
10 "" "" 182 70 92 2
11 429 31 "" 80 95 3
12 480 31 12 80 96 3
13 118 "" "" 90 100 4
14 120 16 213 90 101 4
')
MATCHINGS <- read.table(tc, header=TRUE)
tc <- textConnection('
ID Track1 Track2 Track3 Track4 Time Loc
"" 15 "" "" 50 40 1
"" 17 "" 109 55 50 1
"" 17 432 109 55 65 1
"" 17 115 109 55 59 1
"" 13 195 150 60 68 1
"" 13 195 150 60 62 1
"" 10 5 1 10 61 3
"" 13 195 150 60 72 1
"" 40 "" 181 70 82 2 #From this row, example data added
"" "" "" 182 70 85 2
"" 429 "" "" 80 90 3
"" "" 31 12 80 92 3
"" "" "" "" 90 95 4
"" 118 16 213 90 96 4
')
INVOLVED <- read.table(tc, header=TRUE)
The goal is to place the least recent ID from MATCHINGS into INVOLVED by matching on Track1 to Track4 and Loc. An extra condition is that the Time of the matching INVOLVED entry may not be higher than the Time of the entry in MATCHINGS. Furthermore, a match on Track1 is most preferred and a match on Track4 is least preferred. However, only Track4 is always available (all other Track columns can be empty). Thus the expected results are:
ID Track1 Track2 Track3 Track4 Time Loc
4 15 "" "" 50 40 1
5 17 "" 109 55 50 1
"" 17 432 109 55 65 1
6 17 115 109 55 59 1
7 13 195 150 60 68 1
7 13 195 150 60 62 1
"" 10 5 1 10 61 3
8 13 195 150 60 72 1
9 40 "" 181 70 82 2 #From this row, example data added
10 "" "" 182 70 85 2
11 429 "" "" 80 90 3
12 "" 31 12 80 92 3
13 "" "" "" 90 95 4
13 118 16 213 90 96 4
I tried to do this with the data.table package, but failed to do so efficiently. Is it possible to get rid of the vector scans and go through the data efficiently without looping?
library(data.table)
dat <- data.table(MATCHINGS)
for(i in 1:nrow(INVOLVED)){
row <- INVOLVED[i,]
match <- dat[Time>=row$Time][Loc==row$Loc][Track4==row$Track4][Track4!=""][order(Time)][1]
if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
match <- dat[Time>=row$Time][Loc==row$Loc][Track3==row$Track3][Track3!=""][order(Time)][1]
if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
match <- dat[Time>=row$Time][Loc==row$Loc][Track2==row$Track2][Track2!=""][order(Time)][1]
if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
match <- dat[Time>=row$Time][Loc==row$Loc][Track1==row$Track1][Track1!=""][order(Time)][1]
if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
}
Update
I updated the example data to show the need for Track1 to Track3. As shown, Track1 is most important and Track4 least important. Even if Track1 to 3 match MATCHINGS x and Track4 matches MATCHINGS y, the ID of y should be assigned to that INVOLVED row. So: a Track3 match overrides a Track4 match, a Track2 match overrides a Track3 match, and a Track1 match overrides a Track2 match.
Recommended answer
Now that the roll argument can also roll the next observation backward, and with the new on= argument (v1.9.6+), we can do this in a much more straightforward way:
require(data.table)
setDT(MATCHINGS)
setDT(INVOLVED)
INVOLVED[ , ID := MATCHINGS[INVOLVED, ID, roll=-Inf,
mult="first", on=c("Loc", "Track4", "Time")]]
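To make the effect of roll=-Inf easier to see, here is a minimal sketch on toy data. The tables M and X and the column grp are made up for illustration and are not the question's data. Each row of X receives the ID of the next later (or equal) Time in M within the same group, and NA when no such row exists.

```r
library(data.table)

# Hypothetical toy data, not the question's tables.
M <- data.table(grp = c("a", "a", "b"), Time = c(10, 20, 15), ID = 1:3)
X <- data.table(grp = c("a", "a", "a", "b"), Time = c(5, 10, 25, 15.5))

# roll = -Inf: when X's Time has no exact match in M, take the next later
# M row within the same grp (next observation carried backward).
X[, ID := M[X, ID, roll = -Inf, on = c("grp", "Time")]]
print(X)
```

Rows whose Time lies after the last M entry of their group stay NA, which corresponds to the rows left without an ID in the expected results.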
Here's a data.table-ish start. This only uses Track 4 (not 1 to 3), but it still appears to produce the requested output.
M = as.data.table(MATCHINGS)
I = as.data.table(INVOLVED)
M[,Time:=-Time]
I[,Time:=-Time]
setkey(M,Loc,Track4,Time)
I[,ID:={i=list(Loc,Track4,Time);M[i,ID,roll=TRUE,mult="first"]}][,Time:=-Time]
ID Track1 Track2 Track3 Track4 Time Loc
1: 1 NA 105 NA 35 1 1
2: 1 NA NA NA 35 2 1
3: 1 26 105 NA 35 3 1
4: 2 NA NA NA 40 20 1
5: 2 134 1 6 40 20 1
6: 3 13 109 NA 45 30 1
7: 4 15 NA NA 50 40 1
8: 5 17 NA 109 55 50 1
9: NA 17 432 109 55 65 1
10: 6 17 115 109 55 59 1
11: 7 13 195 150 60 68 1
12: 7 13 195 150 60 62 1
13: NA 10 5 1 10 61 3
14: 8 13 195 150 60 72 1
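The code above negates Time so that an ordinary forward roll (roll=TRUE) behaves like a backward roll. As a small check of that idea, the sketch below compares the two variants on toy data. The tables M and X and the column grp are hypothetical, and the comparison only assumes the rolling-join behaviour documented for data.table.

```r
library(data.table)

# Hypothetical toy data, not the question's tables.
M <- data.table(grp = c(1L, 1L, 1L), Time = c(10L, 20L, 30L), ID = 1:3)
X <- data.table(grp = c(1L, 1L, 1L), Time = c(5L, 20L, 35L))

# Variant 1: next observation carried backward on the original Time.
nocb <- M[X, ID, roll = -Inf, on = c("grp", "Time")]

# Variant 2: negate Time in both tables, key M, then roll forward.
Mneg <- copy(M)[, Time := -Time]
Xneg <- copy(X)[, Time := -Time]
setkey(Mneg, grp, Time)
locf <- Mneg[Xneg, ID, roll = TRUE]

# Both variants should pick the same ID for every row of X.
print(identical(nocb, locf))
```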
Interesting question! If this seems ok, please change the example data to need tracks 1 to 3. Or perhaps you can take it from here.
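One possible way to take it from there is to repeat the rolling join once per Track column, from least to most preferred, so that a later match overwrites an earlier one. This mirrors the override order of the loop in the question. It is only a sketch: the helper fill_ids and the toy tables M and I are made up for illustration (the rows are modelled on the Loc 4 rows of the example data), and it has not been checked against the full expected table.

```r
library(data.table)

# Hypothetical helper, not from the answers above: one rolling join per
# Track column, least preferred first, so later matches overwrite earlier ones.
fill_ids <- function(involved, matchings,
                     tracks = c("Track4", "Track3", "Track2", "Track1")) {
  involved <- copy(as.data.table(involved))
  matchings <- as.data.table(matchings)
  id <- rep(NA_integer_, nrow(involved))
  for (tr in tracks) {
    # Drop rows where this track is missing, so that NA never joins to NA.
    m <- matchings[!is.na(matchings[[tr]])]
    rows <- which(!is.na(involved[[tr]]))
    if (length(rows) == 0L || nrow(m) == 0L) next
    found <- m[involved[rows], ID, roll = -Inf, mult = "first",
               on = c("Loc", tr, "Time")]
    hit <- !is.na(found)
    id[rows[hit]] <- found[hit]
  }
  involved[, ID := id]
  involved[]
}

# Toy data modelled on the Loc 4 rows of the example (assumed, simplified).
M <- data.table(ID = c(13L, 14L), Track1 = c(118L, 120L),
                Track2 = c(NA, 16L), Track3 = c(NA, 213L),
                Track4 = c(90L, 90L), Time = c(100L, 101L), Loc = c(4L, 4L))
I <- data.table(Track1 = c(NA, 118L, 10L), Track2 = c(NA, 16L, 5L),
                Track3 = c(NA, 213L, 1L), Track4 = c(90L, 90L, 10L),
                Time = c(95L, 96L, 61L), Loc = c(4L, 4L, 3L))

res <- fill_ids(I, M)
print(res)
```

In this toy case the second row matches ID 14 on Track2 and Track3, but the Track1 match on ID 13 is applied last and therefore wins. The third row matches nothing and keeps NA.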