使用data.table中的多个键来获得条件搜索 [英] Usage of multiple keys in data.table to obtain a conditioned search
问题描述
我首先有一个data.table
,我想根据某些条件获得一个子集,例如,我有
I have a data.table
first and I want to obtain a subset based on certain conditions, for example, I have
library(data.table)
dt <- data.table(rn=1:10, B=rep(1:2, 5))
dt
# rn B
# 1: 1 1
# 2: 2 2
# 3: 3 1
# 4: 4 2
# 5: 5 1
# 6: 6 2
# 7: 7 1
# 8: 8 2
# 9: 9 1
#10: 10 2`
我知道第一列的名称,但是我事先不知道第二列的名称,而是存储在字符向量中:
nameAsVect <- "B"
I know the first column name but I don't know in advance the name of the second column, which is instead stored in a character vector:
nameAsVect <- "B"
假设我要获得以下内容:
Let's say that I want to obtain the following:
dt[rn>5 & B==2, ]
# rn B
#1: 6 2
#2: 8 2
#3: 10 2`
我认为我可以做到:
setkeyv(dt, c("rn", nameAsVect))
max.count <- max(dt[, nameAsVect, with=FALSE])
dt[J(5:max(rn), max.count), ]
# rn B
#1: 5 2
#2: 6 2
#3: 7 2
#4: 8 2
#5: 9 2
#6: 10 2
但是我不明白为什么要在列rn
中包含值5、7和9.我可以通过以下方式获得所需的内容:dt[rn>=5 & get(nameAsVect) == max.count]
但是我认为,如果可行,第一种方法在使用大表时会更快.
But I can't understand why the values 5, 7, and 9 in column rn
are included. I can obtain what I want with: dt[rn>=5 & get(nameAsVect) == max.count]
but I thought that the first approach, if it worked, would have been faster with large tables.
有见识吗?
谢谢
推荐答案
OP方法有很多替代方案,不需要事先设置密钥
There are alternatives to OP's approach which do not require to set keys beforehand
dt[rn >= 5 & get(nameAsVect) == max(get(nameAsVect))]
rn B
1: 6 2
2: 8 2
3: 10 2
矢量扫描和eval(parse())
Matt Dowle在他对的回答中建议的另一种方法选择/分配给data.table变量,其名称存储在字符向量 :
Vector scan & eval(parse())
Another approach suggested by Matt Dowle in his answer to Select / assign to data.table variables which names are stored in a character vector:
eval(parse(text = sprintf("dt[rn >= 5 & %s == max(%s)]", nameAsVect, nameAsVect)))
rn B
1: 6 2
2: 8 2
3: 10 2
不参加比赛
使用版本v1.9.8(于2016年11月25日在CRAN上发布),data.table
已具有执行非等额联接的功能.
Non-equi join
With version v1.9.8 (on CRAN 25 Nov 2016), data.table
has gained the ability to do non-equi joins.
max.count <- dt[, max(get(nameAsVect))]
dt[dt[.(5, max.count), on = c("rn>=V1", paste0(nameAsVect, "==V2")), which = TRUE]]
rn B
1: 6 2
2: 8 2
3: 10 2
或(我的首选方式)
mdt <- dt[, c(.(rn = 5), lapply(.SD, max)), .SDcols = nameAsVect]
dt[dt[mdt, on = c("rn>=rn", nameAsVect), which = TRUE]]
rn B
1: 6 2
2: 8 2
3: 10 2
基准
创建基准数据:
Benchmark
Create benchmark data:
n_row <- 1e6L
set.seed(123L)
DT <- data.table(
rn = sample(1:10, n_row, TRUE),
B = sample(1:2, n_row, TRUE)
)
运行基准测试
library(microbenchmark)
bm <- microbenchmark(
vec_scan_hard_coded = {
dt <- copy(DT)
dt[rn >= 5L & B == 2L]
},
OP_keyed = {
dt <- copy(DT)
setkeyv(dt, c("rn", nameAsVect))
max.count <- max(dt[, nameAsVect, with=FALSE])
dt[J(5:max(rn), max.count), nomatch = 0L]
},
vec_scan_get = {
dt <- copy(DT)
dt[rn >= 5 & get(nameAsVect) == max(get(nameAsVect))]
},
vec_scan_eval_parse = {
dt <- copy(DT)
eval(parse(text = sprintf("dt[rn >= 5 & %s == max(%s)]", nameAsVect, nameAsVect)))
},
nej1 = {
dt <- copy(DT)
max.count <- dt[, max(get(nameAsVect))]
dt[dt[.(5, max.count), on = c("rn>=V1", paste0(nameAsVect, "==V2")), which = TRUE]]
},
nej1_keyed = {
dt <- copy(DT)
setkeyv(dt, c("rn", nameAsVect))
max.count <- dt[, max(get(nameAsVect))]
dt[dt[.(5, max.count), on = c("rn>=V1", paste0(nameAsVect, "==V2")), which = TRUE]]
},
nej2 = {
dt <- copy(DT)
mdt <- dt[, c(.(rn = 5), lapply(.SD, max)), .SDcols = nameAsVect]
dt[dt[mdt, on = c("rn>=rn", nameAsVect), which = TRUE]]
},
nej2_keyed = {
dt <- copy(DT)
setkeyv(dt, c("rn", nameAsVect))
mdt <- dt[, c(.(rn = 5), lapply(.SD, max)), .SDcols = nameAsVect]
dt[dt[mdt, on = c("rn>=rn", nameAsVect), which = TRUE]]
},
times = 100L
)
print(bm)
对于100万行和大约30万行的结果集,矢量扫描方法是最快的:
For 1 M rows and a result set which is approximately 300 k rows, the vector scan approaches are the fastest:
Unit: milliseconds
expr min lq mean median uq max neval cld
vec_scan_hard_coded 19.03159 20.86890 42.70820 24.38040 27.57417 219.5682 100 a
OP_keyed 31.49025 34.50825 52.46168 37.74204 40.84953 194.7676 100 a
vec_scan_get 20.60384 25.75461 46.37579 27.29287 29.55892 185.5867 100 a
vec_scan_eval_parse 20.81188 23.92598 36.81940 26.69742 29.27687 183.5323 100 a
nej1 53.85361 59.32608 85.32623 62.12509 65.15083 227.1221 100 b
nej1_keyed 52.89946 58.37457 77.38969 61.03312 64.32072 221.3292 100 b
nej2 53.25590 59.69762 88.92513 61.98481 65.05738 285.2495 100 b
nej2_keyed 53.25061 58.61453 81.22925 61.14885 63.56159 274.0207 100 b
这篇关于使用data.table中的多个键来获得条件搜索的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!