使用data.table中的多个键来获得条件搜索 [英] Usage of multiple keys in data.table to obtain a conditioned search

查看:89
本文介绍了使用data.table中的多个键来获得条件搜索的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我首先有一个data.table,我想根据某些条件获得一个子集,例如,我有

I have a data.table first and I want to obtain a subset based on certain conditions, for example, I have

library(data.table)
dt <- data.table(rn=1:10, B=rep(1:2, 5))
dt
#    rn B
# 1:  1 1
# 2:  2 2
# 3:  3 1
# 4:  4 2
# 5:  5 1
# 6:  6 2
# 7:  7 1
# 8:  8 2
# 9:  9 1
#10: 10 2`

我知道第一列的名称,但是我事先不知道第二列的名称,而是存储在字符向量中: nameAsVect <- "B"

I know the first column name but I don't know in advance the name of the second column, which is instead stored in a character vector: nameAsVect <- "B"

假设我要获得以下内容:

Let's say that I want to obtain the following:

dt[rn>5 & B==2, ]
#   rn B
#1:  6 2
#2:  8 2
#3: 10 2`

我认为我可以做到:

setkeyv(dt, c("rn", nameAsVect))
max.count <- max(dt[, nameAsVect, with=FALSE])
dt[J(5:max(rn), max.count), ]
#   rn B
#1:  5 2
#2:  6 2
#3:  7 2
#4:  8 2
#5:  9 2
#6: 10 2

但是我不明白为什么要在列rn中包含值5、7和9.我可以通过以下方式获得所需的内容:dt[rn>=5 & get(nameAsVect) == max.count] 但是我认为,如果可行,第一种方法在使用大表时会更快.

But I can't understand why the values 5, 7, and 9 in column rn are included. I can obtain what I want with: dt[rn>=5 & get(nameAsVect) == max.count] but I thought that the first approach, if it worked, would have been faster with large tables.

有见识吗?

谢谢

推荐答案

OP方法有很多替代方案,不需要事先设置密钥

There are alternatives to OP's approach which do not require to set keys beforehand

dt[rn >= 5 & get(nameAsVect) == max(get(nameAsVect))]

   rn B
1:  6 2
2:  8 2
3: 10 2

矢量扫描和eval(parse())

Matt Dowle在他对的回答中建议的另一种方法选择/分配给data.table变量,其名称存储在字符向量 :

Vector scan & eval(parse())

Another approach suggested by Matt Dowle in his answer to Select / assign to data.table variables which names are stored in a character vector:

eval(parse(text = sprintf("dt[rn >= 5 & %s == max(%s)]", nameAsVect, nameAsVect)))

   rn B
1:  6 2
2:  8 2
3: 10 2

不参加比赛

使用版本v1.9.8(于2016年11月25日在CRAN上发布),data.table已具有执行非等额联接的功能.

Non-equi join

With version v1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins.

max.count <- dt[, max(get(nameAsVect))] 
dt[dt[.(5, max.count), on = c("rn>=V1", paste0(nameAsVect, "==V2")), which = TRUE]]

   rn B
1:  6 2
2:  8 2
3: 10 2

或(我的首选方式)

mdt <- dt[, c(.(rn = 5), lapply(.SD, max)), .SDcols = nameAsVect] 
dt[dt[mdt, on = c("rn>=rn", nameAsVect), which = TRUE]]

   rn B
1:  6 2
2:  8 2
3: 10 2

基准

创建基准数据:

Benchmark

Create benchmark data:

n_row <- 1e6L
set.seed(123L)
DT <- data.table(
  rn = sample(1:10, n_row, TRUE),
  B  = sample(1:2,  n_row, TRUE)
)

运行基准测试

library(microbenchmark)
bm <- microbenchmark(
  vec_scan_hard_coded = {
    dt <- copy(DT)
    dt[rn >= 5L & B == 2L]
  },
  OP_keyed = {
    dt <- copy(DT)
    setkeyv(dt, c("rn", nameAsVect))
    max.count <- max(dt[, nameAsVect, with=FALSE])
    dt[J(5:max(rn), max.count), nomatch = 0L]
  },
  vec_scan_get = {
    dt <- copy(DT)
    dt[rn >= 5 & get(nameAsVect) == max(get(nameAsVect))]
  },
  vec_scan_eval_parse = {
    dt <- copy(DT)
    eval(parse(text = sprintf("dt[rn >= 5 & %s == max(%s)]", nameAsVect, nameAsVect)))
  },
  nej1 = {
    dt <- copy(DT)
    max.count <- dt[, max(get(nameAsVect))] 
    dt[dt[.(5, max.count), on = c("rn>=V1", paste0(nameAsVect, "==V2")), which = TRUE]]
  },
  nej1_keyed = {
    dt <- copy(DT)
    setkeyv(dt, c("rn", nameAsVect))
    max.count <- dt[, max(get(nameAsVect))] 
    dt[dt[.(5, max.count), on = c("rn>=V1", paste0(nameAsVect, "==V2")), which = TRUE]]
  },
  nej2 = {
    dt <- copy(DT)
    mdt <- dt[, c(.(rn = 5), lapply(.SD, max)), .SDcols = nameAsVect] 
    dt[dt[mdt, on = c("rn>=rn", nameAsVect), which = TRUE]]
  },
  nej2_keyed = {
    dt <- copy(DT)
    setkeyv(dt, c("rn", nameAsVect))
    mdt <- dt[, c(.(rn = 5), lapply(.SD, max)), .SDcols = nameAsVect] 
    dt[dt[mdt, on = c("rn>=rn", nameAsVect), which = TRUE]]
  },
  times = 100L
)
print(bm)

对于100万行和大约30万行的结果集,矢量扫描方法是最快的:

For 1 M rows and a result set which is approximately 300 k rows, the vector scan approaches are the fastest:

Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval cld
 vec_scan_hard_coded 19.03159 20.86890 42.70820 24.38040 27.57417 219.5682   100  a 
            OP_keyed 31.49025 34.50825 52.46168 37.74204 40.84953 194.7676   100  a 
        vec_scan_get 20.60384 25.75461 46.37579 27.29287 29.55892 185.5867   100  a 
 vec_scan_eval_parse 20.81188 23.92598 36.81940 26.69742 29.27687 183.5323   100  a 
                nej1 53.85361 59.32608 85.32623 62.12509 65.15083 227.1221   100   b
          nej1_keyed 52.89946 58.37457 77.38969 61.03312 64.32072 221.3292   100   b
                nej2 53.25590 59.69762 88.92513 61.98481 65.05738 285.2495   100   b
          nej2_keyed 53.25061 58.61453 81.22925 61.14885 63.56159 274.0207   100   b

这篇关于使用data.table中的多个键来获得条件搜索的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆