使用%chin%对自动索引的data.table的字符列进行子集化会提高速度吗? [英] Will using %chin% to subset character columns of an auto-indexed data.table ever improve speed?

查看:53
本文介绍了使用%chin%对自动索引的data.table的字符列进行子集化会提高速度吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

TLDR::使用使用自动索引的最新版本的 data.table 时,使用%chin%有什么好处?在字符列上将data.table子集化?

过去,当对字符向量进行子集设置时,使用 data.table 中的%chin%代替%in%显着提高速度.在 data.table 的较新版本中,子设置时会在非键列上自动创建二级索引.这些索引的创建和使用似乎使%chin%%in%之间的任何速度差异无关紧要.

展望未来,在任何情况下都可以使用%chin%来对数据进行子集化.表仍然可以提高速度,或者我可以只使用%in%未来?


更新:在


我当前正在运行data.table版本1.10.5,建于2018-03-17 07:30:06 UTC.

解决方案

为结束此处的操作,在此处提供答案:我已经尽力浏览了Github上的相关提交,但是肯定有可能我错过了一些细微的差别.如果有任何贡献者想澄清一下,我很乐意在这里进行更新.

PR#2494上的对话:更好的复合查询子集优化似乎支持一种理解,即在data.table调用环境中进行评估时,%chin%的执行方法已发生根本性的改变.

  1. 如果对同一列多次使用子集来对 data.table 进行子设置,则可以通过自动索引来显着提高性能,因此请坚持使用默认选项: options(datatable.auto.index = TRUE) options(datatable.use.index = TRUE).在这种情况下,使用%chin%不会导致任何性能改进.

  2. 如果性能至关重要,并且您知道您将只一次在给定列上进行子设置(因此不会受益于生成索引所花费的时间),使用 options(datatable.auto.index = FALSE)自动建立索引,可以略微加快结果的速度.如果要执行重复的子集,您仍然可以像往常一样手动创建索引和键,但是优化的负担将由用户承担,以适当地使用 setkey() setindex().关闭自动索引编制后,使用%chin%会比%in%快.

  3. 在测试 data.table 调用环境之外的字符向量中是否存在元素时,%chin%仍比%in%

TLDR: When using more recent versions of data.table that use auto-indexing, is there any benefit from using %chin% to subset a data.table on character columns?

In the past, using %chin% from data.table in place of %in% when subsetting on character vectors resulted in a significant speed-up. In newer versions of data.table, secondary indices are automatically created on non-key columns when sub-setting. The creation and usage of these indices appears to make any speed difference between %chin% and %in% irrelevant.

Going forward, are there any cases where using %chin% to subset a data.table will still improve speed, or can I just use %in% in the future?


Update: The conversation on PR#2494: Better subsetting optimization for compound queries seems to support an understanding that when evaluated in the data.table calling environment, the execution methods of %chin% have been fundamentally changed.

For cases where the column used to subset the table more than once, performance will be drastically increased by auto-indexing, but when only used a single time (and thus not benefiting from the time spent generating the index), turning auto-indexing off will sometimes give marginally faster results.

I'll leave this open for a couple days, but I may flesh this out into an answer for the sake of posterity.


Data Generation and Benchmarking

The data generated is made up of a randomly ordered combination of two unbalanced samples:

  • 10 million single character strings, 26 possible unique values
  • 1 million four character strings, 456,976 possible unique values

The intent here is to be representative of non-normal variables that are dominated by a few common values, but have numerous less common possibilities.

library(data.table)
library(microbenchmark)
set.seed(1234)

## Create a vector of 1 million 4 character strings
## with 456,976 possible unique values 
DiverseSize <- 1e6
Diverse <- paste0(sample(LETTERS,DiverseSize,replace = TRUE),
                  sample(letters,DiverseSize,replace = TRUE),
                  sample(letters,DiverseSize,replace = TRUE),
                  sample(letters,DiverseSize,replace = TRUE))

## Create a vector of 10 million single character strings
## with 26 possible unique values
CommonSize  <- 1e7
Common <-  sample(LETTERS,CommonSize,replace = TRUE)

## Mix them into a data.table column, "x"
DT1 <- data.table(x = sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE))
## Make a deep copy to run independent comparisons
DT2 <- copy(DT1)

Comparing %in% and %chin%

When executing outside of the data.table environment, we still get a significant speed-up by using %chin%.

microbenchmark(
  Outside_chin = length(which(DT1[["x"]] %chin% c("Matt"))),
  Outside_in   = length(which(DT2[["x"]] %in% c("Matt"))),
  times = 1
)

...

Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
 Outside_chin 254.5967 254.5967 254.5967 254.5967 254.5967 254.5967     1
   Outside_in 476.2117 476.2117 476.2117 476.2117 476.2117 476.2117     1

Comparing %in% and %chin%with auto-indexing

## Benchmarking -------
## Turn off Indices
options(datatable.auto.index = FALSE)
options(datatable.use.index = FALSE)

## Run without indices
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]

## Run Again
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]

options(datatable.auto.index = TRUE)
options(datatable.use.index = TRUE)

## First run builds indices and takes longer
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]

## Run again, benefiting from pre-built indices
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]

Profiling Results

When using ProfVis to analyze the run-time of each expression, the following is apparent:

  1. Without auto-indexing, %chin% is faster and run-time is similar the first and second time.
  2. With auto-indexing, run-time is basically the same whether %chin% or %in% is used.
  3. While the auto-indexing makes the first execution slightly slower, the second call evaluates so fast profvis can't even catch it consistently with a 10 ms sample interval


I am currently running data.table version 1.10.5, built 2018-03-17 07:30:06 UTC.

解决方案

Providing an answer here for the sake of closure: I've done my best to skim through the relevant commits on Github, but there's certainly a possibility I'm missing some of the finer nuances. If any of the contributors would like to make some clarifications, I'd be happy to update here.

The conversation on PR#2494: Better subsetting optimization for compound queries seems to support an understanding that when evaluated in the data.table calling environment, the execution methods of %chin% have been fundamentally changed.

  1. When sub-setting a data.table in cases where the same column will be used to subset the table multiple times, performance can be drastically increased by auto-indexing, so stick with the default options: options(datatable.auto.index = TRUE) and options(datatable.use.index = TRUE). In this case, using %chin% will not result in any performance improvement.

  2. If performance is critical and you know that you will only be sub-setting on a given column a single time (and thus not benefiting from the time spent generating the index), turning auto-indexing off with options(datatable.auto.index = FALSE)and can give marginally faster results. You can still manually create indexes and keys as always if you will be performing repetitive subsets, but the burden of optimization will rest on the user to make appropriate use of setkey() and setindex(). When auto-indexing is turned off, using %chin% will be faster than %in%.

  3. When testing whether an element is present in a character vector outside of the data.table calling environment , %chin% is still faster than %in%

这篇关于使用%chin%对自动索引的data.table的字符列进行子集化会提高速度吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆