筛选数据表时,链接比ANDing的性能优势 [英] Performance benefits of chaining over ANDing when filtering a data table

查看:75
本文介绍了筛选数据表时,链接比ANDing的性能优势的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我习惯于将相似的任务集中在一行中。例如,如果我需要过滤 a b c 在数据表中,我将它们与ANDs放在一个 [] 中。昨天,我注意到在我的特定情况下,它的运行速度非常慢,并且经过了测试,却没有测试链接过滤器。我在下面提供了一个示例。

I'm in the habit of clumping similar tasks together into a single line. For example, if I need to filter on a, b, and c in a data table, I'll put them together in one[] with ANDs. Yesterday, I noticed that in my particular case this was incredibly slow and tested chaining filters instead. I've included an example below.

首先,我给随机数生成器添加种子,加载,并创建一个虚拟数据集。

First, I seed the random number generator, load data.table, and create a dummy data set.

# Set RNG seed
set.seed(-1)

# Load libraries
library(data.table)

# Create data table
dt <- data.table(a = sample(1:1000, 1e7, replace = TRUE),
                 b = sample(1:1000, 1e7, replace = TRUE),
                 c = sample(1:1000, 1e7, replace = TRUE),
                 d = runif(1e7))

接下来,我定义我的方法。第一种方法将过滤器链接在一起。第二个将过滤器与在一起。

Next, I define my methods. The first approach chains filters together. The second ANDs the filters together.

# Chaining method
chain_filter <- function(){
  dt[a %between% c(1, 10)
     ][b %between% c(100, 110)
       ][c %between% c(750, 760)]
}

# Anding method
and_filter <- function(){
  dt[a %between% c(1, 10) & b %between% c(100, 110) & c %between% c(750, 760)]
}

在这里,我检查它们给出的结果是否相同。

Here, I check they give the same results.

# Check both give same result
identical(chain_filter(), and_filter())
#> [1] TRUE

最后,我对其进行基准测试。

Finally, I benchmark them.

# Benchmark
microbenchmark::microbenchmark(chain_filter(), and_filter())
#> Unit: milliseconds
#>            expr      min        lq      mean    median        uq       max
#>  chain_filter() 25.17734  31.24489  39.44092  37.53919  43.51588  78.12492
#>    and_filter() 92.66411 112.06136 130.92834 127.64009 149.17320 206.61777
#>  neval cld
#>    100  a 
#>    100   b

reprex程序包(v0.3.0)

Created on 2019-10-25 by the reprex package (v0.3.0)

在这种情况下,链接可将运行时间减少约70%。为什么会这样呢?我的意思是,数据表的底层是什么?我还没有看到任何关于使用& 的警告,所以令我惊讶的是两者之间的差异如此之大。在这两种情况下,他们都评估相同的条件,因此不应该有所不同。在AND的情况下,& 是快速运算符,然后它只需要过滤一次数据表(即,使用AND产生的逻辑向量),相反

In this case, chaining reduces run time by about 70%. Why is this the case? I mean, what's going on under the hood in data table? I haven't seen any warnings against using &, so I was surprised that the difference is so big. In both cases they evaluate the same conditions, so that shouldn't be a difference. In the AND case, & is a quick operator and then it only has to filter the data table once (i.e., using the logical vector resulting from the ANDs), as opposed to filtering three times in the chaining case.

通常,该原理是否适用于数据表操作?

Does this principle hold for data table operations in general? Is modularising tasks always a better strategy?

推荐答案

在大多数情况下,答案都是在有争议的评论中给出的:链接方法用于在这种情况下, data.table 比 anding方法更快,因为链接会依次运行条件。由于每一步都会减小 data.table 的大小,因此对下一个评估的内容就更少了。 Anding每次都会评估完整大小数据的条件。

Mostly, the answer was given in the comments aleady: the "chaining method" for data.table is faster in this case than the "anding method" as the chaining runs the conditions one after another. As each step reduces the size of the data.table there is less to evaluate for the next one. "Anding" evaluates the conditions for the full size data each time.

我们可以举一个例子来说明这一点:当各个步骤不减小<$的大小时c $ c> data.table (即两个方法要检查的条件都相同):

We can demonstrate this with an example: when the individual steps do NOT decrease the size of the data.table (i.e. the conditions to check are the same for both appraoches):

chain_filter <- function(){
  dt[a %between% c(1, 1000) # runs evaluation but does not filter out cases
     ][b %between% c(1, 1000)
       ][c %between% c(750, 760)]
}

# Anding method
and_filter <- function(){
  dt[a %between% c(1, 1000) & b %between% c(1, 1000) & c %between% c(750, 760)]
}

使用相同的数据,但 bench 软件包,该软件包会自动检查结果是否相同:

Using the same data but the bench package, which automatically checks if results are identical:

res <- bench::mark(
  chain = chain_filter(),
  and = and_filter()
)
summary(res)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 chain         299ms    307ms      3.26     691MB     9.78
#> 2 and           123ms    142ms      7.18     231MB     5.39
summary(res, relative = TRUE)
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 chain       2.43   2.16      1         2.99     1.82
#> 2 and         1      1         2.20      1        1

在这里您可以看到 anding方法是2.43次在这种情况下更快。这意味着链接实际上会增加一些开销,这表明通常anding应该更快。 如果条件逐步减小了 data.table 的大小,则除外。从理论上讲,链接方法甚至可能更慢(甚至不考虑开销),即条件是否会增加数据的大小。但是实际上,我认为这是不可能的,因为 data.table 不允许回收逻辑向量。我认为这可以回答您的奖励问题。

As you can see here the anding approach is 2.43 times faster in this case. That means chaining actually adds some overhead, suggesting that usually anding should be quicker. EXCEPT if conditions are reducing the size of the data.table step by step. Theoretically, the chaining approach could even be slower (even leaving the overhead aside), namely if a condition would increase the size of the data. But practically I think that's not possible since recycling of logical vectors is not allowed in data.table. I think this answers your bonus question.

为进行比较,本机上具有 bench 的原始功能:

For comparison, original functions on my machine with bench:

res <- bench::mark(
  chain = chain_filter_original(),
  and = and_filter_original()
)
summary(res)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 chain        29.6ms   30.2ms     28.5     79.5MB     7.60
#> 2 and         125.5ms  136.7ms      7.32   228.9MB     7.32
summary(res, relative = TRUE)
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 chain       1      1         3.89      1        1.04
#> 2 and         4.25   4.52      1         2.88     1

这篇关于筛选数据表时,链接比ANDing的性能优势的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆