Why is subsetting on a "logical" type slower than subsetting on "numeric" type?

Problem description

Suppose we have a vector (or a data.frame, for that matter) as follows:

set.seed(1)
x <- sample(10, 1e6, TRUE)

Suppose one wants to get all values of x where x > 4, say:

a1 <- x[x > 4] # (or) 
a2 <- x[which(x > 4)]

identical(a1, a2) # TRUE

I think most people would prefer x[x > 4]. But surprisingly (at least to me), subsetting using which is faster!

require(microbenchmark)
microbenchmark(x[x > 4], x[which(x > 4)], times = 100)

Unit: milliseconds
            expr      min       lq   median       uq       max neval
        x[x > 4] 56.59467 57.70877 58.54111 59.94623 104.51472   100
 x[which(x > 4)] 26.62217 27.64490 28.31413 29.97908  99.68973   100

That is about 2.1 times faster on my machine.

One possible reason for the difference, I thought, is that which does not consider NA, whereas > returns them as well. But then the logical operation itself would have to account for the difference, which is (obviously) not the case. That is:

microbenchmark(x > 4, which(x > 4), times = 100)

Unit: milliseconds
         expr       min       lq   median       uq      max neval
        x > 4  8.182576 10.06163 12.68847 14.64203 60.83536   100
 which(x > 4) 18.579746 19.94923 21.43004 23.75860 64.20152   100

Computing which(x > 4) is about 1.7 times slower than computing x > 4 alone, before any subsetting happens. But which seems to catch up drastically during subsetting.

It does not seem possible to use my usual weapon of choice, debugonce (thanks to @GavinSimpson), because which calls .Internal(which(x)) whereas == calls .Primitive("==").

My question therefore is: why is [ faster on the numeric index resulting from which than on the logical vector resulting from >? Any ideas?

Recommended answer

I think I should move out of the comments and add an answer. This is my hunch, building on what the others have answered and discussed. (I'm sure the real answer exists in the C source for subset_dflt.)

Once I have a vector x and a logical vector x > 0, I can subset x on x > 0 in two ways. I can use which, or I can use the vector x > 0 directly as the index. However, we must note that the two are not identical, since x[x > 0] will preserve NAs while x[which(x > 0)] will not.
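
To make the NA difference concrete, here is a small illustration. The vector below is made up for this example and is not the x used in the benchmarks:

```r
# Illustrative data only: a short numeric vector containing NAs.
x <- c(3, NA, -1, 5, NA, 0, 2)

by_logical <- x[x > 0]         # an NA in the logical index yields an NA in the result
by_which   <- x[which(x > 0)]  # which() returns only the positions that are TRUE

print(by_logical)
print(by_which)

# Once the NAs are dropped, both approaches select the same values.
identical(by_logical[!is.na(by_logical)], by_which)
```

So the two forms are interchangeable only when the condition can never evaluate to NA.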

However, in either method, I will need to examine each element of the vector x > 0. In an explicit which call I will have to examine only the boolean state of the element while in a direct sub-setting operation I will have to examine both missing-ness and the boolean state of each element.
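
The per-element work described above can be modelled at the R level. This is only a sketch of the semantics, not the actual C implementation: subset_by_logical and subset_by_which are hypothetical helper names, and the sketch assumes the index has the same length as x (no recycling).

```r
# Hypothetical model of x[idx] for a logical idx:
# every element is checked for missingness first, then for truth.
subset_by_logical <- function(x, idx) {
  out <- vector(typeof(x), 0)
  for (i in seq_along(idx)) {
    if (is.na(idx[i])) {
      out <- c(out, NA)      # a missing index element produces NA
    } else if (idx[i]) {
      out <- c(out, x[i])    # a TRUE index element selects x[i]
    }
  }
  out
}

# Hypothetical model of x[which(idx)]:
# only "is this element TRUE?" is asked; NA and FALSE are both skipped.
subset_by_which <- function(x, idx) {
  pos <- integer(0)
  for (i in seq_along(idx)) {
    if (isTRUE(idx[i])) pos <- c(pos, i)
  }
  x[pos]
}

x <- c(3L, NA, -1L, 5L, 0L)   # illustrative data only
idx <- x > 0

identical(subset_by_logical(x, idx), x[idx])
identical(subset_by_which(x, idx), x[which(idx)])
```

The loops are deliberately naive and slow; they exist only to show which checks each route needs, not to reproduce the timings.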

@flodel brings an interesting observation. Since [, is.na, which, and | are all primitives or internal routines, let's assume no extraordinary overhead and do this experiment:

microbenchmark(which(x > 0), x[which(x > 0)], x > 0 | is.na(x), x[x > 0],
               unit="us", times=1000)

Unit: microseconds
             expr      min       lq   median       uq      max neval
     which(x > 0) 1219.274 1238.693 1261.439 1900.871 23085.57  1000
  x[which(x > 0)] 1554.857 1592.543 1974.370 2339.238 23816.99  1000
 x > 0 | is.na(x) 3439.191 3459.296 3770.260 4194.474 25234.70  1000
         x[x > 0] 3838.455 3876.816 4267.261 4621.544 25734.53  1000

Considering median values, and assuming x > 0 | is.na(x) is a crude model of what I am saying happens in logical sub-setting, the time actually spent in the 'subset' step of x[x > 0] is ~ 500 us (4267 - 3770), while the time spent in the 'subset' step with which is ~ 700 us (1974 - 1261). The two numbers are comparable and indicate that it is not the 'subset'ing itself that is costly in one method or the other. Instead, it is the work done to compute the wanted subset that is cheaper in the which method.
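
The arithmetic behind those estimates, using the medians from the table above (the variable names are just labels chosen for this sketch):

```r
# Median timings in microseconds, copied from the benchmark output above.
med <- c(
  which     = 1261.439,  # which(x > 0)
  x_which   = 1974.370,  # x[which(x > 0)]
  mask_na   = 3770.260,  # x > 0 | is.na(x)
  x_logical = 4267.261   # x[x > 0]
)

# Estimated cost of the subsetting step alone in each route.
subset_cost_which   <- med[["x_which"]]   - med[["which"]]    # about 713 us
subset_cost_logical <- med[["x_logical"]] - med[["mask_na"]]  # about 497 us

# Cost of building the index: the crude logical model vs which().
index_cost_ratio <- med[["mask_na"]] / med[["which"]]         # roughly 3

print(c(subset_cost_which, subset_cost_logical, index_cost_ratio))
```

Under this crude model, building the index is roughly three times more expensive on the logical route, while the subsetting steps differ by only a couple of hundred microseconds.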
