针对which()或match()的更有效策略 [英] More efficient strategy for which() or match()

查看:112
本文介绍了针对which()或match()的更有效策略的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个正负矢量

vec<-c(seq(-100,-1), rep(0,20), seq(1,100))

向量比示例大,并且具有一组随机值.我必须反复查找向量中的负数...我发现这效率很低.

the vector is larger than the example, and takes on a random set of values. I have to repetitively find the number of negative numbers in the vector... I am finding this is quite inefficient.

由于我只需要找到负数的数量,并且对向量进行排序,所以我只需要知道前0或正数的索引(实际随机向量中可能没有0).

Since I only need to find the number of negative numbers, and the vector is sorted, I only need to know the index of the first 0 or positive number (there may be no 0s in the actual random vectors).

当前,我正在使用这段代码来查找长度

Currently I am using this code to find the length

length(which(vec<0))

但是这迫使R遍历整个向量,但是由于它是经过排序的,因此没有必要.

but this forces R to go through the entire vector, but since it is sorted, there is no need.

我可以使用

match(0, vec)

但是我的向量并不总是有0s

but my vector does not always have 0s

所以我的问题是,是否有某种match()函数应用条件而不是查找特定值?还是有一种更有效的方式来运行我的what()代码?

So my question is, is there some kind of match() function that applies a condition instead of finding a specific value? Or is there a more efficient way to run my which() code?

谢谢

推荐答案

到目前为止提供的解决方案都暗示着创建logical(length(vec))并对其进行完全或部分扫描.如您所述,向量已排序.我们可以通过执行二进制搜索来利用它.我开始认为自己会超级聪明,可以用C来实现更快的速度,但是在调试算法索引时遇到了麻烦(这是棘手的部分!).所以我在R中写了它:

The solutions offered so far all imply creating a logical(length(vec)) and doing a full or partial scan on this. As you note, the vector is sorted. We can exploit this by doing a binary search. I started thinking I'd be super-clever and implement this in C for even greater speed, but had trouble with debugging the indexing of the algorithm (which is the tricky part!). So I wrote it in R:

f3 <- function(x) {
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= 0)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax
}

与其他建议进行比较

f0 <- function(v) length(which(v < 0))
f1 <- function(v) sum(v < 0)
f2 <- function(v) which.min(v < 0) - 1L

好玩

library(compiler)
f3.c <- cmpfun(f3)

领先

> vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
> identical(f0(vec), f1(vec))
[1] TRUE
> identical(f0(vec), f2(vec))
[1] TRUE
> identical(f0(vec), f3(vec))
[1] TRUE
> identical(f0(vec), f3.c(vec))
[1] TRUE
> microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec))
Unit: microseconds
      expr       min        lq     median         uq       max neval
   f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903   100
   f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293   100
   f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889   100
   f3(vec)    51.715    56.050    75.4495    78.5295   100.730   100
 f3.c(vec)    11.612    17.147    28.5570    31.3160    49.781   100

可能有些棘手的边缘情况我弄错了!转到C,我做到了

Probably there are some tricky edge cases that I've got wrong! Moving to C, I did

library(inline)
f4 <- cfunction(c(x = "numeric"), "
    int imin = 0, imax = Rf_length(x) - 1, imid;
    while (imax >= imin) {
        imid = imin + (imax - imin) / 2;
        if (REAL(x)[imid] >= 0)
            imax = imid - 1;
        else
            imin = imid + 1;
    }
    return ScalarInteger(imax + 1);
")

使用

> identical(f3(vec), f4(vec))
[1] TRUE
> microbenchmark(f3(vec), f3.c(vec), f4(vec))
Unit: nanoseconds
      expr   min      lq  median      uq   max neval
   f3(vec) 52096 53192.0 54918.5 55539.0 69491   100
 f3.c(vec) 10924 12233.5 12869.0 13410.0 20038   100
   f4(vec)   553   796.0   893.5  1004.5  2908   100

R-help 列表.这很慢但很安全,请检查vec是否已实际排序并处理NA值.如果有人想生活在边缘(可以说实施f3或f4并不差),那么

findInterval came up when a similar question was asked on the R-help list. It is slow but safe, checking that vec is actually sorted and dealing with NA values. If one wants to live on the edge (arguably no worse that implementing f3 or f4) then

f5.i <- function(v)
    .Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE))

与C实现几乎一样快,但可能更健壮和向量化(即,在第二个参数中查找值的向量,以便进行类似范围的计算).

is nearly as fast as the C implementation, but likely more robust and vectorized (i.e., look up a vector of values in the second argument, for easy range-like calculations).

这篇关于针对which()或match()的更有效策略的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆