R - vectorised conditional replace


Question



Hi, I'm trying to manipulate a list of numbers, and I would like to do so without a for loop, using fast native operations in R. The pseudocode for the manipulation is:

By default the starting total is 100 (for every block within zeros)

From the first zero to the next zero, the moment the cumulative total falls by more than 2% below its highest value so far in that block, replace all subsequent numbers with zero.

Do this for all blocks of numbers within zeros.

The cumulative sum resets to 100 every time.

For example if following were my data :

d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1);

Results would be :

0 0 0 1 3 4 5 -1 2 3 -5 0 0 0 -2 -3 0 0 0 0 0 -1 -1 -1 0

Currently I have an implementation with a for loop, but since my vector is really long, the performance is terrible.

Thanks in advance.

Here is a running sample code :

d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1);
ans <- d;
running_total <- 100;
count <- 1;
max <- 100;
toggle <- FALSE;
processing <- FALSE;

for(i in d){
  if( i != 0 ){  
       processing <- TRUE; 
       if(toggle == TRUE){
          ans[count] = 0;  
       }
       else{
         running_total = running_total + i;

          if( running_total > max ){ max = running_total;}
          else if ( 0.98*max > running_total){
              toggle <- TRUE;  
          }
      }
   }

   if( i == 0 && processing == TRUE )
   { 
       running_total = 100; 
       max = 100;
       toggle <- FALSE;
   }
   count <- count + 1;
}
cat(ans)

Solution

I am not sure how to translate your loop into vectorized operations. However, there are two fairly easy options for large performance improvements. The first is to simply put your loop into an R function and use the compiler package to precompile it. The second, slightly more complicated, option is to translate your R loop into a C++ loop and use the Rcpp package to link it to an R function. You then call an R function that hands the vector to the C++ code, which is fast. I show both options along with timings. I do want to gratefully acknowledge the help of Alexandre Bujard from the Rcpp listserv, who helped me with a pointer issue I did not understand.

First, here is your R loop as a function, foo.r.

## Your R loop as a function
foo.r <- function(d) {
  ans <- d
  running_total <- 100
  count <- 1
  max <- 100
  toggle <- FALSE
  processing <- FALSE

  for(i in d){
    if(i != 0 ){
      processing <- TRUE
      if(toggle == TRUE){
        ans[count] <- 0
      } else {
        running_total = running_total + i;
        if (running_total > max) {
          max <- running_total
        } else if (0.98*max > running_total) {
          toggle <- TRUE
        }
      }
    }
    if(i == 0 && processing == TRUE) {
      running_total <- 100
      max <- 100
      toggle <- FALSE
    }
    count <- count + 1
  }
  return(ans)
}

Now we can load the compiler package and compile the function and call it foo.rcomp.

## load compiler package and compile your R loop
require(compiler)
foo.rcomp <- cmpfun(foo.r)
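
One caveat for readers on a current version of R: since R 3.4.0 the byte-code JIT compiler is enabled by default, so ordinary functions are compiled automatically after a couple of calls and an explicit cmpfun() may show little or no gain. A minimal sketch for reproducing an interpreted-versus-compiled comparison is to switch the JIT off temporarily. The toy function loop_sum and its compiled copy loop_sum_cmp are hypothetical names used only for illustration; they are not part of the original answer.

```r
library(compiler)

# Turn the JIT off so that plain functions stay interpreted.
# enableJIT() returns the previous level, which we restore at the end.
old_level <- enableJIT(0)

# Toy loop-heavy function (hypothetical, for illustration only)
loop_sum <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i
  s
}

# Explicitly byte-compiled copy of the same function
loop_sum_cmp <- cmpfun(loop_sum)

# Compare timings; the numbers depend entirely on the machine
print(system.time(loop_sum(1e6)))
print(system.time(loop_sum_cmp(1e6)))

# Restore the previous JIT level
enableJIT(old_level)
```

With the JIT left at its default level, the two timings will usually be close to each other, which is why the speed-up reported for foo.rcomp may not reproduce on a recent installation.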

That is all it takes for the compilation route. It is all R and obviously very easy. Now for the C++ approach, we use the Rcpp package as well as the inline package, which allows us to "inline" the C++ code. That is, we do not have to write a separate source file and compile it; we just include it in the R code and the compilation is handled for us.

## load Rcpp package and inline for ease of linking
require(Rcpp)
require(inline)

## Rcpp version
src <- '
  // Wrap the input and work on a copy so the caller vector is not modified
  const NumericVector xx(x);
  int n = xx.size();
  NumericVector res = clone(xx);
  int toggle = 0;
  int processing = 0;
  // double rather than int, so non-integer inputs accumulate as in the R loop
  double tot = 100;
  double max = 100;

  for (int i = 0; i < n; i++) {
    if (xx[i] != 0) {
      processing = 1;
      if (toggle == 1) {
        // the block has already fallen more than 2% below its peak
        res[i] = 0;
      } else {
        tot += xx[i];
        if (tot > max) {
          max = tot;
        } else if (.98 * max > tot) {
          toggle = 1;
        }
      }
    }

    // a zero closes the block: reset the running state
    if (xx[i] == 0 && processing == 1) {
      tot = 100;
      max = 100;
      toggle = 0;
    }
  }
  return res;
'

foo.rcpp <- cxxfunction(signature(x = "numeric"), src, plugin = "Rcpp")
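
As an aside, the same loop can also be compiled without the inline package, using Rcpp's own cppFunction(). The following is only a sketch under stated assumptions: it requires Rcpp and a working C++ toolchain, and the names foo_cpp and peak are hypothetical choices made for this illustration rather than part of the original answer.

```r
library(Rcpp)

# cppFunction() compiles the C++ source and defines an R function
# with the same name as the C++ function (foo_cpp is a hypothetical name).
cppFunction('
NumericVector foo_cpp(NumericVector x) {
  int n = x.size();
  NumericVector res = clone(x);   // copy, so the input is left untouched
  bool toggle = false;
  bool processing = false;
  double tot = 100;               // running total within the current block
  double peak = 100;              // highest running total within the block

  for (int i = 0; i < n; i++) {
    if (x[i] != 0) {
      processing = true;
      if (toggle) {
        res[i] = 0;               // already breached: zero out the element
      } else {
        tot += x[i];
        if (tot > peak) {
          peak = tot;
        } else if (0.98 * peak > tot) {
          toggle = true;          // fell more than 2% below the peak
        }
      }
    } else if (processing) {
      tot = 100;                  // a zero closes the block: reset state
      peak = 100;
      toggle = false;
    }
  }
  return res;
}')

d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
cat(foo_cpp(d), "\n")
```

The logic is intended to mirror foo.r line for line; only the way the C++ source is handed to the compiler differs.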

Now we can test that we get the expected results:

## demonstrate equivalence
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
all.equal(foo.r(d), foo.rcpp(d))

Finally, create a much larger version of d by repeating it 10^5 times, which gives a vector of 2.5 million elements. Then we can time the three different functions: pure R code, compiled R code, and the R function linked to C++ code.

## make larger vector to test performance
dbig <- rep(d, 10^5)

system.time(res.r <- foo.r(dbig))
system.time(res.rcomp <- foo.rcomp(dbig))
system.time(res.rcpp <- foo.rcpp(dbig))

On my system, this gives:

> system.time(res.r <- foo.r(dbig))
   user  system elapsed 
  12.55    0.02   12.61 
> system.time(res.rcomp <- foo.rcomp(dbig))
   user  system elapsed 
   2.17    0.01    2.19 
> system.time(res.rcpp <- foo.rcpp(dbig))
   user  system elapsed 
   0.01    0.00    0.02 

The compiled R code takes about 1/6 the time of the uncompiled R code, needing only about 2 seconds to process the vector of 2.5 million elements. The C++ code is orders of magnitude faster even than the compiled R code, requiring just 0.02 seconds to complete. Aside from the initial setup, the syntax for the basic loop is nearly identical in R and C++, so you do not even lose clarity. I suspect that even if parts or all of your loop could be vectorized in R, you would be hard pressed to beat the performance of the R function linked to C++. Lastly, just for proof:

> all.equal(res.r, res.rcomp)
[1] TRUE
> all.equal(res.r, res.rcpp)
[1] TRUE

The different functions return the same results.
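
For completeness, the rule can also be written with grouped cumulative operations in base R. This is a sketch, not a tuned implementation. It assumes that every zero starts a new block, that the running total is 100 plus the cumulative sum within the block, and that a "breach" is the first point where the total drops below 98% of the block's running maximum. The function name foo.vec is hypothetical. For non-integer data, the last floating-point digit of the running total may differ from the loop's, because the additions are carried out in a different order.

```r
# Vectorised sketch of the same rule (foo.vec is a hypothetical name)
foo.vec <- function(d) {
  # Each zero opens a new block; following non-zeros share its id
  block <- cumsum(d == 0)

  # Running total within each block, starting from 100
  tot <- 100 + ave(d, block, FUN = cumsum)

  # Running maximum within each block, never below the starting 100
  peak <- pmax(100, ave(tot, block, FUN = cummax))

  # 1 where the total has fallen more than 2% below the peak
  breach <- as.numeric(tot < 0.98 * peak)

  # TRUE where a breach happened strictly earlier in the same block
  breached_before <- ave(breach, block,
                         FUN = function(b) cumsum(b) - b) > 0

  ans <- d
  ans[breached_before] <- 0
  ans
}

d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
cat(foo.vec(d), "\n")
# 0 0 0 1 3 4 5 -1 2 3 -5 0 0 0 -2 -3 0 0 0 0 0 -1 -1 -1 0
```

The element that triggers the breach is kept and only the later ones in the block are zeroed, which matches the loop. Because ave() splits the vector by block, a vector with very many short blocks carries per-group overhead, so this version is unlikely to match the C++ loop in speed.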
