R: fast sliding window with given coordinates


Problem description


I have a data table with nrow being around a million or two and ncol of about 200.

Each row has a coordinate associated with it.

Tiny portion of the data:

[1,] -2.80331471  -0.8874522 -2.34401863   -3.811584   -2.1292443
[2,]  0.03177716   0.2588624  0.82877467    1.955099    0.6321881
[3,] -1.32954665  -0.5433407 -2.19211837   -2.342554   -2.2142461
[4,] -0.60771429  -0.9758734  0.01558774    1.651459   -0.8137684

Coordinates for the first 4 rows:

9928202 9928251 9928288 9928319

What I would like is a function that, given the data and a window size, returns a data table of the same size with a mean sliding window applied to each column. In other words: for each row i, it would find the entries whose coordinates lie between coords[i]-windsize and coords[i]+windsize and replace the initial value with the mean of the values inside that interval (separately for each column).
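To make the definition concrete, here is a minimal brute-force sketch that transcribes it directly. The name `slidingMeanRef` and the toy intensities are made up for illustration; only the four coordinates come from the data above. It is far too slow for a million rows, but it is handy as a reference when checking faster versions.

```r
# Reference implementation: a direct transcription of the definition, O(N^2).
# `slidingMeanRef` is a hypothetical name used only for this illustration.
slidingMeanRef <- function(intensities, coords, windsize) {
  out <- intensities
  for (i in seq_along(coords)) {
    # rows whose coordinate lies within [coords[i] - windsize, coords[i] + windsize]
    inWindow <- abs(coords - coords[i]) <= windsize
    # drop = FALSE keeps a one-row window as a matrix
    out[i, ] <- colMeans(intensities[inWindow, , drop = FALSE])
  }
  out
}

coords <- c(9928202, 9928251, 9928288, 9928319)
# toy values (not the real data): two columns, four rows
intensities <- matrix(c(1, 2, 3, 4,
                        10, 20, 30, 40), ncol = 2)
res <- slidingMeanRef(intensities, coords, windsize = 40)
res
# first column: 1.0 2.5 3.0 3.5; second column: 10 25 30 35
```

With a window of 40, row 1 has no neighbours in range and keeps its own value, row 2 is averaged with row 3, row 3 with rows 2 and 4, and row 4 with row 3.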

Speed is the main issue here.

Here is my first take at such a function.

doSlidingWindow <- function(intensities, coords, windsize) {
windHalfSize <- ceiling(windsize/2)
### whole range inds
RANGE <- integer(max(coords)+windsize)
RANGE[coords] <- c(1:length(coords)[1])

### get indices of rows falling in each window
COORDS <- as.list(coords)
WINDOWINDS <- sapply(COORDS, function(crds){ unique(RANGE[(crds-windHalfSize):
    (crds+windHalfSize)]) })

### do windowing

wind_ints <- intensities
wind_ints[] <- 0
for(i in 1:length(coords)) {
    wind_ints[i,] <- apply(as.matrix(intensities[WINDOWINDS[[i]],]), 2, mean)
}
return(wind_ints)
}

The code before the last for loop is quite fast, and it gets me a list of the indices I need to use for each entry. After that, however, everything falls apart: I need to grind through the for loop a million times, take subsets of my data table, and make sure that I have more than one row so that apply can work on all the columns at once.

My second approach is to just stick the actual values in the RANGE list, fill the gaps with zeroes and do rollmean from zoo package, repeated for each column. But this is redundant since rollmean will go through all the gaps and I will only be using the values for original coordinates in the end.

Any help making it faster without going to C would be much appreciated.

Solution

Data generation:

N <- 1e5 # rows
M <- 200 # columns
W <- 10  # window size

set.seed(1)
intensities <- matrix(rnorm(N*M), nrow=N, ncol=M)
coords <- 8000000 + sort(sample(1:(5*N), N))

The original function, with the minor modifications I used for the benchmarks:

doSlidingWindow <- function(intensities, coords, windsize) {
  windHalfSize <- ceiling(windsize/2)
  ### whole range inds
  RANGE <- integer(max(coords)+windsize)
  RANGE[coords] <- c(1:length(coords)[1])

  ### get indices of rows falling in each window
  ### NOTE: Each element of WINDOWINDS usually also contains a zero (from the empty
  ### positions in RANGE). Not a big problem though: R ignores a zero index when subsetting.
  WINDOWINDS <- sapply(coords, function(crds) ret <- unique(RANGE[(crds-windHalfSize):(crds+windHalfSize)]))

  ### do windowing
  wind_ints <- intensities
  wind_ints[] <- 0
  for(i in 1:length(coords)) {
    # CORRECTION: a window containing a single row used to cause trouble,
    # so the subset is forced back into a matrix before apply()
    wind_ints[i,] <- apply(matrix(intensities[WINDOWINDS[[i]],], ncol=ncol(intensities)), 2, mean)
  }
  return(wind_ints)
}
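The RANGE lookup and the zero mentioned in the note can be seen on a tiny example. The coordinates and values below are made up for illustration; the variable names follow the function above.

```r
coords <- c(5L, 8L, 9L, 14L)           # small made-up coordinates
windHalfSize <- 2L
RANGE <- integer(max(coords) + windHalfSize)
RANGE[coords] <- seq_along(coords)      # position -> row index, 0 where no row exists

# window around the second row (coordinate 8): positions 6..10
idx <- unique(RANGE[(8L - windHalfSize):(8L + windHalfSize)])
idx                                     # 0 2 3: rows 2 and 3, plus a 0 from the gaps

x <- c(10, 20, 30, 40)
x[idx]                                  # 20 30: the zero index is silently dropped
```

Note that RANGE has one element per possible coordinate, so its size grows with max(coords) rather than with the number of rows.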


POSSIBLE SOLUTIONS:


1) data.table

data.table is known to be fast at subsetting, but this page (and others related to sliding windows) suggests that this is not the case here. Indeed, the data.table code is elegant, but unfortunately very slow:

require(data.table)
require(plyr)
dt <- data.table(coords, intensities)
setkey(dt, coords)
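# NOTE: WINDOWINDS is the per-row index list built inside doSlidingWindow;
# it has to be available in the calling environment for this line to run
aaply(1:N, 1, function(i) dt[WINDOWINDS[[i]], sapply(.SD,mean), .SDcols=2:(M+1)])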


2) foreach+doSNOW

The basic routine is easy to run in parallel, so we can benefit from that:

require(doSNOW)
doSlidingWindow2 <- function(intensities, coords, windsize) {
  NC <- 2 # number of nodes in cluster
  cl <- makeCluster(rep("localhost", NC), type="SOCK")
  registerDoSNOW(cl)

  N <- ncol(intensities) # total number of columns
  chunk <- ceiling(N/NC) # number of columns sent to each node

  result <- foreach(i=1:NC, .combine=cbind, .export=c("doSlidingWindow")) %dopar% {
    start <- (i-1)*chunk+1
    end   <- ifelse(i!=NC, i*chunk, N)
    doSlidingWindow(intensities[,start:end], coords, windsize)    
  }

  stopCluster(cl)
  return (result)
}
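The hand-computed start/end ranges work for 200 columns on 2 nodes, but for other combinations they can overshoot (for example, 5 columns on 4 nodes gives a chunk of 2, and the third node would be asked for columns 5:6). As a hedged alternative, the sketch below uses `splitIndices()` from the `parallel` package that ships with R to partition the columns; the values of `M` and `NC` are made up for illustration.

```r
library(parallel)  # ships with R; used here only for splitIndices()

M  <- 7   # made-up number of columns
NC <- 3   # made-up number of worker nodes

# list of NC integer vectors that together cover 1..M without overlap
chunks <- splitIndices(M, NC)

# every column index appears exactly once across the chunks
covered <- sort(unlist(chunks))
```

Each element of `chunks` could then be used in place of `start:end` when subsetting `intensities` inside the foreach body.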

The benchmark shows a notable speed-up on my dual-core processor:

system.time(res <- doSlidingWindow(intensities, coords, W))
#    user  system elapsed 
# 306.259   0.204 307.770
system.time(res2 <- doSlidingWindow2(intensities, coords, W))
#  user  system elapsed 
# 1.377   1.364 177.223
all.equal(res, res2, check.attributes=FALSE)
# [1] TRUE


3) Rcpp

Yes, I know you asked "without going to C". But, please, take a look. This code is inline and rather straightforward:

require(Rcpp)
require(inline)
doSlidingWindow3 <- cxxfunction(signature(intens="matrix", crds="numeric", wsize="numeric"), plugin="Rcpp", body='
  #include <vector>
  Rcpp::NumericMatrix intensities(intens);
  const int N = intensities.nrow();
  const int M = intensities.ncol();
  Rcpp::NumericMatrix wind_ints(N, M);

  std::vector<int> coords = as< std::vector<int> >(crds);
  int windsize = ceil(as<double>(wsize)/2);  

  for(int i=0; i<N; i++){
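    // Simple search for the window range (begin:end, as row indices into coords)
    // Assumes coords are sorted, distinct integers, so a window can reach
    // at most windsize rows to either side of row i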
    int begin = (i-windsize)<0?0:(i-windsize);
    while(coords[begin]<(coords[i]-windsize)) ++begin;
    int end = (i+windsize)>(N-1)?(N-1):(i+windsize);
    while(coords[end]>(coords[i]+windsize)) --end;

    for(int j=0; j<M; j++){
      double result = 0.0;
      for(int k=begin; k<=end; k++){
        result += intensities(k,j);
      }
      wind_ints(i,j) = result/(end-begin+1);
    }
  }

  return wind_ints;
')

Benchmark:

system.time(res <- doSlidingWindow(intensities, coords, W))
#    user  system elapsed 
# 306.259   0.204 307.770
system.time(res3 <- doSlidingWindow3(intensities, coords, W))
#  user  system elapsed 
# 0.328   0.020   0.351
all.equal(res, res3, check.attributes=FALSE)
# [1] TRUE

I hope the results are motivating. As long as the data fits in memory, the Rcpp version is pretty fast. Say, with N <- 1e6 and M <- 100 I got:

   user  system elapsed 
  2.873   0.076   2.951

Naturally, once R starts using swap, everything slows down. For really large data that doesn't fit in memory, you should consider sqldf, ff or bigmemory.
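For readers who want to stay in R, the window search used in the Rcpp loop can also be vectorised. The following is only a sketch under stated assumptions, not part of the benchmarks above: `slidingMeanCumsum` and `halfWidth` are hypothetical names, coords must be sorted, and the `left.open` argument of `findInterval()` needs R 3.3.0 or later. It locates each window with `findInterval()` and obtains the column sums as differences of cumulative sums, so results may differ from a direct mean by floating-point rounding.

```r
# `slidingMeanCumsum` is a hypothetical name; assumes coords are sorted (non-decreasing).
slidingMeanCumsum <- function(intensities, coords, halfWidth) {
  stopifnot(!is.unsorted(coords))
  # index of the last row with coordinate <= coords[i] + halfWidth
  last <- findInterval(coords + halfWidth, coords)
  # number of rows with coordinate strictly below coords[i] - halfWidth
  before <- findInterval(coords - halfWidth, coords, left.open = TRUE)
  # column-wise prefix sums with a leading zero row: cs[k + 1, ] = sum of rows 1..k
  cs <- apply(rbind(0, intensities), 2, cumsum)
  # window sum = prefix(last) - prefix(before); divide each row by its window size
  (cs[last + 1, , drop = FALSE] - cs[before + 1, , drop = FALSE]) / (last - before)
}

set.seed(1)
coords <- sort(sample(1:500, 100))
intensities <- matrix(rnorm(100 * 3), ncol = 3)
res <- slidingMeanCumsum(intensities, coords, halfWidth = 5)
```

To compare against doSlidingWindow, `halfWidth` should correspond to `ceiling(windsize/2)`, since that is the half-window the functions above actually use.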
