使用C和并行R中快速相关 [英] Fast correlation in R using C and parallelization

查看:186
本文介绍了使用C和并行R中快速相关的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我今天的项目是使用基本技能我必须写在R A快速相关程序。我必须找到各有将近一百万的观测近400变量之间的相关性(即大小P = 1MM行和放大器的矩阵; N = 400 COLS)。

研究的本地相关函数花费1MM行近2分钟,每个变量200观察。我没有为每列400的观测运行,但我的猜测是,这将需要近8分钟。我有不到30秒完成它。

因此​​,我想要做的做的事情。

1 - C编写一个简单的相关函数,并将其应用在平行块(见下文)

2 - 块 - 在三个块(大小K * K,尺寸的右下方(PK)的(PK)和尺寸的K右上方矩形矩阵<顶部左方分裂相关矩阵/ EM>(PK))。这涵盖了所有细胞的相关矩阵科尔,因为我只需要上三角。

3 - 运行通过电话.C采用平行降雪C函数

  N = 100
P = 10
X =矩阵(RNORM(N * P),nrow = N,NCOL = P)
压改正=矩阵(0,nrow = P,NCOL = p)的列明智的均值和标准差的计算#传递给CORR函数
亩= colMeans(X)
SD = sapply(1:昏暗(X)的[2],函数(x)的标准差(X [中,x))#设置子矩阵的行和列范围
K = as.integer(P / 2)RowRange =名单()
ColRange =名单()
RowRange [[1]] = C(0,K)的
ColRange [[1]] = C(0,K)的RowRange [[2] = C(0,K)的
ColRange [[2] = C(K,P + 1)的RowRange [[3]] = C(K,P + 1)的
ColRange [[3]] = C(K,P + 1)的#方法1不平行
########################
#函数来计算的相关子矩阵
BigCorr&LT; - 功能(X){
  行= RowRange [[X]]
  COLS = ColRange [[X]]
  返回(.C(rCorrelationWrapper2,as.matrix(X),as.integer(点心(X)),
            as.double(亩),as.double(SD)
            as.integer(行),as.integer(COLS)
            as.matrix(科尔)))
}RES =名单()
为(ⅰ在1:3){
  RES [我] = BigCorr(I)
}#方法2
########################
BigCorr&LT; - 功能(X){
    行= RowRange [[X]]
    COLS = ColRange [[X]]
    dyn.load(./ rCorrelation.so)
    返回(.C(rCorrelationWrapper2,as.matrix(X),as.integer(点心(X)),
          as.double(亩),as.double(SD)
          as.integer(行),as.integer(COLS)
          as.matrix(科尔)))
}#并行化设置
NUM_CPU = 4
库('雪')
sfSetMaxCPUs()#最大的CPU处理
sfInit(并行= TRUE,处理器= NUM​​_CPU)#的init并行特效
sfExport(X,RowRange,ColRange,SD,μ,压改正)
RES = sfLapply(1:3,BigCorr)
sfStop()

下面是我的问题:

有关方法1,它的工作原理,但不是我想要的方式。我相信,当我通过科尔矩阵,我传递一个地址和C将作出在源代码更改。

 方法1#输出
&GT;水库[[1]] [7]
      [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
 [1,] 1 0.1040506 -0.01003125 0.23716384 -0.088246793 0 0 0 0 0
 [2,] 0 1.0000000 -0.09795989 0.11274508 0.025754150 0 0 0 0 0
 [3] 0 0.0000000 1.00000000 0.09221441 0.052923520 0 0 0 0 0
 [4,] 0 0.0000000 0.00000000 1.00000000 -0.000449975 0 0 0 0 0
 [5,1] 0 0.0000000 0.00000000 0.00000000 1.000000000 0 0 0 0 0
 [6] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0
 [7] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0
 [8] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0
 [9] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0
[10] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0
&GT;水库[[2]] [[7]
      [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
 [1,] 0 0 0 0 0 -0.02261175 -0.23398448 -0.02382690 -0.09668318 -0.1447913
 [2,] 0 0 0 0 0 -0.03439707 0.04580888 0.13229376 0.1354754 -0.03376527
 [3,] 0 0 0 0 0 0.10360907 -0.05490361 -0.01237932 0.08123683 -0.1657041
 [4,] 0 0 0 0 0 0.18259522 -0.23849323 -0.15928474 0.1648969 -0.05005328
 [5,1] 0 0 0 0 0 -0.01012952 -0.03482429 0.14680301 -0.1112500 0.02801333
 [6] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
 [7] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
 [8] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
 [9] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
[10] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
&GT;水库[[3]] [[7]
      [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
 [1,] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000
 [2,] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000
 [3] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000
 [4,] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000
 [5,1] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000
 [6,] 0 0 0 0 0 1 0.03234195 -0.03488812 -0.18570151 0.14064640
 [7,] 0 0 0 0 0 0 1.00000000 0.03449697 -0.06765511 -0.15057244
 [8,] 0 0 0 0 0 0 0.00000000 1.00000000 -0.03426464 0.10030619
 [9] 0 0 0 0 0 0 0.00000000 0.00000000 1.00000000 -0.08720512
[10] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 1.00000000

但原科尔矩阵保持不变:

 &GT;科尔
      [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
 [1,] 0 0 0 0 0 0 0 0 0 0
 [2,] 0 0 0 0 0 0 0 0 0 0
 [3,] 0 0 0 0 0 0 0 0 0 0
 [4,] 0 0 0 0 0 0 0 0 0 0
 [5,1] 0 0 0 0 0 0 0 0 0 0
 [6,] 0 0 0 0 0 0 0 0 0 0
 [7,] 0 0 0 0 0 0 0 0 0 0
 [8,] 0 0 0 0 0 0 0 0 0 0
 [9,] 0 0 0 0 0 0 0 0 0 0
[10,] 0 0 0 0 0 0 0 0 0 0

问题1:有没有什么办法,以确保C函数在源头改变科尔的价值?我仍然可以合并这三个创建上三角相关矩阵,但我想知道,如果在源头改变是可能的。注意:这并不能帮助我实现快速相关,因为我只是运行一个循环

问题2:对于方法2,我怎么加载共享对象每个内核并行作业上的初始步骤每个核心(而不是如何我都做到了)

问题3:这是什么错误呢?我需要一些指点,我很想来调试它自己。

问题4:有400多计算矩阵1MM的相关性,在不到30秒的快速方法

当我运行方法2,我收到以下错误:

  R(6107)的malloc:***错误对象0x100664df8:为释放对象不正确的校验 - 对象被释放后,可能被修改。
***在malloc_error_break设置一个断点调试
错误反序列化(节点$ CON):从连接读取错误

下面附

是我的单纯功能C code代表的相关性:

 的#include&LT;&stdio.h中GT;
#包括LT&;&math.h中GT;
#包括LT&;&stdlib.h中GT;
#包括LT&;&STDDEF.H GT;
#包括LT&;&R.H GT; //显示中的R的错误
双calcMean(双* X,INT N);
双calcStdev(双* X,双亩,INT N);
双calcCov(双* X,双* Y,INT N,厦门大学双,双YMU);无效rCorrelationWrapper2(双* X,INT *黯淡,双*万亩,双SD *,为int * RowRange,为int * ColRange,双*科尔){    INT I,J,N =暗淡[0],P =暗淡[1];
    INT RowStart = RowRange [0],RowEnd = RowRange [1],ColStart = ColRange [0],ColEnd = ColRange [1];
    双xyCov;    Rprintf(\\ n号码:%D,%D&LT; =&行LT;%D,%D&LT; =&山坳LT;%D,P,RowStart,RowEnd,ColStart,ColEnd);    如果(RowStart == ColStart&放大器;&安培; RowEnd == ColEnd){
        对于(i = RowStart; I&LT; RowEnd;我++){
            为(J = I; J&LT; ColEnd; J ++){
                Rprintf(\\ N I:%d个,J:%D,P数:%d,I,J,P);
                xyCov = calcCov(X + I * N,X + J * N,N,亩[I],亩[J]);
                *(更正+ J * P + I)= xyCov /(SD [I] * SD [J]);
            }
        }
    }其他{
        对于(i = RowStart; I&LT; RowEnd;我++){
            为(J = ColStart; J&LT; ColEnd; J ++){
                xyCov = calcCov(X + I * N,X + J * N,N,亩[I],亩[J]);
                *(更正+ J * P + I)= xyCov /(SD [I] * SD [J]);
            }
        }
    }
}
//函数来计算平均双calcMean(双* X,INT N){
    双S = 0;
    INT I;
    对于(i = 0; I&LT; N;我++){
        S = S + *(X + I);
    }
    返回(S / N);
}//函数计算标准devation双calcStdev(双* X,双亩,INT N){
    双T,SD = 0;
    INT I;    对于(i = 0; I&LT; N;我++){
        T = *(X + I) - 亩;
        SD = SD + T * T;
    }
    返回(SQRT(SD /(N-1)));
}
//函数来计算方差双calcCov(双* X,双* Y,INT N,厦门大学双,双YMU){
    双S = 0;
    INT I;    对于(i = 0; I&LT; N;我++){
        S = S +(*(X + I)-xmu)*(*(Y + I)-ymu);
    }
    返回(S /(N-1));
}


解决方案

使用快速BLAS(通过革命R或转到BLAS),您可以计算所有这些相关性快速R中,而无需编写任何C code。
在我的第一代酷睿i7英特尔PC它需要16秒时:

  N = 400;
M = 1e6个电子;#生成数据
垫=矩阵(runif(M * N),N,M);
#启动计时器
TIC = proc.time();
#中心每个变量
垫垫= - rowMeans(垫);
#规范每个变量
垫垫= /开方(rowSums(垫^ 2));
#相关计算
CR = tcrossprod(垫);
#停止计时器
TOC = proc.time();#显示结果和时间
展会(CR [1:4,1:4]);
展会(TOC-TIC)

第r code上述报告以下时间:

 用户系统经过
31.82 1.98 15.74

我用我的 MatrixEQTL 包这种方法。

<一href=\"http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/\">http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/

有关的R各种BLAS选项的详细信息,请访问:

<一href=\"http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/runit.html#large\">http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/runit.html#large

My project for today was to write a fast correlation routine in R using the basic skillset I have. I have to find the correlation between almost 400 variables each having almost a million observations (i.e. a matrix of size p=1MM rows & n=400 cols).

R's native correlation function takes almost 2 mins for 1MM rows and 200 observations per variable. I have not run for 400 observations per column, but my guess is it will take almost 8 mins. I have less than 30 secs to finish it.

Hence, I want to do do things.

1 - write a simple correlation function in C and apply it in blocks parallely (see below).

2 - The blocks - split the correlation matrix in three blocks (top left square of size K*K, bottom right square of size (p-K)(p-K), and top right rectangular matrix of size K(p-K)). This covers all cells in the correlation matrix corr since I only need the upper triangle.

3 - run the C function via a .C call parallely using snowfall.

n = 100
p = 10
X = matrix(rnorm(n*p), nrow=n, ncol=p)
corr = matrix(0, nrow=p, ncol=p)

# calculation of column-wise mean and sd to pass to corr function
mu = colMeans(X)
sd = sapply(1:dim(X)[2], function(x) sd(X[,x]))

# setting up submatrix row and column ranges
K = as.integer(p/2)

RowRange = list()
ColRange = list()
RowRange[[1]] = c(0, K)
ColRange[[1]] = c(0, K)

RowRange[[2]] = c(0, K)
ColRange[[2]] = c(K, p+1)

RowRange[[3]] = c(K, p+1)
ColRange[[3]] = c(K, p+1)

# METHOD 1. NOT PARALLEL
########################
# function to calculate correlation on submatrices
BigCorr <- function(x){
  Rows = RowRange[[x]]
  Cols = ColRange[[x]]    
  return(.C("rCorrelationWrapper2", as.matrix(X), as.integer(dim(X)), 
            as.double(mu), as.double(sd), 
            as.integer(Rows), as.integer(Cols), 
            as.matrix(corr)))
}

res = list()
for(i in 1:3){
  res[[i]] = BigCorr(i)
}

# METHOD 2
########################
BigCorr <- function(x){
    Rows = RowRange[[x]]
    Cols = ColRange[[x]]    
    dyn.load("./rCorrelation.so")
    return(.C("rCorrelationWrapper2", as.matrix(X), as.integer(dim(X)), 
          as.double(mu), as.double(sd), 
          as.integer(Rows), as.integer(Cols), 
          as.matrix(corr)))
}

# parallelization setup
NUM_CPU = 4
library('snowfall')
sfSetMaxCPUs() # maximum cpu processing
sfInit(parallel=TRUE,cpus=NUM_CPU) # init parallel procs
sfExport("X", "RowRange", "ColRange", "sd", "mu", "corr")  
res = sfLapply(1:3, BigCorr)
sfStop()  

Here is my problem:

for method 1, it works, but not the way I want it to. I believed, that when I pass the corr matrix, I am passing an address and C would be making changes at source.

# Output of METHOD 1
> res[[1]][[7]]
      [,1]      [,2]        [,3]       [,4]         [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1 0.1040506 -0.01003125 0.23716384 -0.088246793    0    0    0    0     0
 [2,]    0 1.0000000 -0.09795989 0.11274508  0.025754150    0    0    0    0     0
 [3,]    0 0.0000000  1.00000000 0.09221441  0.052923520    0    0    0    0     0
 [4,]    0 0.0000000  0.00000000 1.00000000 -0.000449975    0    0    0    0     0
 [5,]    0 0.0000000  0.00000000 0.00000000  1.000000000    0    0    0    0     0
 [6,]    0 0.0000000  0.00000000 0.00000000  0.000000000    0    0    0    0     0
 [7,]    0 0.0000000  0.00000000 0.00000000  0.000000000    0    0    0    0     0
 [8,]    0 0.0000000  0.00000000 0.00000000  0.000000000    0    0    0    0     0
 [9,]    0 0.0000000  0.00000000 0.00000000  0.000000000    0    0    0    0     0
[10,]    0 0.0000000  0.00000000 0.00000000  0.000000000    0    0    0    0     0
> res[[2]][[7]]
      [,1] [,2] [,3] [,4] [,5]        [,6]        [,7]        [,8]       [,9]       [,10]
 [1,]    0    0    0    0    0 -0.02261175 -0.23398448 -0.02382690 -0.1447913 -0.09668318
 [2,]    0    0    0    0    0 -0.03439707  0.04580888  0.13229376  0.1354754 -0.03376527
 [3,]    0    0    0    0    0  0.10360907 -0.05490361 -0.01237932 -0.1657041  0.08123683
 [4,]    0    0    0    0    0  0.18259522 -0.23849323 -0.15928474  0.1648969 -0.05005328
 [5,]    0    0    0    0    0 -0.01012952 -0.03482429  0.14680301 -0.1112500  0.02801333
 [6,]    0    0    0    0    0  0.00000000  0.00000000  0.00000000  0.0000000  0.00000000
 [7,]    0    0    0    0    0  0.00000000  0.00000000  0.00000000  0.0000000  0.00000000
 [8,]    0    0    0    0    0  0.00000000  0.00000000  0.00000000  0.0000000  0.00000000
 [9,]    0    0    0    0    0  0.00000000  0.00000000  0.00000000  0.0000000  0.00000000
[10,]    0    0    0    0    0  0.00000000  0.00000000  0.00000000  0.0000000  0.00000000
> res[[3]][[7]]
      [,1] [,2] [,3] [,4] [,5] [,6]       [,7]        [,8]        [,9]       [,10]
 [1,]    0    0    0    0    0    0 0.00000000  0.00000000  0.00000000  0.00000000
 [2,]    0    0    0    0    0    0 0.00000000  0.00000000  0.00000000  0.00000000
 [3,]    0    0    0    0    0    0 0.00000000  0.00000000  0.00000000  0.00000000
 [4,]    0    0    0    0    0    0 0.00000000  0.00000000  0.00000000  0.00000000
 [5,]    0    0    0    0    0    0 0.00000000  0.00000000  0.00000000  0.00000000
 [6,]    0    0    0    0    0    1 0.03234195 -0.03488812 -0.18570151  0.14064640
 [7,]    0    0    0    0    0    0 1.00000000  0.03449697 -0.06765511 -0.15057244
 [8,]    0    0    0    0    0    0 0.00000000  1.00000000 -0.03426464  0.10030619
 [9,]    0    0    0    0    0    0 0.00000000  0.00000000  1.00000000 -0.08720512
[10,]    0    0    0    0    0    0 0.00000000  0.00000000  0.00000000  1.00000000

But the original corr matrix remains unchanged:

> corr
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0    0    0    0    0    0    0    0    0     0
 [2,]    0    0    0    0    0    0    0    0    0     0
 [3,]    0    0    0    0    0    0    0    0    0     0
 [4,]    0    0    0    0    0    0    0    0    0     0
 [5,]    0    0    0    0    0    0    0    0    0     0
 [6,]    0    0    0    0    0    0    0    0    0     0
 [7,]    0    0    0    0    0    0    0    0    0     0
 [8,]    0    0    0    0    0    0    0    0    0     0
 [9,]    0    0    0    0    0    0    0    0    0     0
[10,]    0    0    0    0    0    0    0    0    0     0

Question #1: Is there any way to ensure that the C function changes values of corr at source? I can still merge these three to create an upper triangular correlation matrix, but I wanted to know if change at source is possible. Note: this does not help me accomplish fast correlation since I am merely running a loop.

Question #2: For METHOD 2, how do I load the shared object to each core for parallel jobs on each core at the init step (and not how I have done it)?

Question #3: What does this error mean? I need some pointers, and I would love to debug it myself.

Question #4: Is there a fast way of calculating correlation over matrices 1MM by 400, in less then 30 secs?

When I run METHOD 2, I get the following error:

R(6107) malloc: *** error for object 0x100664df8: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
Error in unserialize(node$con) : error reading from connection

Attached below is my plain vanilla C code for correlation:

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <stddef.h>
#include <R.h> // to show errors in R


double calcMean (double *x, int n);
double calcStdev (double *x, double mu, int n);
double calcCov(double *x, double *y, int n, double xmu, double ymu);        

void rCorrelationWrapper2 ( double *X, int *dim, double *mu, double *sd, int *RowRange, int *ColRange, double *corr) {

    int i, j, n = dim[0], p = dim[1];
    int RowStart = RowRange[0], RowEnd = RowRange[1], ColStart = ColRange[0], ColEnd = ColRange[1];
    double xyCov;

    Rprintf("\n p: %d, %d <= row < %d, %d <= col < %d", p, RowStart, RowEnd, ColStart, ColEnd);

    if(RowStart==ColStart && RowEnd==ColEnd){
        for(i=RowStart; i<RowEnd; i++){
            for(j=i; j<ColEnd; j++){
                Rprintf("\n i: %d, j: %d, p: %d", i, j, p);
                xyCov = calcCov(X + i*n, X + j*n, n, mu[i], mu[j]);
                *(corr + j*p + i) = xyCov/(sd[i]*sd[j]);
            }
        }
    } else {
        for(i=RowStart; i<RowEnd; i++){
            for (j=ColStart; j<ColEnd; j++){
                xyCov = calcCov(X + i*n, X + j*n, n, mu[i], mu[j]);
                *(corr + j*p + i) = xyCov/(sd[i]*sd[j]);
            }
        }
    }
}


// function to calculate mean

double calcMean (double *x, int n){
    double s = 0;
    int i;
    for(i=0; i<n; i++){     
        s = s + *(x+i);
    }
    return(s/n);
}

// function to calculate standard devation

double calcStdev (double *x, double mu, int n){
    double t, sd = 0;
    int i;

    for (i=0; i<n; i++){
        t = *(x + i) - mu;
        sd = sd + t*t;
    }    
    return(sqrt(sd/(n-1)));
}


// function to calculate covariance

double calcCov(double *x, double *y, int n, double xmu, double ymu){
    double s = 0;
    int i;

    for(i=0; i<n; i++){
        s = s + (*(x+i)-xmu)*(*(y+i)-ymu);
    }
    return(s/(n-1));
}

解决方案

Using a fast BLAS (via Revolution R or Goto BLAS) you can calculate all these correlations fast in R without writing any C code. On my first generation Intel i7 PC it takes 16 seconds:

n = 400;
m = 1e6;

# Generate data
mat = matrix(runif(m*n),n,m);
# Start timer
tic = proc.time();
# Center each variable
mat = mat - rowMeans(mat);
# Standardize each variable
mat = mat / sqrt(rowSums(mat^2));   
# Calculate correlations
cr = tcrossprod(mat);
# Stop timer
toc = proc.time();

# Show the results and the time
show(cr[1:4,1:4]);
show(toc-tic)

The R code above reports the following timing:

 user  system elapsed 
31.82    1.98   15.74 

I use this approach in my MatrixEQTL package.
http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/

More information about various BLAS options for R is available here:
http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/runit.html#large

这篇关于使用C和并行R中快速相关的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆