使用C和并行R中快速相关 [英] Fast correlation in R using C and parallelization

查看：186 发布时间：2016/8/21 20:42:38 c r correlation snowfall

本文介绍了使用C和并行R中快速相关的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我今天的项目是使用基本技能我必须写在R A快速相关程序。我必须找到各有将近一百万的观测近400变量之间的相关性（即大小P = 1MM行和放大器的矩阵; N = 400 COLS）。

研究的本地相关函数花费1MM行近2分钟，每个变量200观察。我没有为每列400的观测运行，但我的猜测是，这将需要近8分钟。我有不到30秒完成它。

因此，我想要做的做的事情。

1 - C编写一个简单的相关函数，并将其应用在平行块（见下文）

。

2 - 块 - 在三个块（大小K * K，尺寸的右下方（PK）的（PK）和尺寸的K右上方矩形矩阵<顶部左方分裂相关矩阵/ EM>（PK））。这涵盖了所有细胞的相关矩阵科尔，因为我只需要上三角。

3 - 运行通过电话.C采用平行降雪C函数

  N = 100
P = 10
X =矩阵（RNORM（N * P），nrow = N，NCOL = P）
压改正=矩阵（0，nrow = P，NCOL = p）的列明智的均值和标准差的计算＃传递给CORR函数
亩= colMeans（X）
SD = sapply（1：昏暗（X）的[2]，函数（x）的标准差（X [中，x））＃设置子矩阵的行和列范围
K = as.integer（P / 2）RowRange =名单（）
ColRange =名单（）
RowRange [[1]] = C（0，K）的
ColRange [[1]] = C（0，K）的RowRange [[2] = C（0，K）的
ColRange [[2] = C（K，P + 1）的RowRange [[3]] = C（K，P + 1）的
ColRange [[3]] = C（K，P + 1）的＃方法1不平行
########################
＃函数来计算的相关子矩阵
BigCorr＆LT;  - 功能（X）{
  行= RowRange [[X]]
  COLS = ColRange [[X]]
  返回（.C（rCorrelationWrapper2，as.matrix（X），as.integer（点心（X）），
            as.double（亩），as.double（SD）
            as.integer（行），as.integer（COLS）
            as.matrix（科尔）））
}RES =名单（）
为（ⅰ在1：3）{
  RES [我] = BigCorr（I）
}＃方法2
########################
BigCorr＆LT;  - 功能（X）{
    行= RowRange [[X]]
    COLS = ColRange [[X]]
    dyn.load（./ rCorrelation.so）
    返回（.C（rCorrelationWrapper2，as.matrix（X），as.integer（点心（X）），
          as.double（亩），as.double（SD）
          as.integer（行），as.integer（COLS）
          as.matrix（科尔）））
}＃并行化设置
NUM_CPU = 4
库（'雪'）
sfSetMaxCPUs（）＃最大的CPU处理
sfInit（并行= TRUE，处理器= NUM_CPU）＃的init并行特效
sfExport（X，RowRange，ColRange，SD，μ，压改正）
RES = sfLapply（1：3，BigCorr）
sfStop（）

下面是我的问题：

有关方法1，它的工作原理，但不是我想要的方式。我相信，当我通过科尔矩阵，我传递一个地址和C将作出在源代码更改。

 方法1＃输出
＆GT;水库[[1]] [7]
      [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
 [1，] 1 0.1040506 -0.01003125 0.23716384 -0.088246793 0 0 0 0 0
 [2，] 0 1.0000000 -0.09795989 0.11274508 0.025754150 0 0 0 0 0
 [3] 0 0.0000000 1.00000000 0.09221441 0.052923520 0 0 0 0 0
 [4，] 0 0.0000000 0.00000000 1.00000000 -0.000449975 0 0 0 0 0
 [5,1] 0 0.0000000 0.00000000 0.00000000 1.000000000 0 0 0 0 0
 [6] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0
 [7] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0
 [8] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0
 [9] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0
[10] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0
＆GT;水库[[2]] [[7]
      [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
 [1，] 0 0 0 0 0 -0.02261175 -0.23398448 -0.02382690 -0.09668318 -0.1447913
 [2，] 0 0 0 0 0 -0.03439707 0.04580888 0.13229376 0.1354754 -0.03376527
 [3，] 0 0 0 0 0 0.10360907 -0.05490361 -0.01237932 0.08123683 -0.1657041
 [4，] 0 0 0 0 0 0.18259522 -0.23849323 -0.15928474 0.1648969 -0.05005328
 [5,1] 0 0 0 0 0 -0.01012952 -0.03482429 0.14680301 -0.1112500 0.02801333
 [6] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
 [7] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
 [8] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
 [9] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
[10] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
＆GT;水库[[3]] [[7]
      [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
 [1，] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000
 [2，] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000
 [3] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000
 [4，] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000
 [5,1] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000
 [6，] 0 0 0 0 0 1 0.03234195 -0.03488812 -0.18570151 0.14064640
 [7，] 0 0 0 0 0 0 1.00000000 0.03449697 -0.06765511 -0.15057244
 [8，] 0 0 0 0 0 0 0.00000000 1.00000000 -0.03426464 0.10030619
 [9] 0 0 0 0 0 0 0.00000000 0.00000000 1.00000000 -0.08720512
[10] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 1.00000000

但原科尔矩阵保持不变：

 ＆GT;科尔
      [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
 [1，] 0 0 0 0 0 0 0 0 0 0
 [2，] 0 0 0 0 0 0 0 0 0 0
 [3，] 0 0 0 0 0 0 0 0 0 0
 [4，] 0 0 0 0 0 0 0 0 0 0
 [5,1] 0 0 0 0 0 0 0 0 0 0
 [6，] 0 0 0 0 0 0 0 0 0 0
 [7，] 0 0 0 0 0 0 0 0 0 0
 [8，] 0 0 0 0 0 0 0 0 0 0
 [9，] 0 0 0 0 0 0 0 0 0 0
[10，] 0 0 0 0 0 0 0 0 0 0

问题1：有没有什么办法，以确保C函数在源头改变科尔的价值？我仍然可以合并这三个创建上三角相关矩阵，但我想知道，如果在源头改变是可能的。注意：这并不能帮助我实现快速相关，因为我只是运行一个循环

问题2：对于方法2，我怎么加载共享对象每个内核并行作业上的初始步骤每个核心（而不是如何我都做到了）

问题3：这是什么错误呢？我需要一些指点，我很想来调试它自己。

问题4：有400多计算矩阵1MM的相关性，在不到30秒的快速方法

当我运行方法2，我收到以下错误：

  R（6107）的malloc：***错误对象0x100664df8：为释放对象不正确的校验 - 对象被释放后，可能被修改。
***在malloc_error_break设置一个断点调试
错误反序列化（节点$ CON）：从连接读取错误

下面附

是我的单纯功能C code代表的相关性：

 的#include＆LT;＆stdio.h中GT;
＃包括LT＆;＆math.h中GT;
＃包括LT＆;＆stdlib.h中GT;
＃包括LT＆;＆STDDEF.H GT;
＃包括LT＆;＆R.H GT; //显示中的R的错误
双calcMean（双* X，INT N）;
双calcStdev（双* X，双亩，INT N）;
双calcCov（双* X，双* Y，INT N，厦门大学双，双YMU）;无效rCorrelationWrapper2（双* X，INT *黯淡，双*万亩，双SD *，为int * RowRange，为int * ColRange，双*科尔）{    INT I，J，N =暗淡[0]，P =暗淡[1];
    INT RowStart = RowRange [0]，RowEnd = RowRange [1]，ColStart = ColRange [0]，ColEnd = ColRange [1];
    双xyCov;    Rprintf（\\ n号码：％D，％D＆LT; =＆行LT;％D，％D＆LT; =＆山坳LT;％D，P，RowStart，RowEnd，ColStart，ColEnd）;    如果（RowStart == ColStart＆放大器;＆安培; RowEnd == ColEnd）{
        对于（i = RowStart; I＆LT; RowEnd;我++）{
            为（J = I; J＆LT; ColEnd; J ++）{
                Rprintf（\\ N I：％d个，J：％D，P数：％d，I，J，P）;
                xyCov = calcCov（X + I * N，X + J * N，N，亩[I]，亩[J]）;
                *（更正+ J * P + I）= xyCov /（SD [I] * SD [J]）;
            }
        }
    }其他{
        对于（i = RowStart; I＆LT; RowEnd;我++）{
            为（J = ColStart; J＆LT; ColEnd; J ++）{
                xyCov = calcCov（X + I * N，X + J * N，N，亩[I]，亩[J]）;
                *（更正+ J * P + I）= xyCov /（SD [I] * SD [J]）;
            }
        }
    }
}
//函数来计算平均双calcMean（双* X，INT N）{
    双S = 0;
    INT I;
    对于（i = 0; I＆LT; N;我++）{
        S = S + *（X + I）;
    }
    返回（S / N）;
}//函数计算标准devation双calcStdev（双* X，双亩，INT N）{
    双T，SD = 0;
    INT I;    对于（i = 0; I＆LT; N;我++）{
        T = *（X + I） - 亩;
        SD = SD + T * T;
    }
    返回（SQRT（SD /（N-1）））;
}
//函数来计算方差双calcCov（双* X，双* Y，INT N，厦门大学双，双YMU）{
    双S = 0;
    INT I;    对于（i = 0; I＆LT; N;我++）{
        S = S +（*（X + I）-xmu）*（*（Y + I）-ymu）;
    }
    返回（S /（N-1））;
}

解决方案

使用快速BLAS（通过革命R或转到BLAS），您可以计算所有这些相关性快速R中，而无需编写任何C code。
在我的第一代酷睿i7英特尔PC它需要16秒时：
N = 400; M = 1e6个电子;＃生成数据垫=矩阵（runif（M * N），N，M）; ＃启动计时器 TIC = proc.time（）; ＃中心每个变量垫垫= - rowMeans（垫）; ＃规范每个变量垫垫= /开方（rowSums（垫^ 2））; ＃相关计算 CR = tcrossprod（垫）; ＃停止计时器 TOC = proc.time（）;＃显示结果和时间展会（CR [1：4,1：4]）; 展会（TOC-TIC）
第r code上述报告以下时间：
用户系统经过 31.82 1.98 15.74
我用我的 MatrixEQTL 包这种方法。

<一href=\"http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/\">http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/
有关的R各种BLAS选项的详细信息，请访问：

<一href=\"http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/runit.html#large\">http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/runit.html#large
My project for today was to write a fast correlation routine in R using the basic skillset I have. I have to find the correlation between almost 400 variables each having almost a million observations (i.e. a matrix of size p=1MM rows & n=400 cols).

R's native correlation function takes almost 2 mins for 1MM rows and 200 observations per variable. I have not run for 400 observations per column, but my guess is it will take almost 8 mins. I have less than 30 secs to finish it.

Hence, I want to do do things.

1 - write a simple correlation function in C and apply it in blocks parallely (see below).

2 - The blocks - split the correlation matrix in three blocks (top left square of size K*K, bottom right square of size (p-K)(p-K), and top right rectangular matrix of size K(p-K)). This covers all cells in the correlation matrix corr since I only need the upper triangle.

3 - run the C function via a .C call parallely using snowfall.
n = 100 p = 10 X = matrix(rnorm(n*p), nrow=n, ncol=p) corr = matrix(0, nrow=p, ncol=p) # calculation of column-wise mean and sd to pass to corr function mu = colMeans(X) sd = sapply(1:dim(X)[2], function(x) sd(X[,x])) # setting up submatrix row and column ranges K = as.integer(p/2) RowRange = list() ColRange = list() RowRange[[1]] = c(0, K) ColRange[[1]] = c(0, K) RowRange[[2]] = c(0, K) ColRange[[2]] = c(K, p+1) RowRange[[3]] = c(K, p+1) ColRange[[3]] = c(K, p+1) # METHOD 1. NOT PARALLEL ######################## # function to calculate correlation on submatrices BigCorr <- function(x){ Rows = RowRange[[x]] Cols = ColRange[[x]] return(.C("rCorrelationWrapper2", as.matrix(X), as.integer(dim(X)), as.double(mu), as.double(sd), as.integer(Rows), as.integer(Cols), as.matrix(corr))) } res = list() for(i in 1:3){ res[[i]] = BigCorr(i) } # METHOD 2 ######################## BigCorr <- function(x){ Rows = RowRange[[x]] Cols = ColRange[[x]] dyn.load("./rCorrelation.so") return(.C("rCorrelationWrapper2", as.matrix(X), as.integer(dim(X)), as.double(mu), as.double(sd), as.integer(Rows), as.integer(Cols), as.matrix(corr))) } # parallelization setup NUM_CPU = 4 library('snowfall') sfSetMaxCPUs() # maximum cpu processing sfInit(parallel=TRUE,cpus=NUM_CPU) # init parallel procs sfExport("X", "RowRange", "ColRange", "sd", "mu", "corr") res = sfLapply(1:3, BigCorr) sfStop()
Here is my problem:

for method 1, it works, but not the way I want it to. I believed, that when I pass the corr matrix, I am passing an address and C would be making changes at source.
# Output of METHOD 1 > res[[1]][[7]] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 1 0.1040506 -0.01003125 0.23716384 -0.088246793 0 0 0 0 0 [2,] 0 1.0000000 -0.09795989 0.11274508 0.025754150 0 0 0 0 0 [3,] 0 0.0000000 1.00000000 0.09221441 0.052923520 0 0 0 0 0 [4,] 0 0.0000000 0.00000000 1.00000000 -0.000449975 0 0 0 0 0 [5,] 0 0.0000000 0.00000000 0.00000000 1.000000000 0 0 0 0 0 [6,] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0 [7,] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0 [8,] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0 [9,] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0 [10,] 0 0.0000000 0.00000000 0.00000000 0.000000000 0 0 0 0 0 > res[[2]][[7]] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 0 0 0 0 0 -0.02261175 -0.23398448 -0.02382690 -0.1447913 -0.09668318 [2,] 0 0 0 0 0 -0.03439707 0.04580888 0.13229376 0.1354754 -0.03376527 [3,] 0 0 0 0 0 0.10360907 -0.05490361 -0.01237932 -0.1657041 0.08123683 [4,] 0 0 0 0 0 0.18259522 -0.23849323 -0.15928474 0.1648969 -0.05005328 [5,] 0 0 0 0 0 -0.01012952 -0.03482429 0.14680301 -0.1112500 0.02801333 [6,] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000 [7,] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000 [8,] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000 [9,] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000 [10,] 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000 > res[[3]][[7]] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000 [2,] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000 [3,] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000 [4,] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000 [5,] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 0.00000000 [6,] 0 0 0 0 0 1 0.03234195 -0.03488812 -0.18570151 0.14064640 [7,] 0 0 0 0 0 0 1.00000000 0.03449697 -0.06765511 -0.15057244 [8,] 0 0 0 0 0 0 0.00000000 1.00000000 -0.03426464 0.10030619 [9,] 0 0 0 0 0 0 0.00000000 0.00000000 1.00000000 -0.08720512 [10,] 0 0 0 0 0 0 0.00000000 0.00000000 0.00000000 1.00000000
But the original corr matrix remains unchanged:
> corr [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 0 0 0 0 0 0 0 0 0 0 [2,] 0 0 0 0 0 0 0 0 0 0 [3,] 0 0 0 0 0 0 0 0 0 0 [4,] 0 0 0 0 0 0 0 0 0 0 [5,] 0 0 0 0 0 0 0 0 0 0 [6,] 0 0 0 0 0 0 0 0 0 0 [7,] 0 0 0 0 0 0 0 0 0 0 [8,] 0 0 0 0 0 0 0 0 0 0 [9,] 0 0 0 0 0 0 0 0 0 0 [10,] 0 0 0 0 0 0 0 0 0 0
Question #1: Is there any way to ensure that the C function changes values of corr at source? I can still merge these three to create an upper triangular correlation matrix, but I wanted to know if change at source is possible. Note: this does not help me accomplish fast correlation since I am merely running a loop.

Question #2: For METHOD 2, how do I load the shared object to each core for parallel jobs on each core at the init step (and not how I have done it)?

Question #3: What does this error mean? I need some pointers, and I would love to debug it myself.

Question #4: Is there a fast way of calculating correlation over matrices 1MM by 400, in less then 30 secs?

When I run METHOD 2, I get the following error:
R(6107) malloc: *** error for object 0x100664df8: incorrect checksum for freed object - object was probably modified after being freed. *** set a breakpoint in malloc_error_break to debug Error in unserialize(node$con) : error reading from connection
Attached below is my plain vanilla C code for correlation:
#include <stdio.h> #include <math.h> #include <stdlib.h> #include <stddef.h> #include <R.h> // to show errors in R double calcMean (double *x, int n); double calcStdev (double *x, double mu, int n); double calcCov(double *x, double *y, int n, double xmu, double ymu); void rCorrelationWrapper2 ( double *X, int *dim, double *mu, double *sd, int *RowRange, int *ColRange, double *corr) { int i, j, n = dim[0], p = dim[1]; int RowStart = RowRange[0], RowEnd = RowRange[1], ColStart = ColRange[0], ColEnd = ColRange[1]; double xyCov; Rprintf("\n p: %d, %d <= row < %d, %d <= col < %d", p, RowStart, RowEnd, ColStart, ColEnd); if(RowStart==ColStart && RowEnd==ColEnd){ for(i=RowStart; i<RowEnd; i++){ for(j=i; j<ColEnd; j++){ Rprintf("\n i: %d, j: %d, p: %d", i, j, p); xyCov = calcCov(X + i*n, X + j*n, n, mu[i], mu[j]); *(corr + j*p + i) = xyCov/(sd[i]*sd[j]); } } } else { for(i=RowStart; i<RowEnd; i++){ for (j=ColStart; j<ColEnd; j++){ xyCov = calcCov(X + i*n, X + j*n, n, mu[i], mu[j]); *(corr + j*p + i) = xyCov/(sd[i]*sd[j]); } } } } // function to calculate mean double calcMean (double *x, int n){ double s = 0; int i; for(i=0; i<n; i++){ s = s + *(x+i); } return(s/n); } // function to calculate standard devation double calcStdev (double *x, double mu, int n){ double t, sd = 0; int i; for (i=0; i<n; i++){ t = *(x + i) - mu; sd = sd + t*t; } return(sqrt(sd/(n-1))); } // function to calculate covariance double calcCov(double *x, double *y, int n, double xmu, double ymu){ double s = 0; int i; for(i=0; i<n; i++){ s = s + (*(x+i)-xmu)*(*(y+i)-ymu); } return(s/(n-1)); }

解决方案
Using a fast BLAS (via Revolution R or Goto BLAS) you can calculate all these correlations fast in R without writing any C code. On my first generation Intel i7 PC it takes 16 seconds:
n = 400; m = 1e6; # Generate data mat = matrix(runif(m*n),n,m); # Start timer tic = proc.time(); # Center each variable mat = mat - rowMeans(mat); # Standardize each variable mat = mat / sqrt(rowSums(mat^2)); # Calculate correlations cr = tcrossprod(mat); # Stop timer toc = proc.time(); # Show the results and the time show(cr[1:4,1:4]); show(toc-tic)
The R code above reports the following timing:
user system elapsed 31.82 1.98 15.74
I use this approach in my MatrixEQTL package.
http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/

More information about various BLAS options for R is available here:
http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/runit.html#large

这篇关于使用C和并行R中快速相关的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

使用C和并行R中快速相关 [英] Fast correlation in R using C and parallelization

问题描述

相关文章

C/C++最新文章

热门教程

热门工具

登录关闭

使用C和并行R中快速相关 [英] Fast correlation in R using C and parallelization

问题描述

相关文章

C/C++最新文章

热门教程

热门工具

登录 关闭

登录关闭