Large distance matrix in clustering


Question

I am running R 3.2.3 on a machine with 16 GB of RAM. I have a large matrix of 300,000 rows x 12 columns. I want to use a hierarchical clustering algorithm in R, so I first need to create a distance matrix. Since the data is of mixed type, I compute a separate distance matrix for each type of column. I get an error about memory allocation:

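library(geosphere)  # distm(); assumed source package
library(e1071)      # hamming.distance(); assumed source package
library(magrittr)   # %>%

df <- as.data.frame(matrix(rnorm(36 * 10^5), nrow = 3 * 10^5))
d1 <- as.dist(distm(df[, 1:2]) / 10^5)
d2 <- dist(df[, 3:8], method = "euclidean")
d3 <- hamming.distance(df[, 9:12] %>% as.matrix()) %>% as.dist()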

I get the following errors:

> d1=as.dist(distm(df1[,c(1:2)])/10^5)
Error: cannot allocate vector of size 670.6 Gb
In addition: Warning messages:
1: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
2: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
3: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
4: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
> d2=dist(df1[,c(3:8)], method = "euclidean") 
Error: cannot allocate vector of size 335.3 Gb
In addition: Warning messages:
1: In dist(df1[, c(3:8)], method = "euclidean") :
 Reached total allocation of 16070Mb: see help(memory.size)
2: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
3: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
4: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
> d3= hamming.distance(df1[,c(9:12)]%>%as.matrix(.))%>%as.dist(.)
Error: cannot allocate vector of size 670.6 Gb
In addition: Warning messages:
1: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
2: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
3: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
4: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)

Recommended answer

To keep it simple, let's assume you have one row (A) to cluster against a matrix (B) of 3 × 10^8 rows by minimum distance.

The naive approach is:

1. Load A and B.
2. Compute the distance between A and each row of B.
3. Select the smallest of the results (reduction).

But because B is really large, you either can't load it into memory at all or the computation errors out during execution.
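For a sense of scale, the sizes reported in the question's error messages can be reproduced with a quick calculation. This is only a sketch, assuming 8-byte doubles and the 300,000 rows from the question; the variable names are made up for illustration.

```r
n <- 3e5               # number of rows in the question's data
bytes_per_double <- 8  # assumption: distances stored as doubles

# Full n x n matrix, as allocated by matrix(0, ncol = n, nrow = n)
full_gib <- n * n * bytes_per_double / 2^30

# Lower triangle only, n * (n - 1) / 2 entries, as stored by dist()
lower_gib <- n * (n - 1) / 2 * bytes_per_double / 2^30

cat(sprintf("full: %.1f GiB, lower triangle: %.1f GiB\n", full_gib, lower_gib))
# prints: full: 670.6 GiB, lower triangle: 335.3 GiB
```

Both figures match the "cannot allocate vector" messages above, and both are far beyond 16 GB, so no setting of the memory limit will make the full distance matrix fit.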

The batch approach is:

1. Load A (assumed to be small).
2. Load B.partial, the first 10^5 rows of B.
3. Compute the distance between A and each row of B.partial.
4. Select the minimum of these partial results and save it as res[i].
5. Go back to step 2 and load the next 10^5 rows of B.
6. At the end you have 3000 partial results, saved in res[1:3000].
7. Reduction: select the minimum from res[1:3000].
   Note: if you need all the distances, as the `dist` function returns, skip the reduction and keep the full array of distances.
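The batch steps above could be sketched in R roughly as follows. This is a minimal illustration rather than a finished solution: `batch_min_dist` and `get_batch` are hypothetical names, Euclidean distance is assumed, and the toy matrix lives in memory, whereas in practice `get_batch` would read each slice of B from disk or a database.

```r
# Hypothetical helper: find the row of B nearest to vector a,
# scanning B one batch at a time so that B never has to be held whole.
batch_min_dist <- function(a, get_batch, n_rows, batch_size = 1e5) {
  starts <- seq(1, n_rows, by = batch_size)
  res <- numeric(length(starts))  # one partial minimum per batch
  idx <- numeric(length(starts))  # row index of each partial minimum
  for (k in seq_along(starts)) {
    rows <- starts[k]:min(starts[k] + batch_size - 1, n_rows)
    B_part <- get_batch(rows)                   # load only this slice of B
    d <- sqrt(rowSums(sweep(B_part, 2, a)^2))   # Euclidean distance to a
    j <- which.min(d)
    res[k] <- d[j]
    idx[k] <- rows[j]
  }
  best <- which.min(res)                        # final reduction
  list(distance = res[best], row = idx[best], partial = res)
}

# Toy usage: B is small here purely so the example runs quickly.
set.seed(1)
B <- matrix(rnorm(1000 * 6), ncol = 6)
a <- rnorm(6)
out <- batch_min_dist(a, function(rows) B[rows, , drop = FALSE],
                      n_rows = nrow(B), batch_size = 100)
```

Peak memory is governed by `batch_size` rather than by the number of rows in B. To keep every distance instead of only the minimum, each batch's `d` would have to be written out (for example to a file) rather than reduced.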

The code will be a little more complicated than the original, but this is a very common trick when dealing with big-data problems. For the computational part, you can refer to one of my previous answers.

I would appreciate it if you could post your final batch-mode code here, so that others can learn from it as well.

Another interesting thing about dist is that it is one of the few functions in the R packages that support OpenMP. See the source code (excerpt below) and the instructions on how to compile R with OpenMP.

So, if you set OMP_NUM_THREADS to 4 or 8, depending on your machine, and then run it again, you may see a substantial performance improvement.

void R_distance(double *x, int *nr, int *nc, double *d, int *diag,
                int *method, double *p)
{
    int dc, i, j;
    size_t ij;  /* can exceed 2^31 - 1 */
    double (*distfun)(double*, int, int, int, int) = NULL;
#ifdef _OPENMP
    int nthreads;
#endif
    .....
}
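To check whether the thread setting makes a difference on your machine, one option is to time dist on a small sample under different values of OMP_NUM_THREADS. The sketch below is only an assumption about how you might measure it: it presumes the variable is set in the shell before R starts, and whether your R build was compiled with OpenMP at all is something you would need to verify separately.

```r
# Run from a shell as, for example:
#   OMP_NUM_THREADS=4 Rscript time_dist.R   (time_dist.R is a hypothetical file name)
threads <- Sys.getenv("OMP_NUM_THREADS", unset = NA)

set.seed(42)
x <- matrix(rnorm(2000 * 6), ncol = 6)  # small sample that fits in memory

# Time a single call to dist() on the sample
elapsed <- system.time(d <- dist(x, method = "euclidean"))[["elapsed"]]

cat("OMP_NUM_THREADS:", threads, " elapsed (s):", elapsed, "\n")
```

Comparing the elapsed time across a few thread counts shows whether the setting has any effect before you commit to a long run. Note that extra threads only speed up the computation; they do not reduce the memory the full distance matrix needs.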

Furthermore, if you want to accelerate dist on a GPU, you can refer to the talk section of ParallelR.
