Distributed cross-correlation matrix computation


Question

How can I calculate the Pearson cross-correlation matrix of a large (>10 TB) data set, ideally in a distributed manner? Any suggestions for an efficient distributed algorithm would be appreciated.

Update: I read the correlation implementation in Apache Spark MLlib:

Pearson computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
Covariance Computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala

However, it looks to me as though all of the computation happens on one node, so it is not distributed in any real sense.

Please shed some light on this. I also tried running it on a 3-node Spark cluster; the screenshots are below:


As you can see from the second image, the data is pulled onto one node and the computation is then done there. Am I right?

Recommended answer

To start with, have a look at this to see if things are going right. You may then refer to any of these implementations: MPI/OpenMP: Agomezl or Meismyles; MapReduce: Vangjee or Seawolf42. It would also be interesting to read this before you proceed. On a different note, James's thesis provides some pointers if you are interested in computing correlations that are robust to outliers.
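
As a rough illustration of the map/reduce style of computation mentioned above, here is a minimal sketch in plain Python. It is not taken from any of the listed implementations or from Spark; the function names (`partial_stats`, `merge_stats`, `correlation_from_stats`) and the toy data are hypothetical. It assumes the data has n rows and d columns, with d small enough that a d × d matrix fits on one machine. Each partition is reduced to three additive summaries (row count, column sums, sum of outer products), so only those small summaries need to be moved between nodes, and the final d × d step can reasonably run on a single node.

```python
from functools import reduce
from math import sqrt


def partial_stats(rows):
    """Map step: summarize one partition as (n, column sums, sum of outer products)."""
    d = len(rows[0])
    n = 0
    s = [0.0] * d
    g = [[0.0] * d for _ in range(d)]
    for row in rows:
        n += 1
        for i in range(d):
            s[i] += row[i]
            for j in range(d):
                g[i][j] += row[i] * row[j]
    return n, s, g


def merge_stats(a, b):
    """Reduce step: the summaries are additive, so partitions can merge in any order."""
    n_a, s_a, g_a = a
    n_b, s_b, g_b = b
    s = [x + y for x, y in zip(s_a, s_b)]
    g = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(g_a, g_b)]
    return n_a + n_b, s, g


def correlation_from_stats(stats):
    """Final step on one node: turn the merged d x d summaries into Pearson correlations.

    Assumes no column is constant (a zero variance would divide by zero).
    """
    n, s, g = stats
    d = len(s)
    # n times the covariance; the 1/n factors cancel in the correlation ratio.
    cov = [[g[i][j] - s[i] * s[j] / n for j in range(d)] for i in range(d)]
    std = [sqrt(cov[i][i]) for i in range(d)]
    return [[cov[i][j] / (std[i] * std[j]) for j in range(d)] for i in range(d)]


# Toy data: two "partitions" of a 4-row, 3-column data set (hypothetical values).
partitions = [
    [[1.0, 2.0, 3.0], [2.0, 4.0, 1.0]],
    [[3.0, 6.0, 2.0], [4.0, 8.0, 5.0]],
]

merged = reduce(merge_stats, map(partial_stats, partitions))
corr = correlation_from_stats(merged)
```

In this toy data the second column is exactly twice the first, so `corr[0][1]` comes out as 1 up to floating-point rounding. Note that the sum-of-products formula can lose precision when column means are large relative to the spread of the values; centering the columns first, or using a pairwise update of means and co-moments, is a common remedy.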
