Distributed cross-correlation matrix computation
Question
How can I compute the Pearson cross-correlation matrix of a large (>10 TB) data set, ideally in a distributed manner? Any suggestions for an efficient distributed algorithm would be appreciated.
Update: I read the correlation implementation in Apache Spark MLlib.
Pearson computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
Covariance computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
However, it looks to me as though all of the computation happens on one node, so it is not distributed in any real sense.
Please shed some light on this. I also tried running it on a 3-node Spark cluster; the screenshots are below:
As you can see from the second image, the data is pulled onto one node and the computation is then done there. Am I right?
Recommended answer
To start with, have a look at this to check whether things are working as expected. You can then refer to any of these implementations: MPI/OpenMP: Agomezl or Meismyles; MapReduce: Vangjee or Seawolf42. It would also be worth reading this before you proceed. On a different note, James's thesis provides some pointers if you are interested in computing correlations that are robust to outliers.
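To make the map/reduce idea concrete, here is a minimal sketch in plain Python. It is not taken from any of the implementations named above. It assumes the data is split into row partitions and that the number of columns d is small enough for a d x d matrix to fit on one node. Each partition is summarized by its row count, column sums, and sum of outer products; those summaries add up across partitions, so only the small summaries need to reach a single node. The names `partial_stats`, `merge`, and `correlation` are hypothetical.

```python
from functools import reduce
from math import sqrt

def partial_stats(rows):
    """Map step: summarize one partition as (n, column sums, sum of outer products)."""
    d = len(rows[0])
    n = 0
    s = [0.0] * d
    g = [[0.0] * d for _ in range(d)]
    for row in rows:
        n += 1
        for i in range(d):
            s[i] += row[i]
            for j in range(d):
                g[i][j] += row[i] * row[j]
    return n, s, g

def merge(a, b):
    """Reduce step: two partial summaries combine by element-wise addition."""
    n_a, s_a, g_a = a
    n_b, s_b, g_b = b
    s = [x + y for x, y in zip(s_a, s_b)]
    g = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(g_a, g_b)]
    return n_a + n_b, s, g

def correlation(stats):
    """Final step on one node: Pearson matrix from the merged summary."""
    n, s, g = stats
    d = len(s)
    # c[i][j] = sum(x_i * x_j) - sum(x_i) * sum(x_j) / n, i.e. n times the covariance
    c = [[g[i][j] - s[i] * s[j] / n for j in range(d)] for i in range(d)]
    return [[c[i][j] / sqrt(c[i][i] * c[j][j]) for j in range(d)] for i in range(d)]

# Two made-up partitions of a 5 x 3 data set (the second column is twice the first).
partitions = [
    [[1.0, 2.0, 9.0], [2.0, 4.0, 7.0]],
    [[3.0, 6.0, 8.0], [4.0, 8.0, 2.0], [5.0, 10.0, 1.0]],
]
corr = correlation(reduce(merge, map(partial_stats, partitions)))
```

In a real cluster, `partial_stats` would run on the workers and `merge` would be the combine function of the framework's reduce or aggregate step. Note that this one-pass formula can lose precision when the column means are large relative to the spread; centering the data first, at the cost of a second pass, is one way to reduce that risk.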