Why is Spark Mllib KMeans algorithm extremely slow?

Question

I'm having the same problem as in this post, but I don't have enough points to add a comment there. My dataset has 1 million rows and 100 columns. I'm also using MLlib KMeans and it is extremely slow. In fact, the job never finishes and I have to kill it. I am running this on Google Cloud (Dataproc). It runs if I ask for a smaller number of clusters (k=1000), but it still takes more than 35 minutes. I need it to run for k~5000. I have no idea why it is so slow. The data is properly partitioned given the number of workers/nodes, and SVD on a 1 million x ~300,000 column matrix takes ~3 minutes, but when it comes to KMeans it just goes into a black hole. I am now trying a lower number of iterations (2 instead of 100), but I feel something is wrong somewhere.

KMeansModel Cs = KMeans.train(datamatrix, k, 100); // 100 iterations, changed to 2 now; number of clusters k = 1000 or 5000

Answer

It looks like the reason is relatively simple: you use a quite large k and combine it with an expensive initialization algorithm.

By default, Spark uses a distributed variant of K-means++ called K-means|| (see What exactly is the initializationSteps parameter in Kmeans++ in Spark MLLib?). The distributed version is roughly O(k), so with a larger k you can expect a slower start. This should explain why you see no improvement when you reduce the number of iterations.
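
If you do keep k-means||, the number of seeding passes is controlled by the initializationSteps setter on the mllib KMeans builder. A minimal sketch (assuming datamatrix is an RDD<Vector> as in the question's snippet; the concrete values are only illustrative):

import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;

// Keep the parallel K-means++ ("k-means||") seeding, but bound how many passes it makes.
// datamatrix is assumed to be an RDD<Vector>, as in the question's snippet.
KMeansModel model = new KMeans()
    .setK(1000)
    .setMaxIterations(20)
    .setInitializationMode("k-means||")  // the default initialization mode
    .setInitializationSteps(2)           // number of k-means|| passes; each pass scans the data
    .setSeed(1L)
    .run(datamatrix);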

Using a large k is also expensive when the model is trained. Spark uses a variant of Lloyd's algorithm, which is roughly O(nkdi), where n is the number of points, k the number of clusters, d the dimensionality, and i the number of iterations.
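
To get a feel for that O(nkdi) term with the numbers from the question (purely a back-of-the-envelope illustration, not a benchmark):

// Rough count of distance evaluations in Lloyd's iterations: n * k * d * i
long n = 1_000_000L; // rows (points)
long k = 5_000L;     // requested clusters
long d = 100L;       // columns (dimensions)
long i = 100L;       // iterations in the original call
long ops = n * k * d * i; // = 5 * 10^13 elementary operations
System.out.println(ops);  // 50000000000000

Cutting k back to 1000 or lowering the iteration count only shrinks this linearly, which is consistent with the k=1000 run finishing but still taking over half an hour.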

If you expect complex structure in the data, there are most likely better algorithms out there to handle this than K-Means, but if you really want to stick with it, you can start by using random initialization.
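
As a minimal sketch of that last suggestion (again assuming datamatrix is an RDD<Vector>; the parameter values are only placeholders), the initialization mode can be switched to plain random seeding, which avoids the O(k) k-means|| start-up cost:

import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;

// Random seeding instead of the default "k-means||" initialization.
// datamatrix is assumed to be an RDD<Vector>, as in the question's snippet.
KMeans kmeans = new KMeans()
    .setK(5000)
    .setMaxIterations(20)            // far fewer Lloyd's iterations than the original 100
    .setInitializationMode("random") // skip the expensive parallel K-means++ seeding
    .setSeed(42L);                   // fixed seed so runs are reproducible
KMeansModel model = kmeans.run(datamatrix);

The trade-off is that random seeding typically needs more Lloyd's iterations or yields slightly worse clusters than K-means++ style seeding, so this is a speed-for-quality trade rather than a free win.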
