Why is Spark MLlib KMeans algorithm extremely slow?

Question

I'm having the same problem as in this post, but I don't have enough points to add a comment there. My dataset has 1 million rows and 100 columns. I'm also using MLlib KMeans, and it is extremely slow. In fact the job never finishes and I have to kill it. I am running this on Google Cloud (Dataproc). It runs if I ask for a smaller number of clusters (k=1000), but it still takes more than 35 minutes. I need it to run for k~5000. I have no idea why it is so slow. The data is properly partitioned given the number of workers/nodes, and an SVD on a 1 million x ~300,000 column matrix takes ~3 minutes, but when it comes to KMeans it just goes into a black hole. I am now trying a lower number of iterations (2 instead of 100), but I feel something is wrong somewhere.

KMeansModel Cs = KMeans.train(datamatrix, k, 100); // maxIterations = 100, now changed to 2; number of clusters k = 1000 or 5000

Answer

It looks like the reason is relatively simple. You use quite a large k and combine it with an expensive initialization algorithm.

By default Spark uses a distributed variant of K-means++ called K-means|| (see What exactly is the initializationSteps parameter in Kmeans++ in Spark MLLib?). The distributed version is roughly O(k), so with a larger k you can expect a slower start. This should explain why you see no improvement when you reduce the number of iterations.
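As an illustration, here is a sketch of how that knob can be turned down through the builder API instead of the static KMeans.train helper (it assumes datamatrix is the RDD<Vector> from the question):

import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;

// Each K-means|| step is another pass over the data whose cost grows
// with k, so fewer initialization steps mean a cheaper start for large k.
KMeansModel model = new KMeans()
    .setK(5000)
    .setMaxIterations(2)
    .setInitializationSteps(2) // the default was 5 before Spark 2.0
    .run(datamatrix);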

Using a large k is also expensive when the model is trained. Spark uses a variant of Lloyd's algorithm, which is roughly O(nkdi), where n is the number of rows, d the number of columns, and i the number of iterations.
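Plugging the question's own numbers into that bound gives a feel for the scale (a back-of-the-envelope sketch, not a benchmark):

// O(n*k*d*i): distance-component operations for one full training run.
long n = 1_000_000L; // rows
long k = 5_000L;     // clusters requested
long d = 100L;       // columns
long i = 100L;       // iterations
// ~5e13 operations, five times the k=1000 run that already took 35+ minutes.
System.out.println(n * k * d * i);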

If you expect a complex structure in the data, there are most likely better algorithms out there than K-Means to handle it, but if you really want to stick with K-Means, you can start with using random initialization.
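A minimal sketch of that suggestion, mirroring the block above ("random" is the mode string accepted by MLlib's setInitializationMode; datamatrix is again assumed to be the question's RDD<Vector>):

import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;

// Random initialization skips the K-means|| start-up entirely; the initial
// centers are worse, but each Lloyd iteration still refines them.
KMeansModel model = new KMeans()
    .setK(5000)
    .setMaxIterations(10)
    .setInitializationMode("random")
    .setSeed(42L) // fix the seed so runs are comparable
    .run(datamatrix);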
