Is Spark's KMeans unable to handle bigdata?

Problem description

KMeans has several training parameters, with the initialization mode defaulting to k-means||. The problem is that the job marches quickly (less than 10 min) through the first 13 stages, but then hangs completely, without yielding an error!

Minimal example that reproduces the issue (it succeeds if I use 1000 points or random initialization):

from pyspark.context import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs


if __name__ == "__main__":
    sc = SparkContext(appName='kmeansMinimalExample')

    # same with 10000 points
    data = RandomRDDs.uniformVectorRDD(sc, 10000000, 64)
    C = KMeans.train(data, 8192, maxIterations=10)

    sc.stop()

The job does nothing (it doesn't succeed, fail, or make progress). There are no active or failed tasks in the Executors tab, and the stdout and stderr logs don't contain anything particularly interesting.

If I use k=81 instead of 8192, it succeeds.
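For reference, here is a minimal sketch of the two variants that do complete (a smaller k, or random initialization), reusing data and KMeans from the snippet above; the keyword names are those of pyspark.mllib.clustering.KMeans.train:

# Sketch only: a much smaller k with the default k-means|| initialization ...
C_small_k = KMeans.train(data, 81, maxIterations=10)

# ... or the original k with random initialization instead of k-means||
C_random = KMeans.train(data, 8192, maxIterations=10,
                        initializationMode="random")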

Notice that the two calls of takeSample() should not be an issue, since takeSample() was also called twice in the random-initialization case.

So, what is happening? Is Spark's KMeans unable to scale? Does anybody know? Can you reproduce it?


If it were a memory issue, I would get warnings and errors, as I had before.

Note: placeybordeaux's comments are based on executing the job in client mode, where the driver's configurations are invalidated, causing exit code 143 and the like (see the edit history), not in cluster mode, where no error is reported at all and the application simply hangs.


From zero323: "Why is Spark Mllib KMeans algorithm extremely slow?" is related, but I think he sees some progress there, while mine hangs; I did leave a comment...

Solution

I think the 'hanging' is because your executors keep dying. As I mentioned in a side conversation, this code runs fine for me, locally and on a cluster, in Pyspark and Scala. However, it takes a lot longer than it should. It is almost all time spent in k-means|| initialization.

I opened https://issues.apache.org/jira/browse/SPARK-17389 to track two main improvements, one of which you can use now. Edit: really, see also https://issues.apache.org/jira/browse/SPARK-11560

First, there are some code optimizations that would speed up the init by about 13%.

However, most of the issue is that it defaults to 5 steps of k-means|| init, when it seems that 2 is almost always just as good. You can set the initialization steps to 2 to see a speedup, especially in the stage that's hanging now.
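As a sketch, assuming the same data and call as in the question, lowering the step count looks like this (initializationSteps is a keyword argument of pyspark.mllib.clustering.KMeans.train, whose default was 5 at the time):

# Limit k-means|| to 2 initialization steps instead of the default 5
C = KMeans.train(data, 8192, maxIterations=10, initializationSteps=2)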

In my (smaller) test on my laptop, init time went from 5:54 to 1:41 with both changes, mostly due to setting init steps.
