Is Spark's KMeans unable to handle bigdata?


Problem description

KMeans has several parameters for its training, with the initialization mode defaulting to kmeans||. The problem is that it moves quickly (in under 10 minutes) through the first 13 stages, but then hangs completely, without ever yielding an error!

Minimal example that reproduces the issue (it succeeds if I use 1000 points or random initialization):

from pyspark.context import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs


if __name__ == "__main__":
    sc = SparkContext(appName='kmeansMinimalExample')

    # 10,000,000 random 64-dimensional points (same behaviour with 10000 points)
    data = RandomRDDs.uniformVectorRDD(sc, 10000000, 64)
    C = KMeans.train(data, 8192, maxIterations=10)

    sc.stop()
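For reference, this is a sketch of the two variants noted above that do complete; the variable name small is introduced here for illustration, and initializationMode is the standard KMeans.train parameter for choosing between k-means|| and random seeding:

# succeeds: far fewer points
small = RandomRDDs.uniformVectorRDD(sc, 1000, 64)
C = KMeans.train(small, 8192, maxIterations=10)

# succeeds: skip k-means|| and seed the centers randomly
C = KMeans.train(data, 8192, maxIterations=10, initializationMode='random')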

The job does nothing (it doesn't succeed, fail, or make progress). There are no active or failed tasks in the Executors tab, and the stdout and stderr logs don't contain anything particularly interesting.

If I use k=81 instead of 8192, it succeeds.

Note that the two calls to takeSample() should not be the issue, since takeSample() was also called twice in the random-initialization case, which succeeded.

So, what is happening? Is Spark's KMeans unable to scale? Does anybody know? Can you reproduce it?

If it were a memory issue, I would be getting warnings and errors, as I had before.

Note: placeybordeaux's comments are based on running the job in client mode, where the driver's configuration is invalidated, causing exit code 143 and the like (see the edit history), not in cluster mode, where no error is reported at all and the application simply hangs.

From zero323: Why is Spark Mllib KMeans algorithm extremely slow? is related, but I think he observed some progress there, whereas mine hangs; I did leave a comment...

Answer

I think the 'hanging' is because your executors keep dying. As I mentioned in a side conversation, this code runs fine for me, locally and on a cluster, in both Pyspark and Scala. However, it takes far longer than it should, and almost all of that time is spent in k-means|| initialization.

I opened https://issues.apache.org/jira/browse/SPARK-17389 to track two main improvements, one of which you can use now. See also https://issues.apache.org/jira/browse/SPARK-11560

First, there are some code optimizations that would speed up the init by about 13%.

However, most of the issue is that it defaults to 5 steps of k-means|| init, when it seems that 2 is almost always just as good. You can set the initialization steps to 2 to see a speedup, especially in the stage that's hanging now, as sketched below.
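A minimal sketch of that suggestion, passing the initializationSteps parameter of pyspark.mllib's KMeans.train to the same call as in the question:

# fewer k-means|| initialization rounds: 2 seems almost always as good as the default 5
C = KMeans.train(data, 8192, maxIterations=10, initializationSteps=2)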

In my (smaller) test on my laptop, init time went from 5:54 to 1:41 with both changes, mostly due to setting the init steps.

