Is Spark's KMeans unable to handle bigdata?


Problem description

KMeans has several parameters for its training, with the initialization mode defaulting to kmeans||. The problem is that it moves quickly (in under 10 minutes) through the first 13 stages, but then hangs completely, without ever yielding an error!

Minimal example that reproduces the issue (it succeeds if I use 1000 points or random initialization):

from pyspark.context import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs


if __name__ == "__main__":
    sc = SparkContext(appName='kmeansMinimalExample')

    # 10,000,000 random 64-dimensional points (same behaviour with 10000 points)
    data = RandomRDDs.uniformVectorRDD(sc, 10000000, 64)
    C = KMeans.train(data, 8192, maxIterations=10)

    sc.stop()
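For reference, this is a sketch of the two variants noted above that do complete; the variable name small is introduced here for illustration, and initializationMode is the standard KMeans.train parameter for choosing between k-means|| and random seeding:

# succeeds: far fewer points
small = RandomRDDs.uniformVectorRDD(sc, 1000, 64)
C = KMeans.train(small, 8192, maxIterations=10)

# succeeds: skip k-means|| and seed the centers randomly
C = KMeans.train(data, 8192, maxIterations=10, initializationMode='random')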

The job does nothing (it doesn't succeed, fail, or make progress). There are no active or failed tasks in the Executors tab, and the stdout and stderr logs don't contain anything particularly interesting.

If I use k=81 instead of 8192, it succeeds.

Note that the two calls to takeSample() should not be the issue, since takeSample() was also called twice in the random-initialization case, which succeeded.

So, what is happening? Is Spark's KMeans unable to scale? Does anybody know? Can you reproduce it?

If it were a memory issue, I would be getting warnings and errors, as I had before.

Note: placeybordeaux's comments are based on running the job in client mode, where the driver's configuration is invalidated, causing exit code 143 and the like (see the edit history), not in cluster mode, where no error is reported at all and the application simply hangs.

From zero323: Why is Spark Mllib KMeans algorithm extremely slow? is related, but I think he observed some progress there, whereas mine hangs; I did leave a comment...

Answer

I think the 'hanging' is because your executors keep dying. As I mentioned in a side conversation, this code runs fine for me, locally and on a cluster, in both Pyspark and Scala. However, it takes far longer than it should, and almost all of that time is spent in k-means|| initialization.

I opened https://issues.apache.org/jira/browse/SPARK-17389 to track two main improvements, one of which you can use now. See also https://issues.apache.org/jira/browse/SPARK-11560

First, there are some code optimizations that would speed up the init by about 13%.

However, most of the issue is that it defaults to 5 steps of k-means|| init, when it seems that 2 is almost always just as good. You can set the initialization steps to 2 to see a speedup, especially in the stage that's hanging now, as sketched below.
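A minimal sketch of that suggestion, passing the initializationSteps parameter of pyspark.mllib's KMeans.train to the same call as in the question:

# fewer k-means|| initialization rounds: 2 seems almost always as good as the default 5
C = KMeans.train(data, 8192, maxIterations=10, initializationSteps=2)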

In my (smaller) test on my laptop, init time went from 5:54 to 1:41 with both changes, mostly due to setting the init steps.

