Spark::KMeans calls takeSample() twice?


Question

I have a lot of data, and I have experimented with partitions of cardinality [20k, 200k+].

I call it like this:

from pyspark.mllib.clustering import KMeans, KMeansModel
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)

and the monitor shows two takeSample() jobs for each KMeans() call (screenshot not preserved).

The takeSample() implementation doesn't seem to call itself or anything like that, so I would expect KMeans() to call takeSample() once. So why does the monitor show two takeSample()s per KMeans()?

Note: I executed more KMeans() runs and they all invoke two takeSample()s, regardless of whether the data is .cache()'d or not.

Moreover, the number of partitions doesn't affect the number of times takeSample() is called; it is constant at 2.

I am using Spark 1.6.2 (and I cannot upgrade) and my application is in Python, if that matters!

I brought this to the mailing list of the Spark devs, so I am updating:

Details of the 1st takeSample(): (screenshot not preserved)

Details of the 2nd takeSample(): (screenshot not preserved)

where one can see that the same code is executed.

Recommended answer

As suggested by Shivaram Venkataraman on the Spark mailing list:

I think takeSample itself runs multiple jobs if the amount of samples collected in the first pass is not enough. The comment and code path at GitHub should explain when this happens. Also you can confirm this by checking if the logWarning shows up in your logs.

// If the first sample didn't turn out large enough, keep trying to take samples;
// this shouldn't happen often because we use a big multiplier for the initial size
var numIters = 0
while (samples.length < num) {
  logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
  samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  numIters += 1
}
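To make the retry logic above easier to experiment with outside Spark, here is a minimal pure-Python sketch of the same idea: draw an oversampled Bernoulli sample, and repeat the pass if it comes back with fewer than the requested number of items. The function name `take_sample_sketch`, the `oversample` multiplier and its default value are assumptions made for illustration; this is not Spark's implementation, and each "pass" only stands in for what would be a separate job in Spark.

```python
import random


def take_sample_sketch(data, num, oversample=3.0, seed=0):
    """Return (samples, passes): `num` distinct items from `data` and the
    number of sampling passes it took (a stand-in for Spark jobs).

    Hypothetical sketch; the oversampling rule is an assumption.
    """
    rng = random.Random(seed)
    if num <= 0 or not data:
        return [], 0
    if num >= len(data):
        # Nothing to sample: return everything in random order.
        shuffled = list(data)
        rng.shuffle(shuffled)
        return shuffled, 0

    # Aim for more than `num` items per pass so a retry is rarely needed.
    fraction = min(1.0, oversample * num / len(data))

    def sample_pass():
        # Bernoulli sampling: keep each element with probability `fraction`.
        return [x for x in data if rng.random() < fraction]

    samples = sample_pass()
    passes = 1
    # Same shape as the loop above: re-sample while the sample is too small.
    while len(samples) < num:
        samples = sample_pass()
        passes += 1

    rng.shuffle(samples)
    return samples[:num], passes


if __name__ == "__main__":
    samples, passes = take_sample_sketch(list(range(10000)), 100, seed=42)
    print(len(samples), passes)
```

With a generous multiplier the loop body rarely runs in this sketch, which is consistent with the source comment; lowering `oversample` toward 1.0 makes extra passes common.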

However, as one can see, the second comment says this shouldn't happen often, yet it always happens to me, so if anyone has another idea, please let me know.

It was also suggested that this was a problem with the UI and that takeSample() was actually called only once, but that was just hot air.
