How does Spark keep track of the splits in randomSplit?


Question

This question, How does Sparks RDD.randomSplit actually split the RDD, explains how Spark's random split works, but I don't understand how Spark keeps track of which values went to one split so that those same values don't go to the second split.

If we look at the implementation of randomSplit:

def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
  // It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
  // constituent partitions each time a split is materialized which could result in
  // overlapping splits. To prevent this, we explicitly sort each input partition to make the
  // ordering deterministic.
  val sorted = Sort(logicalPlan.output.map(SortOrder(_, Ascending)), global = false, logicalPlan)
  val sum = weights.sum
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  normalizedCumWeights.sliding(2).map { x =>
    new DataFrame(sqlContext, Sample(x(0), x(1), withReplacement = false, seed, sorted))
  }.toArray
}

we can see that it creates two DataFrames that share the same sqlContext, each with a different Sample operator.

How do these two DataFrames communicate with each other so that a value that fell in the first one is not included in the second one?

And is the data being fetched twice? (Assuming the sqlContext is selecting from a DB, is the select executed twice?)

Answer

It's exactly the same as sampling an RDD.

Assuming you have the weight array (0.6, 0.2, 0.2), Spark will generate one DataFrame for each of the ranges (0.0, 0.6), (0.6, 0.8), (0.8, 1.0).
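The range construction mirrors the `scanLeft` / `sliding(2)` lines in the Scala source above; a minimal Python sketch of the same arithmetic (not Spark's actual code):

```python
weights = [0.6, 0.2, 0.2]
total = sum(weights)

# scanLeft(0.0)(_ + _): running sum of the normalized weights
cum = [0.0]
for w in weights:
    cum.append(cum[-1] + w / total)

# sliding(2): adjacent pairs of the running sums become the per-split ranges
ranges = list(zip(cum, cum[1:]))
# approximately [(0.0, 0.6), (0.6, 0.8), (0.8, 1.0)]
```

Because the ranges are built from one cumulative sum, they tile the interval [0, 1) with no gaps and no overlaps, which is what makes disjoint splits possible.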

When it's time to read a result DataFrame, Spark just goes over the parent DataFrame. For each item it generates a random number, and if that number falls in the split's range, the item is emitted. All child DataFrames share the same random number generator (technically, different generators with the same seed), so the sequence of random numbers is deterministic.
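A hedged sketch of that mechanism (Spark actually uses an XORShiftRandom generator seeded per partition; Python's `random.Random` stands in here only to show why replaying the same seed makes the splits disjoint):

```python
import random

def split_rows(rows, ranges, seed):
    # Each child split replays the SAME seeded generator over the parent rows,
    # so row i always sees the same draw and lands in exactly one range.
    splits = []
    for lo, hi in ranges:
        rng = random.Random(seed)  # same seed for every child split
        splits.append([r for r in rows if lo <= rng.random() < hi])
    return splits

rows = list(range(10))
splits = split_rows(rows, [(0.0, 0.6), (0.6, 0.8), (0.8, 1.0)], seed=42)
# together the splits cover every row, and no row appears twice
```

So the two DataFrames never communicate at runtime; disjointness falls out of the deterministic draw per row plus non-overlapping ranges.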

For your last question, if you did not cache the parent DataFrame, then the data for the input DataFrame will be re-fetched each time an output DataFrame is computed.
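That recompute-versus-cache behavior can be illustrated without Spark; the `fetch_rows` function below is a hypothetical stand-in for the SELECT behind the parent DataFrame:

```python
fetch_count = 0

def fetch_rows():
    # hypothetical stand-in for the DB select behind the parent DataFrame
    global fetch_count
    fetch_count += 1
    return list(range(100))

# no cache: each output split re-materializes the parent, re-running the fetch
split_a = [x for x in fetch_rows() if x % 2 == 0]
split_b = [x for x in fetch_rows() if x % 2 == 1]
assert fetch_count == 2  # one fetch per split

# "cached": fetch once up front, then both splits read the stored rows
cached = fetch_rows()
split_a = [x for x in cached if x % 2 == 0]
split_b = [x for x in cached if x % 2 == 1]
assert fetch_count == 3  # only one extra fetch served both splits
```

In actual Spark code the equivalent is calling `df.cache()` (or `persist()`) on the parent before `randomSplit`, so the source is scanned once instead of once per output split.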

