How does Spark keep track of the splits in randomSplit?


Question

This question, How does Sparks RDD.randomSplit actually split the RDD, explains how Spark's random split works, but I don't understand how Spark keeps track of which values went to one split so that those same values don't go to the second split.

If we look at the implementation of randomSplit:

def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
  // It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
  // constituent partitions each time a split is materialized which could result in
  // overlapping splits. To prevent this, we explicitly sort each input partition to make the
  // ordering deterministic.
  val sorted = Sort(logicalPlan.output.map(SortOrder(_, Ascending)), global = false, logicalPlan)
  val sum = weights.sum
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  normalizedCumWeights.sliding(2).map { x =>
    new DataFrame(sqlContext, Sample(x(0), x(1), withReplacement = false, seed, sorted))
  }.toArray
}

we can see that it creates multiple DataFrames that share the same sqlContext, each with a different Sample.

How do these DataFrames communicate with each other so that a value that fell into the first one is not also included in the second one?

And is the data being fetched twice? (Assuming the sqlContext is selecting from a DB, is the select executed twice?)

Answer

It's exactly the same as sampling an RDD.

Assuming you have the weight array (0.6, 0.2, 0.2), Spark will generate one DataFrame for each of the ranges (0.0, 0.6), (0.6, 0.8), and (0.8, 1.0).
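The range construction above can be sketched in plain Scala (no Spark needed), mirroring the `scanLeft`/`sliding` lines from the implementation; `WeightRanges` is just an illustrative wrapper name:

```scala
object WeightRanges {
  def main(args: Array[String]): Unit = {
    val weights = Array(0.6, 0.2, 0.2)
    val sum = weights.sum
    // Cumulative boundaries: 0.0, 0.6, 0.8, 1.0 (up to floating-point noise)
    val cumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
    // Pair adjacent boundaries into (lower, upper) ranges, one per output split
    val ranges = cumWeights.sliding(2).map(x => (x(0), x(1))).toArray
    assert(ranges.length == weights.length)
    assert(ranges.head._1 == 0.0 && math.abs(ranges.last._2 - 1.0) < 1e-9)
    ranges.foreach { case (lo, hi) => println(f"[$lo%.1f, $hi%.1f)") }
  }
}
```

Because the boundaries are cumulative, the ranges tile [0.0, 1.0) without gaps or overlap, which is what makes the splits disjoint.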

When it's time to read a result DataFrame, Spark just goes over the parent DataFrame. For each item, it generates a random number; if that number falls in the specified range, it emits the item. All child DataFrames share the same random number generator (technically, different generators with the same seed), so the sequence of random numbers is deterministic.
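A minimal sketch of that mechanism in plain Scala, with a hypothetical `sample` function standing in for Spark's per-split sampler: every split replays the same seeded random sequence over the parent rows and keeps only the rows whose draw lands in its own range:

```scala
import scala.util.Random

object RangeSampling {
  // Each split re-scans the same rows with an identically seeded generator,
  // so row i gets the same random draw in every split.
  def sample(rows: Seq[String], lower: Double, upper: Double, seed: Long): Seq[String] = {
    val rng = new Random(seed)
    rows.filter { _ =>
      val x = rng.nextDouble() // in [0.0, 1.0)
      x >= lower && x < upper
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = (1 to 20).map(i => s"row$i")
    val seed = 42L
    val a = sample(rows, 0.0, 0.6, seed)
    val b = sample(rows, 0.6, 0.8, seed)
    val c = sample(rows, 0.8, 1.0, seed)
    // Each row's draw falls in exactly one range, so the splits are
    // disjoint and together cover every row exactly once.
    assert(a.intersect(b).isEmpty && b.intersect(c).isEmpty && a.intersect(c).isEmpty)
    assert((a ++ b ++ c).size == rows.size)
    println(s"${a.size} + ${b.size} + ${c.size} = ${rows.size}")
  }
}
```

So no communication between the child DataFrames is needed: determinism of the shared seed (plus the per-partition sort in the real implementation) guarantees disjointness.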

For your last question, if you did not cache the parent DataFrame, then the data for the input DataFrame will be re-fetched each time an output DataFrame is computed.
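A minimal sketch of why caching matters, in plain Scala with a hypothetical `fetch()` standing in for the database scan behind the DataFrame (in Spark itself you would call `cache()` on the parent DataFrame before `randomSplit`):

```scala
object CacheDemo {
  var fetchCount = 0

  // Stand-in for an uncached source: re-read on every materialization.
  def fetch(): Seq[Int] = { fetchCount += 1; 1 to 10 }

  def main(args: Array[String]): Unit = {
    // Without caching: materializing each of the three splits
    // triggers its own scan of the source.
    fetchCount = 0
    (0 until 3).foreach(_ => fetch())
    assert(fetchCount == 3)

    // With caching: the source is scanned once and the result reused
    // by all three splits.
    fetchCount = 0
    val cached = fetch()
    (0 until 3).foreach(_ => cached)
    assert(fetchCount == 1)
    println("uncached scans: 3, cached scans: 1")
  }
}
```

This is only an illustration of the evaluation pattern; the real saving in Spark comes from `cache()` persisting the parent's partitions so the three `Sample` operators read memory instead of re-running the select.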

