How does Spark's RDD.randomSplit actually split the RDD
Question
So assume I've got an RDD with 3000 rows. The first 2000 rows are of class 1 and the last 1000 rows are of class 2. The RDD is partitioned across 100 partitions.
When calling RDD.randomSplit(0.8, 0.2):
Does the function also shuffle the RDD? Or does the split simply sample 20% of the RDD contiguously? Or does it select 20% of the partitions at random?
Ideally I'd like the resulting splits to have the same class distribution as the original RDD, i.e. 2:1.
Thanks
Answer
For each range defined by the weights array there is a separate mapPartitionsWithIndex transformation which preserves partitioning.
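As a rough sketch of those semantics (plain Python, not Spark's actual implementation): the weights are normalized into cumulative [lower, upper) ranges, and each output split is produced by a separate pass that re-seeds the same random generator, so every pass sees the identical sequence of draws and each element lands in exactly one range. The function name and structure here are illustrative assumptions.

```python
import random

def random_split(data, weights, seed=42):
    """Sketch of RDD.randomSplit semantics: normalize weights into
    cumulative [lb, ub) ranges, then make one pass per range.
    Re-seeding the RNG per pass means every element gets the same
    random draw in each pass, so the splits are disjoint and exhaustive."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        lb = acc
        acc += w / total
        bounds.append((lb, acc))
    splits = []
    for lb, ub in bounds:
        rng = random.Random(seed)  # same seed for every range
        splits.append([x for x in data if lb <= rng.random() < ub])
    return splits

parts = random_split(range(3000), [0.8, 0.2])
```

Note that each pass scans the whole input; this mirrors how Spark evaluates each split as its own transformation over the parent RDD rather than shuffling data between splits.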
Each partition is sampled using a BernoulliCellSampler. It iterates over the elements of a given partition and selects an item if the value of the next random Double falls within the range defined by the normalized weights. This means randomSplit:
- does not shuffle the RDD
- does not take contiguous blocks, other than by chance
- takes a random sample from each partition
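Because every element is kept independently with the same probability, the class distribution is preserved approximately (not exactly) in each split. A small plain-Python demo under the question's assumptions (3000 rows, 2000 of class 1 followed by 1000 of class 2, element-wise sampling at probability 0.8 as the Bernoulli sampler would do for the 80% split):

```python
import random

# Hypothetical demo, not Spark code: first 2000 rows class 1,
# last 1000 rows class 2, each row kept independently with p = 0.8.
rows = [1] * 2000 + [2] * 1000
rng = random.Random(0)
kept = [label for label in rows if rng.random() < 0.8]

c1 = kept.count(1)  # close to 2000 * 0.8 = 1600
c2 = kept.count(2)  # close to 1000 * 0.8 = 800
ratio = c1 / c2     # hovers around 2.0, the original 2:1 ratio
```

So the 2:1 ratio carries over in expectation, with binomial noise around it. If an exactly stratified split is required, a per-class sampling approach (e.g. Spark's sampleByKey with per-class fractions) is the usual alternative.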