如何定义自定义分区的大小相等的分区所在每个分区都有的元素数量相等的星火RDDS? [英] How to Define Custom partitioner for Spark RDDs of equally sized partition where each partition has equal number of elements?
问题描述
我是新来的火花。我有一个大的数据集的元素[RDD]的,我希望把它分成维护元素的二阶完全相等大小的分区。我试着用 RangePartitioner
像
I am new to Spark. I have a large dataset of elements[RDD] and I want to divide it into two exactly equal sized partitions maintaining order of elements. I tried using RangePartitioner
like
var data = partitionedFile.partitionBy(new RangePartitioner(2, partitionedFile))
由于它大致分为但不完全相等大小的维护元素的次序这不给一个满意的结果。
例如,如果有64个元素,我们使用 Rangepartitioner
,然后将其分为31元和33元。
This doesn't give a satisfactory result because it divides roughly but not exactly equal sized maintaining order of elements.
For example if there are 64 elements, we use
Rangepartitioner
, then it divides into 31 elements and 33 elements.
我需要一个分区,这样我得到完全相同首先在一半,另一半则包含第二组32个元素的32个元素。
能否请你帮助我,建议如何使用自定义分区,这样我得到同样大小的两半,维护元素的顺序?
I need a partitioner such that I get exactly first 32 elements in one half and other half contains second set of 32 elements. Could you please help me by suggesting how to use a customized partitioner such that I get equally sized two halves, maintaining the order of elements?
推荐答案
分区
S按指定一键分区的工作。您需要密钥分发的先验知识,或者看一下所有的键,做出这样的分区。这就是为什么星火不会为你提供。
Partitioner
s work by assigning a key to a partition. You would need prior knowledge of the key distribution, or look at all keys, to make such a partitioner. This is why Spark does not provide you with one.
在一般你不需要这样的分区。事实上,我不能拿出一个用例在那里我需要大小相等的分区。如果元件的数目为奇数
In general you do not need such a partitioner. In fact I cannot come up with a use case where I would need equal-size partitions. What if the number of elements is odd?
不管怎样,让我们说你有通过连续内部
取值键入一个RDD,你知道有多少总。然后,你可以写一个自定义分区
是这样的:
Anyway, let us say you have an RDD keyed by sequential Int
s, and you know how many in total. Then you could write a custom Partitioner
like this:
class ExactPartitioner[V](
partitions: Int,
elements: Int)
extends Partitioner {
def getPartition(key: Any): Int = {
val k = key.asInstanceOf[Int]
// `k` is assumed to go continuously from 0 to elements-1.
return k * partitions / elements
}
}
这篇关于如何定义自定义分区的大小相等的分区所在每个分区都有的元素数量相等的星火RDDS?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!