How to partition RDD by key in Spark?
Question
Given that the HashPartitioner docs say:

[HashPartitioner] implements hash-based partitioning using Java's Object.hashCode.
Say I want to partition DeviceData by its kind:

case class DeviceData(kind: String, time: Long, data: String)

Would it be correct to partition an RDD[DeviceData] by overriding the deviceData.hashCode() method and using only the hash code of kind?
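A minimal plain-Scala sketch of what this proposal would look like (the field values are made up; whether delegating hashCode to kind is the right approach is exactly what is being asked):

```scala
// Hypothetical sketch of the proposal: hashCode depends only on kind,
// so two records with the same kind hash identically and would land in
// the same partition under a HashPartitioner.
case class DeviceData(kind: String, time: Long, data: String) {
  override def hashCode(): Int = kind.hashCode
}

val a = DeviceData("sensor", 1L, "temp=20")
val b = DeviceData("sensor", 2L, "temp=21")

println(a.hashCode == b.hashCode) // prints true: same kind, same hash
```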
But given that HashPartitioner takes a number-of-partitions parameter, I am confused as to whether I need to know the number of kinds in advance, and what happens if there are more kinds than partitions?
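To illustrate the more-kinds-than-partitions case, here is a plain-Scala sketch of the assignment rule HashPartitioner uses (hash modulo the partition count, forced non-negative); the kind names are hypothetical. By the pigeonhole principle, with more kinds than partitions at least two kinds must share a partition:

```scala
// Sketch of HashPartitioner's rule: partition = hashCode mod numPartitions,
// adjusted so the result is never negative. Kind names are made up.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

val numPartitions = 3
val kinds = Seq("sensor", "gps", "camera", "lidar", "radar")

// Five kinds into three partitions: some kinds necessarily collide,
// so one partition's iterator can contain several kinds.
val assignment = kinds.map(k => k -> nonNegativeMod(k.hashCode, numPartitions))
assignment.foreach { case (k, p) => println(s"$k -> partition $p") }
```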
Is it correct that if I write partitioned data to disk it will stay partitioned when read back?
My goal is to call

deviceDataRdd.foreachPartition(d: Iterator[DeviceData] => ...)

and have only DeviceData records of the same kind value in the iterator.
Answer
How about just doing a groupByKey using kind? Or another PairRDDFunctions method.

It seems to me that you don't really care about the partitioning, just that you get all of a specific kind in one processing flow?
The pair functions allow this:

rdd.keyBy(_.kind).partitionBy(new HashPartitioner(PARTITIONS))
   .foreachPartition(...)
However, you can probably be a little safer with something more like:

rdd.keyBy(_.kind).reduceByKey(...)

or mapValues, or a number of the other pair functions that guarantee you get the pieces as a whole.