How to partition RDD by key in Spark?

Question

Given that the HashPartitioner docs say:

[HashPartitioner] implements hash-based partitioning using Java's Object.hashCode.

Say I would like to partition DeviceData by its kind.

case class DeviceData(kind: String, time: Long, data: String)

Would it be correct to partition an RDD[DeviceData] by overwriting the deviceData.hashCode() method and use only the hashcode of kind?
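
For reference, a minimal sketch of that override. Note that HashPartitioner hashes the key of a pair RDD, so this only takes effect if DeviceData itself serves as the key (e.g. via rdd.keyBy(identity)):

case class DeviceData(kind: String, time: Long, data: String) {
  // Delegate hashing to kind alone, so equal kinds map to one partition.
  override def hashCode(): Int = kind.hashCode
  // If records are ever compared as keys, equals should be kept
  // consistent with hashCode; that is omitted here for brevity.
}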

But given that HashPartitioner takes a number of partitions parameter I am confused as to whether I need to know the number of kinds in advance and what happens if there are more kinds than partitions?
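
For context: the partition count does not have to match the number of kinds. HashPartitioner assigns each key to key.hashCode modulo numPartitions, so with more kinds than partitions several kinds simply land in the same partition. A simplified sketch of that assignment (Spark's actual code uses Utils.nonNegativeMod; the kind names here are made up):

def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw  // keep the index non-negative
}

// Five kinds, three partitions: at least two kinds must share a partition.
Seq("gps", "thermo", "accel", "gyro", "mic")
  .foreach(k => println(s"$k -> partition ${partitionFor(k, 3)}"))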

Is it correct that if I write partitioned data to disk it will stay partitioned when read?
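
For context on the disk question: an RDD's partitioner is in-memory metadata and is not recorded by plain file output, so data read back from disk starts with no partitioner and has to be re-partitioned. A hedged sketch, with a placeholder path:

import org.apache.spark.HashPartitioner

val partitioned = deviceDataRdd.keyBy(_.kind).partitionBy(new HashPartitioner(8))
partitioned.saveAsObjectFile("/tmp/devices")   // one output file per partition

val reread = sc.objectFile[(String, DeviceData)]("/tmp/devices")
println(reread.partitioner)                    // None: the metadata is gone
val again = reread.partitionBy(new HashPartitioner(8))  // re-establish it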

My goal is to call

  deviceDataRdd.foreachPartition((d: Iterator[DeviceData]) => ...)

And have only DeviceData's of the same kind value in the iterator.
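
If the set of kinds is known up front, one way to get exactly that is a custom Partitioner that maps each kind to its own partition. A sketch (KindPartitioner and the kind names are illustrative, not a Spark API):

import org.apache.spark.Partitioner

class KindPartitioner(kinds: Seq[String]) extends Partitioner {
  private val index = kinds.zipWithIndex.toMap
  override def numPartitions: Int = kinds.size
  // Unknown kinds would throw here; a fallback partition could be added.
  override def getPartition(key: Any): Int = index(key.asInstanceOf[String])
}

deviceDataRdd
  .keyBy(_.kind)                                          // RDD[(String, DeviceData)]
  .partitionBy(new KindPartitioner(Seq("gps", "thermo"))) // one partition per kind
  .foreachPartition { iter =>
    iter.foreach { case (_, d) => () }  // every d here shares a single kind
  }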

Answer

How about just doing a groupByKey using kind? Or another PairRDDFunctions method.
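
A minimal sketch of that suggestion, reusing the question's deviceDataRdd:

deviceDataRdd
  .keyBy(_.kind)  // RDD[(String, DeviceData)]
  .groupByKey()   // RDD[(String, Iterable[DeviceData])]
  .foreach { case (kind, records) =>
    // Every record of this kind arrives in one Iterable,
    // whatever the partition count.
    println(s"$kind: ${records.size} records")
  }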

It sounds like you don't really care about the partitioning itself, just that you get all records of a specific kind in one processing flow?

The pair functions allow this:

rdd.keyBy(_.kind)                                 // RDD[(String, DeviceData)]
   .partitionBy(new HashPartitioner(PARTITIONS))  // co-locate equal kinds
   .foreachPartition(...)

However, you can probably be a little safer with something more like:

rdd.keyBy(_.kind).reduceByKey(....)

or mapValues or a number of the other pair functions that guarantee you get the pieces as a whole.
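
For instance, a sketch with a made-up reduction that concatenates the data fields of each kind:

rdd.keyBy(_.kind)
   .mapValues(_.data)          // keep just the payload
   .reduceByKey(_ + "\n" + _)  // one combined value per kind

reduceByKey also combines values map-side before the shuffle, which is part of what makes it safer and cheaper than moving whole partitions around.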
