Spark HashPartitioner Unexpected Partitioning


Problem description

I am using HashPartitioner but getting an unexpected result. I am using 3 different Strings as keys and giving the partition parameter as 3, so I expect 3 partitions.

import org.apache.spark.HashPartitioner

val cars = Array("Honda", "Toyota", "Kia")

// Nine (name, price) pairs spread across 8 initial partitions.
val carnamePrice = sc.parallelize(for {
  x <- cars
  y <- Array(100, 200, 300)
} yield (x, y), 8)

// Repartition by key into 3 hash-based partitions.
val rddEachCar = carnamePrice.partitionBy(new HashPartitioner(3))

// Tag every element with the index of the partition it landed in.
val mapped = rddEachCar.mapPartitionsWithIndex { (index, iterator) =>
  println("Called in Partition -> " + index)
  iterator.map(x => x + " -> " + index)
}
mapped.take(10)

The result is below. It gives only 2 partitions. I checked the hash codes for the Strings (69909220, 75427, -1783892706). What could be the problem here? Perhaps I misunderstood the partitioning algorithm.

Array[String] = Array((Toyota,100) -> 0, (Toyota,200) -> 0, (Toyota,300) -> 0, (Honda,100) -> 1, (Honda,200) -> 1, (Honda,300) -> 1, (Kia,100) -> 1, (Kia,200) -> 1, (Kia,300) -> 1)

Answer

There is nothing strange going on here. Utils.nonNegativeMod, which is used by HashPartitioner, is implemented as follows:

def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  // Shift a negative remainder into [0, mod).
  rawMod + (if (rawMod < 0) mod else 0)
}
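For a negative input the fix-up branch kicks in. For example (a small illustration, not part of the original answer):

nonNegativeMod(-7, 3)  // rawMod = -7 % 3 = -1, so the result is -1 + 3 = 2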

With 3 partitions, the key distribution is defined as shown below:

for { car <- Seq("Honda", "Toyota", "Kia") } 
  yield (car -> nonNegativeMod(car.hashCode, 3))

Seq[(String, Int)] = List((Honda,1), (Toyota,0), (Kia,1))
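Plugging in the hash codes quoted in the question makes the collision explicit:

nonNegativeMod("Honda".hashCode, 3)   // 69909220    % 3 = 1
nonNegativeMod("Kia".hashCode, 3)     // 75427       % 3 = 1
nonNegativeMod("Toyota".hashCode, 3)  // -1783892706 % 3 = 0 (divisible by 3)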

This is exactly the distribution you observed: Honda and Kia both map to partition 1. In other words, the absence of a direct hash collision does not guarantee the absence of collisions modulo an arbitrary number.
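If you need exactly one partition per distinct key, hashing is the wrong tool; you can instead supply a custom Partitioner. A minimal sketch, assuming the set of keys is known up front (KeyPartitioner and its fallback behaviour are illustrative, not from the original answer):

import org.apache.spark.Partitioner

// Assigns each known key its own partition index; unknown keys fall back to 0.
class KeyPartitioner(keys: Seq[String]) extends Partitioner {
  private val keyToIndex = keys.zipWithIndex.toMap

  override def numPartitions: Int = keys.length

  override def getPartition(key: Any): Int =
    keyToIndex.getOrElse(key.asInstanceOf[String], 0)
}

// Usage: Honda, Toyota and Kia each land in their own partition.
val rddEachCarByKey = carnamePrice.partitionBy(new KeyPartitioner(cars))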
