How to use a byte array as a key in an RDD?
Question
I want to use Array[Byte] as the key in an RDD. For example:
import org.apache.spark.rdd.RDD

val rdd1: RDD[(Array[Byte], (String, Int))] = ??? // built from the source RDD
val rdd2: RDD[(Array[Byte], (String, Int))] = ??? // built from the destination RDD
val resultRdd = rdd1.join(rdd2)
I want to join rdd1 and rdd2 using Array[Byte] as the key, but resultRdd.count() is always 0.
So I tried wrapping the Array[Byte] in a serializable class with its own equals and hashCode. This works as I expect, but only for small datasets.
val serRdd1= rdd1.map { case (k,v) => (new SerByteArr(k), v) }
val serRdd2= rdd2.map { case (k,v) => (new SerByteArr(k), v) }
class SerByteArr(val bytes: Array[Byte]) extends Serializable {
override val hashCode = bytes.deep.hashCode
override def equals(obj:Any) = obj.isInstanceOf[SerByteArr] && obj.asInstanceOf[SerByteArr].bytes.deep == this.bytes.deep
}
For a large dataset I get java.lang.OutOfMemoryError: GC overhead limit exceeded. The problem occurs when creating the wrapper objects (new SerByteArr(k)).
How can I avoid the GC overhead limit error? Can anyone help?
Recommended answer
The join returns nothing because JVM arrays inherit equals and hashCode from Object, so two arrays are equal only if they are the same instance. You can use the built-in Scala wrapper for arrays, WrappedArray[Byte], instead. An array can be converted to a WrappedArray with the toSeq method. WrappedArray implements equals and hashCode properly, so two different arrays with the same elements are considered equal.
scala> val a = Array(1,2,3,4,5)
a: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val b = Array(1,2,3,4,5)
b: Array[Int] = Array(1, 2, 3, 4, 5)
scala> a == b
res0: Boolean = false
scala> a.toSeq
res1: Seq[Int] = WrappedArray(1, 2, 3, 4, 5)
scala> a.toSeq == b.toSeq
res2: Boolean = true
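To show how this affects a key-based join, here is a minimal sketch in plain Scala, without Spark. The hashJoin helper is hypothetical and written only for this illustration. It is an in-memory inner join that matches keys through hashCode and equals, which is the same mechanism a hash-based join depends on. The sample keys and values are made up.

```scala
object ByteArrayKeyJoin {
  // Hypothetical helper (not a Spark API): an in-memory inner join that
  // matches keys via hashCode/equals, as a hash-based join does.
  def hashJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] = {
    // Index the right side by key.
    val index: Map[K, Seq[B]] =
      right.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }
    // Emit one output row per matching (left, right) pair.
    for {
      (k, a) <- left
      b <- index.getOrElse(k, Seq.empty)
    } yield (k, (a, b))
  }

  def main(args: Array[String]): Unit = {
    // Two distinct array instances holding the same bytes (made-up data).
    val left  = Seq(Array[Byte](1, 2, 3) -> ("src", 1))
    val right = Seq(Array[Byte](1, 2, 3) -> ("dest", 2))

    // Array keys compare by reference, so nothing matches.
    println(hashJoin(left, right).size) // prints 0

    // Converting the keys with toSeq gives structural equals/hashCode.
    val leftSeq  = left.map { case (k, v) => (k.toSeq, v) }
    val rightSeq = right.map { case (k, v) => (k.toSeq, v) }
    println(hashJoin(leftSeq, rightSeq).size) // prints 1
  }
}
```

On the RDDs from the question, the analogous change would be to map the keys before joining: rdd1.map { case (k, v) => (k.toSeq, v) }.join(rdd2.map { case (k, v) => (k.toSeq, v) }). Whether this is enough to relieve the GC pressure depends on the data size and executor memory, which the question does not state.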