Serialization and Custom Spark RDD Class
Question
I'm writing a custom Spark RDD implementation in Scala, and I'm debugging my implementation using the Spark shell. My goal for now is to get:
customRDD.count
to succeed without an Exception. Right now this is what I'm getting:
15/03/06 23:02:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/06 23:02:32 ERROR TaskSetManager: Failed to serialize task 0, not attempting to retry it.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at java.io.ObjectStreamClass$FieldReflector.getObjFieldValues(ObjectStreamClass.java:2050)
at java.io.ObjectStreamClass.getObjFieldValues(ObjectStreamClass.java:1252)
... 45 more
The "Failed to serialize task 0" catches my attention. I don't have a clear mental picture of what happens when I call customRDD.count, and it's very unclear exactly what could not be serialized.
My custom RDD consists of:
- a custom RDD class
- a custom Partition class
- a custom (Scala) Iterator class
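Since the question is about what needs to be serializable, here is a minimal sketch (all names are hypothetical, not the asker's actual code) of how these pieces typically fit together; the Partition objects and the RDD's own fields are what end up inside each serialized task:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Partitions are shipped to executors inside each task;
// org.apache.spark.Partition already extends Serializable, so every
// field a Partition carries must itself be serializable.
class CustomPartition(override val index: Int, val data: Seq[String]) extends Partition

// The RDD instance is serialized as part of each task, so any fields it
// keeps (here, a Map of parameters) must be serializable too. The
// SparkContext is driver-only and must not be captured as a field.
class CustomRDD(sc: SparkContext, params: Map[String, String])
    extends RDD[String](sc, Nil) {

  // Runs on the driver, but the resulting partitions are sent to executors.
  override protected def getPartitions: Array[Partition] =
    Array(new CustomPartition(0, params.values.toSeq))

  // Runs on an executor; the Iterator itself is consumed there, not serialized.
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    split.asInstanceOf[CustomPartition].data.iterator
}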
My Spark shell session looks like this:
import custom.rdd.stuff
import org.apache.spark.SparkContext
val conf = sc.getConf
conf.set(custom, parameters) // placeholder for custom configuration
sc.stop
val sc2 = new SparkContext(conf)
val mapOfThings: Map[String, String] = ...
val myRdd = customRDD(sc2, mapOfThings)
myRdd.count
... (exception output) ...
What I'd like to know is:
- For the purposes of creating a custom RDD class, what needs to be "serializable"?
- What does it mean to be "serializable", as far as Spark is concerned? Is this akin to Java's "Serializable"?
- Does all of the data returned by my RDD's Iterator (returned by the compute method) also need to be serializable? (See the sketch below.)
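To make the last question concrete: with Spark's default JavaSerializer, the elements themselves only need to be serializable when Spark has to ship them (e.g. on a collect or a shuffle), and a hypothetical element type like this would fail at that point:

// Hypothetical element type produced by compute(); it has no Serializable
// marker, so rdd.collect() or a shuffle would throw
// java.io.NotSerializableException: Record
class Record(val value: String)

// Marking the type Serializable (or configuring Kryo) avoids the failure:
class FixedRecord(val value: String) extends Serializable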
Thank you so much for any clarification on this issue.
Answer
In addition to Kenny's explanation, I would suggest you turn on serialization debugging to see what's causing the problem. Often it's humanly impossible to figure out just by looking at the code.
-Dsun.io.serialization.extendedDebugInfo=true
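For instance, with the Spark shell this flag can be passed to the driver JVM (and, if the executors also need it, via the standard spark.executor.extraJavaOptions setting); the exact command line here is illustrative:

spark-shell --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true" \
  --conf "spark.executor.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true"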