Serialization and Custom Spark RDD Class


Question

I'm writing a custom Spark RDD implementation in Scala, and I'm debugging my implementation using the Spark shell. My goal for now is to get:

customRDD.count

to succeed without an Exception. Right now this is what I'm getting:

15/03/06 23:02:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/06 23:02:32 ERROR TaskSetManager: Failed to serialize task 0, not attempting to retry it.
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)

...

Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
    at java.io.ObjectStreamClass$FieldReflector.getObjFieldValues(ObjectStreamClass.java:2050)
    at java.io.ObjectStreamClass.getObjFieldValues(ObjectStreamClass.java:1252)
    ... 45 more

The "Failed to serialize task 0" message catches my attention. I don't have a great mental picture of what goes on when I do customRDD.count, and it's very unclear exactly what could not be serialized.

My custom RDD consists of (see the sketch after this list):


  • a custom RDD class

  • a custom Partition class

  • a custom (Scala) Iterator class
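
To make that concrete, here is a minimal sketch of the kind of structure I mean (the class names and the per-partition parameter are placeholders for illustration, not my actual code):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Partition already extends Serializable, so instances of this class are
// shipped to the executors as part of each task.
class CustomPartition(override val index: Int, val param: String) extends Partition

class CustomRDD(sc: SparkContext, params: Map[String, String])
  extends RDD[String](sc, Nil) {

  // One partition per entry in the parameter map.
  override def getPartitions: Array[Partition] =
    params.toSeq.zipWithIndex.map { case ((k, v), i) =>
      new CustomPartition(i, s"$k=$v"): Partition
    }.toArray

  // compute runs on the executors; the RDD object and the Partition it
  // receives are serialized into each task, so any fields they hold must
  // themselves be serializable.
  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[CustomPartition]
    Iterator(p.param)
  }
}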

My Spark shell session looks like this:

import custom.rdd.stuff
import org.apache.spark.SparkContext

val conf = sc.getConf
conf.set(custom, parameters)
sc.stop
sc2 = new SparkContext(conf)
val mapOfThings: Map[String, String] = ...
myRdd = customRDD(sc2, mapOfStuff)
myRdd.count

... (exception output) ...

What I'd like to know is:



  • For the purposes of creating a custom RDD class, what needs to be "serializable"?
  • What does it mean to be "serializable", as far as Spark is concerned? Is this akin to Java's "Serializable"?
  • Does all the data returned from my RDD's Iterator (returned by the compute method) also need to be serializable?

Thank you so much for any clarification on this issue.

Answer

In addition to Kenny's explanation, I would suggest you turn on serialization debugging to see what's causing the problem. Often it's humanly impossible to figure out just by looking at the code.

-Dsun.io.serialization.extendedDebugInfo=true
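
One way to pass that JVM flag is shown below; this is a sketch assuming you launch spark-shell from the command line (the exact invocation depends on your setup and is not part of the original answer). The driver option is the important one here, since the task-serialization failure happens on the driver; the executor option covers deserialization on the workers.

spark-shell \
  --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true" \
  --conf "spark.executor.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true"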
