Serialization and Custom Spark RDD Class
Question
I'm writing a custom Spark RDD implementation in Scala, and I'm debugging my implementation using the Spark shell. My goal for now is to get:
customRDD.count
to succeed without an Exception. Right now this is what I'm getting:
15/03/06 23:02:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/06 23:02:32 ERROR TaskSetManager: Failed to serialize task 0, not attempting to retry it.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at java.io.ObjectStreamClass$FieldReflector.getObjFieldValues(ObjectStreamClass.java:2050)
at java.io.ObjectStreamClass.getObjFieldValues(ObjectStreamClass.java:1252)
... 45 more
The "Failed to serialize task 0" catches my attention. I don't have a clear mental picture of what happens when I call customRDD.count, and it's very unclear exactly what could not be serialized.
My custom RDD consists of:
- a custom RDD class
- a custom Partition class
- a custom (Scala) Iterator class
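Since the question is about what needs to be serializable, here is a minimal sketch (all names are hypothetical, not the asker's actual code) of how these pieces typically fit together; the Partition objects and the RDD's own fields are what end up inside each serialized task:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Partitions are shipped to executors inside each task;
// org.apache.spark.Partition already extends Serializable, so every
// field a Partition carries must itself be serializable.
class CustomPartition(override val index: Int, val data: Seq[String]) extends Partition

// The RDD instance is serialized as part of each task, so any fields it
// keeps (here, a Map of parameters) must be serializable too. The
// SparkContext is driver-only and must not be captured as a field.
class CustomRDD(sc: SparkContext, params: Map[String, String])
    extends RDD[String](sc, Nil) {

  // Runs on the driver, but the resulting partitions are sent to executors.
  override protected def getPartitions: Array[Partition] =
    Array(new CustomPartition(0, params.values.toSeq))

  // Runs on an executor; the Iterator itself is consumed there, not serialized.
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    split.asInstanceOf[CustomPartition].data.iterator
}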
My Spark shell session looks like this:
import custom.rdd.stuff
import org.apache.spark.SparkContext
val conf = sc.getConf
conf.set(custom, parameters) // placeholder for custom configuration
sc.stop
val sc2 = new SparkContext(conf)
val mapOfThings: Map[String, String] = ...
val myRdd = customRDD(sc2, mapOfThings)
myRdd.count
... (exception output) ...
What I'd like to know is:
- For the purposes of creating a custom RDD class, what needs to be "serializable"?
- What does it mean to be "serializable", as far as Spark is concerned? Is this akin to Java's "Serializable"?
- Does all of the data returned by my RDD's Iterator (returned by the compute method) also need to be serializable? (See the sketch below.)
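To make the last question concrete: with Spark's default JavaSerializer, the elements themselves only need to be serializable when Spark has to ship them (e.g. on a collect or a shuffle), and a hypothetical element type like this would fail at that point:

// Hypothetical element type produced by compute(); it has no Serializable
// marker, so rdd.collect() or a shuffle would throw
// java.io.NotSerializableException: Record
class Record(val value: String)

// Marking the type Serializable (or configuring Kryo) avoids the failure:
class FixedRecord(val value: String) extends Serializable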
Thank you so much for any clarification on this issue.
Answer
In addition to Kenny's explanation, I would suggest you turn on serialization debugging to see what's causing the problem. Often it's humanly impossible to figure out just by looking at the code.
-Dsun.io.serialization.extendedDebugInfo=true
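For instance, with the Spark shell this flag can be passed to the driver JVM (and, if the executors also need it, via the standard spark.executor.extraJavaOptions setting); the exact command line here is illustrative:

spark-shell --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true" \
  --conf "spark.executor.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true"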