Scala Spark - task not serializable


Problem description

I have the following code, where the fault is at sc.parallelize():

val pairs = ret.cartesian(ret)
    .map {
        case ((k1, v1), (k2, v2)) => ((k1, k2), (v1.toList, v2.toList))
    }
for (pair <- pairs) {
    val test = sc.parallelize(pair._2._1.map(_._1)) // throws "Task not serializable"
}

where

  • k1, k2 are Strings
  • v1, v2 are lists of Doubles

I am getting the following error whenever I try to access sc. What am I doing wrong here?

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:869)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:868)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.foreach(RDD.scala:868)
    at CorrelationCalc$.main(CorrelationCalc.scala:33)
    at CorrelationCalc.main(CorrelationCalc.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
    - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@40bee8c5)
    - field (class: CorrelationCalc$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
    - object (class CorrelationCalc$$anonfun$main$1, )
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
    ... 20 more

Answer

The for-comprehension is just doing a pairs.foreach(): a for-loop without a yield desugars to foreach, which is exactly what the stack trace shows at RDD.foreach.
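Roughly, the compiler turns that loop into the following (a sketch of the desugaring, reusing the names from the question's code):

pairs.foreach { pair =>
    // This closure is serialized and shipped to each worker. It captures
    // sc (the SparkContext), which is not serializable, hence the error.
    val test = sc.parallelize(pair._2._1.map(_._1))
}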

RDD operations are performed by the workers, and to have them do that work, anything you send to them must be serializable. The SparkContext is attached to the master: it is responsible for managing the entire cluster.
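As a rule of thumb, a closure sent to the workers may capture ordinary serializable values, but never the SparkContext itself. A minimal sketch (the prefix value is made up for illustration):

// Fine: the closure captures only prefix, a serializable String.
val prefix = "pair-"
val labels = pairs.map { case ((k1, k2), _) => prefix + k1 + k2 }

// Not fine: referencing sc inside any RDD closure fails with
// "Task not serializable", exactly as in the stack trace above.
// pairs.foreach { pair => sc.parallelize(pair._2._1) }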

If you want to create an RDD, you have to be aware of the whole cluster (that's the second "D": distributed), so you can't create a new RDD on the workers. And you probably don't want to turn each row of pairs into its own RDD (each with the same name!) anyway.
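If the goal really is one flat, distributed collection of all those inner values, a single flatMap built on the driver avoids nested RDDs entirely (a sketch, assuming the tuple shape implied by the question's code):

// One flat RDD containing every element produced from every v1.toList:
val flattened = pairs.flatMap { case (_, (v1, _)) => v1.map(_._1) }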

It's difficult to tell from your code what you'd like to do, but it will probably look something like:

val test = pairs.map(r => r._2._1)

which would be an RDD where each row is whatever was in the v1.toList values.
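To sanity-check the result on the driver (only sensible for small samples), something like the following would work:

// Pull a few rows back to the driver and print them:
test.take(5).foreach(println)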
