Spark Scala: Task Not Serializable Error

Problem Description

I am using IntelliJ Community Edition with the Scala plugin and the Spark libraries. I am still learning Spark and am working in a Scala Worksheet.

I have written the code below, which removes punctuation marks from a String:

def removePunctuation(text: String): String = {
  // "\\s" escapes the whitespace class; a single backslash is an illegal escape in a Scala string literal
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}

Then I read a text file and try to remove the punctuation:

val myfile = sc.textFile("/home/ubuntu/data.txt", 4).map(removePunctuation)

This gives the error below; any help would be appreciated:

org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(/home/ubuntu/src/main/scala/Test.sc:294)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(/home/ubuntu/src/main/scala/Test.sc:284)
    at org.apache.spark.util.ClosureCleaner$.clean(/home/ubuntu/src/main/scala/Test.sc:104)
    at org.apache.spark.SparkContext.clean(/home/ubuntu/src/main/scala/Test.sc:2090)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(/home/ubuntu/src/main/scala/Test.sc:366)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(/home/ubuntu/src/main/scala/Test.sc:365)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(/home/ubuntu/src/main/scala/Test.sc:147)
    at #worksheet#.#worksheet#(/home/ubuntu/src/main/scala/Test.sc:108)
Caused by: java.io.NotSerializableException: A$A21$A$A21
Serialization stack:
    - object not serializable (class: A$A21$A$A21, value: A$A21$A$A21@62db3891)
    - field (class: A$A21$A$A21$$anonfun$words$1, name: $outer, type: class A$A21$A$A21)
    - object (class A$A21$A$A21$$anonfun$words$1, )
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.map(RDD.scala:369)
    at A$A21$A$A21.words$lzycompute(Test.sc:27)
    at A$A21$A$A21.words(Test.sc:27)
    at A$A21$A$A21.get$$instance$$words(Test.sc:27)
    at A$A21$.main(Test.sc:73)
    at A$A21.main(Test.sc)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.jetbrains.plugins.scala.worksheet.MyWorksheetRunner.main(MyWorksheetRunner.java:22)

Recommended Answer

As @TGaweda suggests, Spark's SerializationDebugger is very helpful for identifying "the serialization path leading from the given object to the problematic object." All the dollar signs before "Serialization stack" in the stack trace indicate that the container object for your method is the problem.
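As an aside, here is a minimal sketch of the conventional workaround implied by that diagnosis (not part of the original answer; the object name TextCleaning is invented for illustration): moving the method into a standalone serializable object breaks the $outer reference to the worksheet's container class.

import org.apache.spark.SparkContext

// Hypothetical standalone container: closures built from its methods no longer
// drag in a non-serializable $outer reference to the worksheet object.
object TextCleaning extends Serializable {
  def removePunctuation(text: String): String = {
    val punctPattern = "[^a-zA-Z0-9\\s]".r
    punctPattern.replaceAllIn(text, "").toLowerCase
  }
}

def cleaned(sc: SparkContext) =
  sc.textFile("/home/ubuntu/data.txt", 4).map(TextCleaning.removePunctuation)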

While it is easiest to just slap Serializable on your container class, I prefer to take advantage of the fact that Scala is a functional language and to use your function as a first-class citizen:

sc.textFile("/home/ubuntu/data.txt", 4).map { text =>
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}

Or, if you really want to keep things separate:

val removePunctuation: String => String = (text: String) => {
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}
sc.textFile("/home/ubuntu/data.txt", 4).map(removePunctuation)

These options work, of course, because Regex is serializable, as you should confirm for yourself.
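If you want that confirmation, a quick Java-serialization round trip (a standalone sketch, not part of the original answer) shows that scala.util.matching.Regex survives the same mechanism Spark's JavaSerializer uses when shipping closures to executors:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.util.matching.Regex

// Serialize a compiled Regex to bytes and read it back.
val pattern: Regex = "[^a-zA-Z0-9\\s]".r
val bytes = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bytes)
oos.writeObject(pattern)
oos.close()

val restored = new ObjectInputStream(
  new ByteArrayInputStream(bytes.toByteArray)).readObject().asInstanceOf[Regex]
println(restored.replaceAllIn("Hello, World!", "")) // prints "Hello World"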

On a secondary but very important note, constructing a Regex is expensive, so factor it out of your transformations for the sake of performance, possibly with a broadcast.
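A sketch of that suggestion (variable names are illustrative, and sc is assumed to be the SparkContext as elsewhere in this post): compile the pattern once on the driver and broadcast it, so each executor deserializes it a single time instead of every task rebuilding it.

import scala.util.matching.Regex

// Build the Regex once on the driver and ship it as a broadcast variable;
// tasks then reuse broadcastPattern.value rather than recompiling the pattern.
val punctPattern: Regex = "[^a-zA-Z0-9\\s]".r
val broadcastPattern = sc.broadcast(punctPattern)

val cleaned = sc.textFile("/home/ubuntu/data.txt", 4).map { text =>
  broadcastPattern.value.replaceAllIn(text, "").toLowerCase
}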
