Spark Scala: Task Not Serializable error

Problem Description

I am using IntelliJ Community Edition with the Scala plugin and the Spark libraries. I am still learning Spark and am working in a Scala worksheet.

I have written the code below, which removes punctuation marks from a String:

def removePunctuation(text: String): String = {
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}

Then I read a text file and try to remove punctuation:

val myfile = sc.textFile("/home/ubuntu/data.txt",4).map(removePunctuation)

This gives the error below; any help would be appreciated:

org.apache.spark.SparkException: Task not serializable
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(/home/ubuntu/src/main/scala/Test.sc:294)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(/home/ubuntu/src/main/scala/Test.sc:284)
	at org.apache.spark.util.ClosureCleaner$.clean(/home/ubuntu/src/main/scala/Test.sc:104)
	at org.apache.spark.SparkContext.clean(/home/ubuntu/src/main/scala/Test.sc:2090)
	at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(/home/ubuntu/src/main/scala/Test.sc:366)
	at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(/home/ubuntu/src/main/scala/Test.sc:365)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(/home/ubuntu/src/main/scala/Test.sc:147)
	at #worksheet#.#worksheet#(/home/ubuntu/src/main/scala/Test.sc:108)
Caused by: java.io.NotSerializableException: A$A21$A$A21
Serialization stack:
	- object not serializable (class: A$A21$A$A21, value: A$A21$A$A21@62db3891)
	- field (class: A$A21$A$A21$$anonfun$words$1, name: $outer, type: class A$A21$A$A21)
	- object (class A$A21$A$A21$$anonfun$words$1, )
	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
	at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
	at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.map(RDD.scala:369)
	at A$A21$A$A21.words$lzycompute(Test.sc:27)
	at A$A21$A$A21.words(Test.sc:27)
	at A$A21$A$A21.get$$instance$$words(Test.sc:27)
	at A$A21$.main(Test.sc:73)
	at A$A21.main(Test.sc)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.jetbrains.plugins.scala.worksheet.MyWorksheetRunner.main(MyWorksheetRunner.java:22)

Recommended Answer

As @TGaweda suggests, Spark's SerializationDebugger is very helpful for identifying "the serialization path leading from the given object to the problematic object." All the dollar signs before the "Serialization stack" in the stack trace indicate that the container object for your method is the problem.

While it is easiest to just slap Serializable on your container class, I prefer to take advantage of the fact that Scala is a functional language and use your function as a first-class citizen:

sc.textFile("/home/ubuntu/data.txt",4).map { text =>
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}

Or if you really want to keep things separate:

val removePunctuation: String => String = (text: String) => {
  val punctPattern = "[^a-zA-Z0-9\\s]".r
  punctPattern.replaceAllIn(text, "").toLowerCase
}
sc.textFile("/home/ubuntu/data.txt",4).map(removePunctuation)

A Regex is serializable, as you can confirm.
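
For instance, this quick sanity check (not part of the original answer) prints true, since scala.util.matching.Regex mixes in Serializable:

// Regex extends Serializable, so closing over or broadcasting one is safe.
println("[^a-zA-Z0-9\\s]".r.isInstanceOf[java.io.Serializable])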

On a secondary but very important note, constructing a Regex is expensive, so factor it out of your transformations for the sake of performance, possibly with a broadcast.
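
A minimal sketch of the broadcast variant, assuming the same SparkContext sc and file path as above: the pattern is compiled once on the driver and broadcast, so each executor deserializes it once rather than reserializing it with every task closure:

// Compile the Regex once on the driver and broadcast it to the executors.
val punctPattern = sc.broadcast("[^a-zA-Z0-9\\s]".r)

sc.textFile("/home/ubuntu/data.txt", 4)
  .map(text => punctPattern.value.replaceAllIn(text, "").toLowerCase)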
