了解Spark序列化 [英] Understanding Spark serialization
问题描述
在Spark中,如何知道哪些对象在驱动程序上实例化,哪些对象在执行程序上实例化,因此如何确定哪些类需要实现Serializable?
In Spark how does one know which objects are instantiated on driver and which are instantiated on executor , and hence how does one determine which classes needs to implement Serializable ?
推荐答案
序列化对象意味着将其状态转换为字节流,以便可以将字节流还原回该对象的副本.如果Java对象的类或其任何超类实现了java.io.Serializable接口或其子接口java.io.Externalizable,则它是可序列化的.
To serialize an object means to convert its state to a byte stream so that the byte stream can be reverted back into a copy of the object. A Java object is serializable if its class or any of its superclasses implements either the java.io.Serializable interface or its subinterface, java.io.Externalizable.
一个类从不被序列化,只有一个类的对象被序列化.如果需要持久化对象或通过网络传输对象,则需要对象序列化.
A class is never serialized only object of a class is serialized. Object serialization is needed if object needs to be persisted or transmitted over the network .
Class Component Serialization
instance variable yes
Static instance variable no
methods no
Static methods no
Static inner class no
local variables no
让我们看一下示例Spark代码并经历各种场景
Let's take a sample Spark code and go through various scenarios
public class SparkSample {
public int instanceVariable =10 ;
public static int staticInstanceVariable =20 ;
public int run(){
int localVariable =30;
// create Spark conf
final SparkConf sparkConf = new SparkConf().setAppName(config.get(JOB_NAME).set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
// create spark context
final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
// read DATA
JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();
// Anonymous class used for lambda implementation
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterator<String> call(String s) {
// How will the listed varibles be accessed in RDD across driver and Executors
System.out.println("Output :" + instanceVariable + " " + staticInstanceVariable + " " + localVariable);
return Arrays.asList(SPACE.split(s)).iterator();
});
// SAVE OUTPUT
words.saveAsTextFile(OUTPUT_PATH));
}
// Inner Static class for the funactional interface which can replace the lambda implementation above
public static class MapClass extends FlatMapFunction<String, String>() {
@Override
public Iterator<String> call(String s) {
System.out.println("Output :" + instanceVariable + " " + staticInstanceVariable + " " + localVariable);
return Arrays.asList(SPACE.split(s)).iterator();
});
public static void main(String[] args) throws Exception {
JavaWordCount count = new JavaWordCount();
count.run();
}
}
内部类对象中外部类的实例变量的可访问性和可序列化性
Accessibility and Serializability of instance variable from Outer Class inside inner class objects
Inner class | Instance Variable (Outer class) | Static Instance Variable (Outer class) | Local Variable (Outer class)
Anonymous class | Accessible And Serialized | Accessible yet not Serialized | Accessible And Serialized
Inner Static class | Not Accessible | Accessible yet not Serialized | Not Accessible
了解Spark工作的经验法则是:
Rule of thumb while understanding Spark job is :
-
在RDD中编写的所有lambda函数都在驱动程序上实例化,并且对象被序列化并发送给执行者
All the lambda functions written inside the RDD are instantiated on the driver and the objects are serialized and sent to the executors
如果在内部类中访问任何外部类实例变量,则编译器将应用不同的逻辑来访问它们,因此外部类是否要序列化取决于您访问的内容.
If any outer class instance variables are accessed within the inner class, compiler apply different logic to access them, hence outer class gets serialized or not depends what do you access.
就Java而言,整个争论都是关于外部类与内部类,以及访问外部类引用和变量如何导致序列化问题.
In terms of Java, the whole debate is about Outer class vs Inner class and how does accessing outer class references and variables leads to serialization issues.
各种情况:
默认情况下,编译器会在
Compiler by default inserts constructor in the byte code of the
引用外部类对象的匿名类.
Anonymous class with reference to Outer class object .
外部类对象用于访问实例变量
The outer class object is used to access the instance variable
Anonymous-class(){
Anonymous-class(){
final Outer-class reference;
Anonymous-class( Outer-class outer-reference){
reference = outer-reference;
}
}
外部类被序列化并与 内部匿名类的序列化对象
The outer class is serialized and sent along with the serialized object of the inner anonymous class
由于静态变量未序列化,因此外部类 对象仍插入到Anonymous类构造函数中.
As static variables are not serialized , outer class object is still inserted into the Anonymous class constructor .
静态变量的值取自类状态
The value of the static variable is taken from the class state
出现在那个执行人上.
默认情况下,编译器会在
Compiler by default inserts constructor in the byte code of the
引用外部类对象和局部变量引用的匿名类.
Anonymous class with reference to Outer class object AND local variable refrence.
外部类对象用于访问实例变量
The outer class object is used to access the instance variable
Anonymous-class(){
Anonymous-class(){
final Outer-class reference;
final Local-variable localRefrence ;
Anonymous-class( Outer-class outer-reference, Local-variable localRefrence){
reference = outer-reference;
this.localRefrence = localRefrence;
}
}
外部类已序列化,局部变量对象也是
The outer class is serialized , and the local variable object is also
序列化并与内部匿名类的序列化对象一起发送
serialized and sent along with the serialized object of the inner anonymous class
当局部变量成为匿名类内的实例成员时,需要对其进行序列化.从外部类的角度来看,本地变量永远无法序列化
As the local variable becomes a instance member inside the anonymous class it needs to be serialized . From outer class perspective the local variable can never be serialized
无法访问
无法访问
由于静态变量未序列化,因此没有外部类对象被序列化.
As static variables are not serialized hence no outer class object is serialized.
静态变量的值取自类状态
The value of the static variable is taken from the class state
出现在那个执行人上.
外部类未序列化,并且与序列化的Static内部类一起发送
Outer class is not serialized and send along with the serialized Static inner class
思考的要点:
-
遵循
-
Java序列化规则,以选择需要序列化的类对象.
Java Serialization rules are followed to select which class object needs to be serialized .
使用javap -p -c"abc.class"解开字节码,然后查看编译器生成的代码
Use javap -p -c "abc.class" to unwrap the byte code and see the compiler generated code
根据您要在外部类的内部类中尝试访问的内容,编译器会生成不同的字节码.
Depending on what you are trying to access within the inner class of the outer class, compiler generates different byte code.
您不需要使类实现仅在驱动程序上访问的序列化.
You don't need to make classes implement Serialization which are only accessed on driver .
在RDD中使用的任何匿名/静态类(所有lambda函数均为匿名类)都将在驱动程序上实例化.
Any anonymous/static class (all lambda function are anonymous class) used within RDD will be instantiated on the driver .
在RDD中使用的任何类/变量都将在驱动程序上实例化并发送给执行者.
Any class/variable used inside RDD will be instantiated on driver and sent to the executors .
任何声明为瞬态的实例变量都不会在驱动程序上序列化.
Any instance variable declared transient will not be serialized on driver.
- 默认情况下,匿名类将迫使您使外部类可序列化.
- 任何局部变量/对象都不必可序列化.
- 仅当在Anonymous类中使用局部变量时才需要序列化
- 一个人可以在pair,mapToPair函数的call()方法内创建单例,从而确保它从未在驱动程序上初始化
- 静态变量从不序列化,因此从不发送 从司机到执行人
- By default Anonymous classes will force you to make the outer class serializable.
- Any local variable/object need not have to be serializable .
- Only if local variable is used inside the Anonymous class needs to be serialized
- One can create singleton inside the call() method of pair,mapToPair function , thus making sure its never initialized on driver
- static variables are never serialized hence are never sent from driver to executors
这篇关于了解Spark序列化的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!