理解 Spark 序列化 [英] Understanding Spark serialization

查看:32
本文介绍了理解 Spark 序列化的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在 Spark 中,如何知道哪些对象在 driver 上实例化,哪些在 executor 上实例化,从而确定哪些类需要实现 Serializable ?

In Spark how does one know which objects are instantiated on driver and which are instantiated on executor , and hence how does one determine which classes needs to implement Serializable ?

推荐答案

序列化对象意味着将其状态转换为字节流,以便将字节流恢复为对象的副本.如果 Java 对象的类或其任何超类实现了 java.io.Serializable 接口或其子接口 java.io.Externalizable,则该对象是可序列化的.

To serialize an object means to convert its state to a byte stream so that the byte stream can be reverted back into a copy of the object. A Java object is serializable if its class or any of its superclasses implements either the java.io.Serializable interface or its subinterface, java.io.Externalizable.

一个类永远不会被序列化,只有一个类的对象才会被序列化.如果对象需要持久化或通过网络传输,则需要对象序列化.

A class is never serialized only object of a class is serialized. Object serialization is needed if object needs to be persisted or transmitted over the network .

Class Component            Serialization
instance variable           yes
Static instance variable    no
methods                     no
Static methods              no
Static inner class          no
local variables             no

让我们以一个示例 Spark 代码并经历各种场景

Let's take a sample Spark code and go through various scenarios

public class SparkSample {

      public int instanceVariable                =10 ;
      public static int staticInstanceVariable   =20 ;

      public int run(){

         int localVariable                       =30;

         // create Spark conf
         final SparkConf sparkConf = new SparkConf().setAppName(config.get(JOB_NAME).set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

         // create spark context 
         final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

        // read DATA 
        JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD(); 


        // Anonymous class used for lambda implementation
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
                @Override
                public Iterator<String> call(String s) {
                // How will the listed varibles be accessed in RDD across driver and Executors 
                System.out.println("Output :" + instanceVariable + " " + staticInstanceVariable + " " + localVariable);
                return Arrays.asList(SPACE.split(s)).iterator();
        });

        // SAVE OUTPUT
        words.saveAsTextFile(OUTPUT_PATH));

      }

       // Inner Static class for the funactional interface which can replace the lambda implementation above 
       public static class MapClass extends FlatMapFunction<String, String>() {
                @Override
                public Iterator<String> call(String s) {
                System.out.println("Output :" + instanceVariable + " " + staticInstanceVariable + " " + localVariable);
                return Arrays.asList(SPACE.split(s)).iterator();
        }); 

        public static void main(String[] args) throws Exception {
            JavaWordCount count = new JavaWordCount();
            count.run();
        }
}

内部类对象中外部类的实例变量的可访问性和可序列化性

Accessibility and Serializability of instance variable from Outer Class inside inner class objects

    Inner class        | Instance Variable (Outer class) | Static Instance Variable (Outer class) | Local Variable (Outer class)

    Anonymous class    | Accessible And Serialized       | Accessible yet not Serialized          | Accessible And Serialized 

    Inner Static class | Not Accessible                  | Accessible yet not Serialized          | Not Accessible 

理解 Spark 工作的经验法则是:

Rule of thumb while understanding Spark job is :

  1. 写在 RDD 中的所有 lambda 函数都在驱动程序上实例化,对象被序列化并发送给执行程序

  1. All the lambda functions written inside the RDD are instantiated on the driver and the objects are serialized and sent to the executors

如果在内部类中访问任何外部类实例变量,编译器会应用不同的逻辑来访问它们,因此外部类是否被序列化取决于您访问的内容.

If any outer class instance variables are accessed within the inner class, compiler apply different logic to access them, hence outer class gets serialized or not depends what do you access.

就 Java 而言,整个争论都是关于外部类与内部类以及访问外部类引用和变量如何导致序列化问题.

In terms of Java, the whole debate is about Outer class vs Inner class and how does accessing outer class references and variables leads to serialization issues.

各种场景:

编译器默认在字节码中插入构造函数

Compiler by default inserts constructor in the byte code of the

引用外部类对象的匿名类.

Anonymous class with reference to Outer class object .

外部类对象用于访问实例变量

The outer class object is used to access the instance variable

匿名类(){

 final Outer-class reference;

 Anonymous-class( Outer-class outer-reference){

reference = outer-reference;

}

}

外部类被序列化并与内部匿名类的序列化对象

The outer class is serialized and sent along with the serialized object of the inner anonymous class

由于静态变量没有序列化,外部类对象仍然插入到匿名类构造函数中.

As static variables are not serialized , outer class object is still inserted into the Anonymous class constructor .

静态变量的值取自类状态

The value of the static variable is taken from the class state

出现在那个执行者身上.

present on that executor .

编译器默认在字节码中插入构造函数

Compiler by default inserts constructor in the byte code of the

引用外部类对象和局部变量引用的匿名类.

Anonymous class with reference to Outer class object AND local variable refrence.

外部类对象用于访问实例变量

The outer class object is used to access the instance variable

匿名类(){

 final Outer-class reference;

final Local-variable localRefrence ;

 Anonymous-class( Outer-class outer-reference, Local-variable localRefrence){

reference = outer-reference;

this.localRefrence = localRefrence;

}

}

外层类是序列化的,局部变量对象也是

The outer class is serialized , and the local variable object is also

序列化后与内部匿名类的序列化对象一起发送

serialized and sent along with the serialized object of the inner anonymous class

当局部变量成为匿名类中的一个实例成员时,它需要被序列化.从外部类的角度来看,局部变量永远不能被序列化

As the local variable becomes a instance member inside the anonymous class it needs to be serialized . From outer class perspective the local variable can never be serialized

无法访问

无法访问

由于静态变量没有被序列化,因此没有外部类对象被序列化.

As static variables are not serialized hence no outer class object is serialized.

静态变量的值取自类状态

The value of the static variable is taken from the class state

出现在那个执行者身上.

present on that executor .

外部类未序列化并与序列化的静态内部类一起发送

Outer class is not serialized and send along with the serialized Static inner class

思考要点:

  1. 遵循Java序列化规则来选择需要序列化的类对象.

  1. Java Serialization rules are followed to select which class object needs to be serialized .

使用javap -p -c "abc.class"解开字节码,查看编译器生成的代码

Use javap -p -c "abc.class" to unwrap the byte code and see the compiler generated code

根据您在外部类的内部类中尝试访问的内容,编译器会生成不同的字节码.

Depending on what you are trying to access within the inner class of the outer class, compiler generates different byte code.

您不需要让类实现只能在驱动程序上访问的序列化.

You don't need to make classes implement Serialization which are only accessed on driver .

在 RDD 中使用的任何匿名/静态类(所有 lambda 函数都是匿名类)将在驱动程序上实例化.

Any anonymous/static class (all lambda function are anonymous class) used within RDD will be instantiated on the driver .

在 RDD 中使用的任何类/变量都将在驱动程序上实例化并发送给执行程序.

Any class/variable used inside RDD will be instantiated on driver and sent to the executors .

任何声明为瞬态的实例变量都不会在驱动程序上序列化.

Any instance variable declared transient will not be serialized on driver.

  1. 默认情况下,匿名类将强制您使外部类可序列化.
  2. 任何局部变量/对象都不必是可序列化的.
  3. 只有在匿名类内部使用局部变量时才需要序列化
  4. 可以在 pair,mapToPair 函数的 call() 方法中创建单例,从而确保它永远不会在驱动程序上初始化
  5. 静态变量永远不会被序列化,因此永远不会被发送从驱动程序到执行程序

  • 如果您需要仅在执行器上执行任何服务,请将它们设置为 lambda 函数内的静态字段,或者将它们设置为瞬态和单例并检查空条件以实例化它们
  • 这篇关于理解 Spark 序列化的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆