了解Spark序列化 [英] Understanding Spark serialization

查看:88
本文介绍了了解Spark序列化的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在Spark中,如何知道哪些对象在驱动程序上实例化,哪些对象在执行程序上实例化,因此如何确定哪些类需要实现Serializable?

In Spark how does one know which objects are instantiated on driver and which are instantiated on executor , and hence how does one determine which classes needs to implement Serializable ?

推荐答案

序列化对象意味着将其状态转换为字节流,以便可以将字节流还原回该对象的副本.如果Java对象的类或其任何超类实现了java.io.Serializable接口或其子接口java.io.Externalizable,则它是可序列化的.

To serialize an object means to convert its state to a byte stream so that the byte stream can be reverted back into a copy of the object. A Java object is serializable if its class or any of its superclasses implements either the java.io.Serializable interface or its subinterface, java.io.Externalizable.

一个类从不被序列化,只有一个类的对象被序列化.如果需要持久化对象或通过网络传输对象,则需要对象序列化.

A class is never serialized only object of a class is serialized. Object serialization is needed if object needs to be persisted or transmitted over the network .

Class Component            Serialization
instance variable           yes
Static instance variable    no
methods                     no
Static methods              no
Static inner class          no
local variables             no

让我们看一下示例Spark代码并经历各种场景

Let's take a sample Spark code and go through various scenarios

public class SparkSample {

      public int instanceVariable                =10 ;
      public static int staticInstanceVariable   =20 ;

      public int run(){

         int localVariable                       =30;

         // create Spark conf
         final SparkConf sparkConf = new SparkConf().setAppName(config.get(JOB_NAME).set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

         // create spark context 
         final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

        // read DATA 
        JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD(); 


        // Anonymous class used for lambda implementation
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
                @Override
                public Iterator<String> call(String s) {
                // How will the listed varibles be accessed in RDD across driver and Executors 
                System.out.println("Output :" + instanceVariable + " " + staticInstanceVariable + " " + localVariable);
                return Arrays.asList(SPACE.split(s)).iterator();
        });

        // SAVE OUTPUT
        words.saveAsTextFile(OUTPUT_PATH));

      }

       // Inner Static class for the funactional interface which can replace the lambda implementation above 
       public static class MapClass extends FlatMapFunction<String, String>() {
                @Override
                public Iterator<String> call(String s) {
                System.out.println("Output :" + instanceVariable + " " + staticInstanceVariable + " " + localVariable);
                return Arrays.asList(SPACE.split(s)).iterator();
        }); 

        public static void main(String[] args) throws Exception {
            JavaWordCount count = new JavaWordCount();
            count.run();
        }
}

内部类对象中外部类的实例变量的可访问性和可序列化性

Accessibility and Serializability of instance variable from Outer Class inside inner class objects

    Inner class        | Instance Variable (Outer class) | Static Instance Variable (Outer class) | Local Variable (Outer class)

    Anonymous class    | Accessible And Serialized       | Accessible yet not Serialized          | Accessible And Serialized 

    Inner Static class | Not Accessible                  | Accessible yet not Serialized          | Not Accessible 

了解Spark工作的经验法则是:

Rule of thumb while understanding Spark job is :

  1. 在RDD中编写的所有lambda函数都在驱动程序上实例化,并且对象被序列化并发送给执行者

  1. All the lambda functions written inside the RDD are instantiated on the driver and the objects are serialized and sent to the executors

如果在内部类中访问任何外部类实例变量,则编译器将应用不同的逻辑来访问它们,因此外部类是否要序列化取决于您访问的内容.

If any outer class instance variables are accessed within the inner class, compiler apply different logic to access them, hence outer class gets serialized or not depends what do you access.

就Java而言,整个争论都是关于外部类与内部类,以及访问外部类引用和变量如何导致序列化问题.

In terms of Java, the whole debate is about Outer class vs Inner class and how does accessing outer class references and variables leads to serialization issues.

各种情况:

默认情况下,编译器会在

Compiler by default inserts constructor in the byte code of the

引用外部类对象的匿名类.

Anonymous class with reference to Outer class object .

外部类对象用于访问实例变量

The outer class object is used to access the instance variable

Anonymous-class(){

Anonymous-class(){

 final Outer-class reference;

 Anonymous-class( Outer-class outer-reference){

reference = outer-reference;

}

}

外部类被序列化并与 内部匿名类的序列化对象

The outer class is serialized and sent along with the serialized object of the inner anonymous class

由于静态变量未序列化,因此外部类 对象仍插入到Anonymous类构造函数中.

As static variables are not serialized , outer class object is still inserted into the Anonymous class constructor .

静态变量的值取自类状态

The value of the static variable is taken from the class state

出现在那个执行人上.

默认情况下,编译器会在

Compiler by default inserts constructor in the byte code of the

引用外部类对象和局部变量引用的匿名类.

Anonymous class with reference to Outer class object AND local variable refrence.

外部类对象用于访问实例变量

The outer class object is used to access the instance variable

Anonymous-class(){

Anonymous-class(){

 final Outer-class reference;

final Local-variable localRefrence ;

 Anonymous-class( Outer-class outer-reference, Local-variable localRefrence){

reference = outer-reference;

this.localRefrence = localRefrence;

}

}

外部类已序列化,局部变量对象也是

The outer class is serialized , and the local variable object is also

序列化并与内部匿名类的序列化对象一起发送

serialized and sent along with the serialized object of the inner anonymous class

当局部变量成为匿名类内的实例成员时,需要对其进行序列化.从外部类的角度来看,本地变量永远无法序列化

As the local variable becomes a instance member inside the anonymous class it needs to be serialized . From outer class perspective the local variable can never be serialized

无法访问

无法访问

由于静态变量未序列化,因此没有外部类对象被序列化.

As static variables are not serialized hence no outer class object is serialized.

静态变量的值取自类状态

The value of the static variable is taken from the class state

出现在那个执行人上.

外部类未序列化,并且与序列化的Static内部类一起发送

Outer class is not serialized and send along with the serialized Static inner class

思考的要点:

    遵循
  1. Java序列化规则,以选择需要序列化的类对象.

  1. Java Serialization rules are followed to select which class object needs to be serialized .

使用javap -p -c"abc.class"解开字节码,然后查看编译器生成的代码

Use javap -p -c "abc.class" to unwrap the byte code and see the compiler generated code

根据您要在外部类的内部类中尝试访问的内容,编译器会生成不同的字节码.

Depending on what you are trying to access within the inner class of the outer class, compiler generates different byte code.

您不需要使类实现仅在驱动程序上访问的序列化.

You don't need to make classes implement Serialization which are only accessed on driver .

在RDD中使用的任何匿名/静态类(所有lambda函数均为匿名类)都将在驱动程序上实例化.

Any anonymous/static class (all lambda function are anonymous class) used within RDD will be instantiated on the driver .

在RDD中使用的任何类/变量都将在驱动程序上实例化并发送给执行者.

Any class/variable used inside RDD will be instantiated on driver and sent to the executors .

任何声明为瞬态的实例变量都不会在驱动程序上序列化.

Any instance variable declared transient will not be serialized on driver.

  1. 默认情况下,匿名类将迫使您使外部类可序列化.
  2. 任何局部变量/对象都不必可序列化.
  3. 仅当在Anonymous类中使用局部变量时才需要序列化
  4. 一个人可以在pair,mapToPair函数的call()方法内创建单例,从而确保它从未在驱动程序上初始化
  5. 静态变量从不序列化,因此从不发送 从司机到执行人
  1. By default Anonymous classes will force you to make the outer class serializable.
  2. Any local variable/object need not have to be serializable .
  3. Only if local variable is used inside the Anonymous class needs to be serialized
  4. One can create singleton inside the call() method of pair,mapToPair function , thus making sure its never initialized on driver
  5. static variables are never serialized hence are never sent from driver to executors

  • 如果您只需要在执行程序上执行任何服务,请在lambda函数中将它们设置为静态字段,或者将它们设置为瞬态和singelton并检查是否为空条件以实例化它们
  • 这篇关于了解Spark序列化的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆