Dealing with an incompatible version change of a serialization framework


Problem description

We have a Hadoop cluster on which we store data that is serialized to bytes using Kryo (a serialization framework). The Kryo version we use for this is a fork of the official release 2.21, carrying our own patches for issues we ran into while using Kryo. The current Kryo version 2.22 also fixes these issues, but with different solutions. As a result, we cannot simply switch to the new Kryo version, because we would no longer be able to read the data already stored on our Hadoop cluster. To address this problem, we want to run a Hadoop job which

  1. reads the stored data
  2. deserializes the data stored with the old version of Kryo
  3. serializes the restored objects with the new version of Kryo
  4. writes the new serialized representation back to our data store

The problem is that it is not trivial to use two different versions of the same class in one Java program (more precisely, in a Hadoop job's mapper class).

Question in a nutshell

How is it possible to deserialize and serialize an object with two different versions of the same serialization framework in one Hadoop job?

Relevant facts overview

  • We have data stored on a Hadoop CDH4 cluster, serialized with a Kryo version 2.21.2-ourpatchbranch
  • We want to have the data serialized with Kryo version 2.22, which is incompatible with our version
  • We build our Hadoop job JARs with Apache Maven

Possible (and impossible) approaches

(1) Renaming packages

The first approach that came to mind was to rename the packages in our own Kryo branch using the relocation functionality of the Maven Shade plugin and release it with a different artifact ID, so we could depend on both artifacts in our conversion job project. We would then instantiate one Kryo object for each version, using the old one for deserialization and the new one for serializing the object again.

Problems
We don't use Kryo explicitly in Hadoop jobs, but rather access it through multiple layers of our own libraries. For each of these libraries, it would be necessary to

  1. rename involved packages and
  2. create a release with a different group or artifact ID

To make things even messier, we also use Kryo serializers provided by other third-party libraries, for which we would have to do the same thing.


(2) Using multiple class loaders

The second approach we came up with was to not depend on Kryo at all in the Maven project which contains the conversion job, but to load the required classes for each version from a JAR stored in Hadoop's distributed cache. Serializing an object would then look something like this:

// Needs java.io.ByteArrayOutputStream, java.io.OutputStream and java.lang.reflect.Method.
public byte[] serialize(Object foo, JarClassLoader cl) throws Exception {
    // Load Kryo and its Output class through the version-specific loader.
    final Class<?> kryoClass = cl.loadClass("com.esotericsoftware.kryo.Kryo");
    final Class<?> outputClass = cl.loadClass("com.esotericsoftware.kryo.io.Output");
    Object k = kryoClass.getConstructor().newInstance();

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    Object output = outputClass.getConstructor(OutputStream.class).newInstance(baos);
    Method writeObject = kryoClass.getMethod("writeObject", outputClass, Object.class);
    writeObject.invoke(k, output, foo);
    // Closing the Output flushes its buffer into the byte array stream.
    outputClass.getMethod("close").invoke(output);
    return baos.toByteArray();
}

Problems
Though this approach might work to instantiate an unconfigured Kryo object and serialize / restore some objects, we use a much more complex Kryo configuration. This includes several custom serializers, registered class IDs, et cetera. For example, we could not find a way to set custom serializers for classes without getting a NoClassDefFoundError - the following code does not work:

Class<?> kryoClass = this.loadClass("com.esotericsoftware.kryo.Kryo");
Object kryo = kryoClass.getConstructor().newInstance();
Method addDefaultSerializer = kryoClass.getMethod("addDefaultSerializer", Class.class, Class.class);
addDefaultSerializer.invoke(kryo, URI.class, URISerializer.class); // throws NoClassDefFoundError

The last line throws a

java.lang.NoClassDefFoundError: com/esotericsoftware/kryo/Serializer

because the URISerializer class references Kryo's Serializer class and tries to load it using its own class loader (which is the System class loader), which does not know the Serializer class.
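
One possible way around this error, sketched here under assumptions: keep both Kryo and the custom serializers off the job's own classpath, put their JARs on the same child class loader, and look the serializer class up by name through that loader, so that Serializer is resolved next to its subclass. The class KryoLoaders, its method names, and any JAR paths passed to it are hypothetical; only the reflective calls mirror the snippet above.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;

final class KryoLoaders {

    /** Builds a child loader that sees the given JARs plus everything the parent sees. */
    static URLClassLoader childLoader(ClassLoader parent, String... jarPaths)
            throws MalformedURLException {
        URL[] urls = new URL[jarPaths.length];
        for (int i = 0; i < jarPaths.length; i++) {
            urls[i] = Paths.get(jarPaths[i]).toUri().toURL();
        }
        return new URLClassLoader(urls, parent);
    }

    /**
     * Creates a Kryo instance and registers a default serializer for it.
     * The serializer is resolved by name in the SAME loader as Kryo, so its
     * superclass com.esotericsoftware.kryo.Serializer can be found.
     */
    static Object newConfiguredKryo(ClassLoader kryoLoader, Class<?> type,
                                    String serializerClassName)
            throws ReflectiveOperationException {
        Class<?> kryoClass = kryoLoader.loadClass("com.esotericsoftware.kryo.Kryo");
        Object kryo = kryoClass.getConstructor().newInstance();
        Class<?> serializerClass = kryoLoader.loadClass(serializerClassName);
        kryoClass.getMethod("addDefaultSerializer", Class.class, Class.class)
                 .invoke(kryo, type, serializerClass);
        return kryo;
    }
}
```

Note that URLClassLoader delegates to its parent first. If the serializer class is also visible to the parent, the parent's copy wins and the same NoClassDefFoundError is likely to come back, so the serializer JAR has to live only on the child loader.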


(3) Using an intermediate serialization

Currently the most promising approach seems to be using an independent intermediate serialization, e.g. JSON using Gson or similar, and then running two separate jobs:

  1. kryo:2.21.2-ourpatchbranch in our regular store -> JSON in a temporary store
  2. JSON in the temporary store -> kryo:2.22 in our regular store

Problems
The biggest problem with this solution is the fact that it roughly doubles the space consumption of the data processed. Moreover, we need another serialization method which works without problems on all of our data, which we would need to investigate first.

Solution

I would use the multiple classloaders approach.

(Package renaming will also work. It does seem ugly, but this is a one-off hack, so beauty and correctness can take a back seat. Intermediate serialization seems risky - there is a reason you are using Kryo, and that reason will be negated by using a different intermediate form.)

The overall design would be:

child classloaders:      Old Kryo     New Kryo   <-- both with simple wrappers
                                \       /
                                 \     /
                                  \   /
                                   \ /
                                    |
default classloader:    domain model; controller for the re-serialization

  1. Load the domain object classes in the default classloader
  2. Load a Jar with the modified Kryo version and wrapper code. The wrapper has a static 'main' method with one argument: The name of the file to deserialize. Call the main method via reflection from the default classloader:

        Class<?> deserializer = deserializerClassLoader.loadClass("com.example.deserializer.Main");
        Method mainIn = deserializer.getMethod("main", String.class);
        Object graph = mainIn.invoke(null, "/path/to/input/file");
    

    1. This method:

      1. Deserializes the file as one object graph
      2. Places the object into a shared space. A ThreadLocal is one simple option; returning it to the caller, as in the snippet above, is another.

  3. When the call returns, load a second Jar with the new serialization framework and a simple wrapper, using a separate classloader. The wrapper has a static 'main' method that takes the object graph and the name of the file to serialize into. Call the main method via reflection from the default classloader:

        Class<?> serializer = serializerClassLoader.loadClass("com.example.serializer.Main");
        Method mainOut = serializer.getMethod("main", Object.class, String.class);
        mainOut.invoke(null, graph, "/path/to/output/file");
    

    1. This method:

      1. Takes the object passed as an argument (or retrieves it from the ThreadLocal)
      2. Serializes the object and writes it to the file
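
The two reflective lookups in the steps above could be factored into one helper. This is a minimal sketch, assuming wrapper classes like the ones named in the snippets; WrapperEntryPoints and findStatic are hypothetical names, not part of any library. It also checks that the method is static, because invoking an instance method with a null receiver would only fail later with a NullPointerException.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

final class WrapperEntryPoints {

    /** Finds a public static method on a class loaded through the given loader. */
    static Method findStatic(ClassLoader loader, String className,
                             String methodName, Class<?>... parameterTypes)
            throws ClassNotFoundException, NoSuchMethodException {
        // Resolve the wrapper through the child loader, never the default one.
        Class<?> wrapper = Class.forName(className, true, loader);
        Method method = wrapper.getMethod(methodName, parameterTypes);
        if (!Modifier.isStatic(method.getModifiers())) {
            throw new NoSuchMethodException(className + "." + methodName + " is not static");
        }
        return method;
    }
}
```

With this helper, mainIn would be looked up as findStatic(deserializerClassLoader, "com.example.deserializer.Main", "main", String.class), and mainOut the same way against the loader that holds the new Kryo version.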

Considerations

In the code fragments, one classloader is created for each object serialization and deserialization. You probably want to load the classloaders only once, discover the main methods and loop over the files, something like:

for (String file: files) {
    Object graph = mainIn.invoke(null, file + ".in");
    mainOut.invoke(null, graph, file + ".out");
}
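
When looping over many files, it may help to keep one bad file from aborting the whole run. The sketch below is one way to do that, assuming mainIn and mainOut have the signatures shown earlier; ReserializeLoop and convertAll are hypothetical names. Method.invoke wraps anything the wrapper throws in an InvocationTargetException, so the real cause has to be unwrapped.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

final class ReserializeLoop {

    /**
     * Converts every file and returns a description of each file that failed.
     * mainIn is expected to be static (String) -> Object,
     * mainOut static (Object, String) -> anything.
     */
    static List<String> convertAll(Method mainIn, Method mainOut, List<String> files)
            throws IllegalAccessException {
        List<String> failures = new ArrayList<>();
        for (String file : files) {
            try {
                Object graph = mainIn.invoke(null, file + ".in");
                mainOut.invoke(null, graph, file + ".out");
            } catch (InvocationTargetException e) {
                // The wrapper itself threw; getCause() is the original exception.
                failures.add(file + ": " + e.getCause());
            }
        }
        return failures;
    }
}
```

An IllegalAccessException is deliberately not caught: it would indicate a setup problem with the wrapper methods rather than a problem with a single file.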

Do the domain objects have any reference to any Kryo class? If so, you have difficulties:

  1. If the reference is just a class reference, e.g. to call a method, then the first use of the class will load one of the two Kryo versions into the default classloader. This will probably cause problems, as part of the serialization or deserialization might be performed by the wrong version of Kryo.
  2. If the reference is used to instantiate any Kryo objects and store the reference in the domain model (class or instance members), then Kryo will actually be serializing part of itself in the model. This may be a deal-breaker for this approach.

In either case, your first step should be to examine these references and eliminate them. One way to verify that you have done this is to ensure the default classloader does not have access to any Kryo version. If the domain objects reference Kryo in any way, the reference will fail (with a NoClassDefFoundError if the class is referenced directly, or a ClassNotFoundException if reflection is used).
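
That check can also be made explicit at job start-up. The sketch below assumes that the presence of com.esotericsoftware.kryo.Kryo is a good enough indicator for "some Kryo version is visible"; KryoVisibilityCheck and its methods are hypothetical names.

```java
final class KryoVisibilityCheck {

    /** Returns true if the loader (or one of its parents) can find the class. */
    static boolean isVisible(ClassLoader loader, String className) {
        try {
            // initialize=false: only look the class up, do not run static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** Fails fast if a class that should be isolated leaks into the given loader. */
    static void requireAbsent(ClassLoader loader, String className) {
        if (isVisible(loader, className)) {
            throw new IllegalStateException(
                    className + " is visible to " + loader + "; isolate it in a child loader");
        }
    }
}
```

Calling requireAbsent(Thread.currentThread().getContextClassLoader(), "com.esotericsoftware.kryo.Kryo") before creating the two child loaders would then stop the job early if Kryo has leaked onto the default classpath.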
