Error using Spark's Kryo serializer with Java protocol buffers that have arrays of strings
Problem description
I am hitting a bug when using Java protocol buffer classes as the object model for RDDs in Spark jobs.
For my application, my .proto file has properties that are repeated strings. For example:
message OntologyHumanName
{
repeated string family = 1;
}
From this, the 2.5.0 protoc compiler generates Java code like:
private com.google.protobuf.LazyStringList family_ = com.google.protobuf.LazyStringArrayList.EMPTY;
If I run a Scala Spark job that uses the Kryo serializer, I get the following error:
Caused by: java.lang.NullPointerException
at com.google.protobuf.UnmodifiableLazyStringList.size(UnmodifiableLazyStringList.java:61)
at java.util.AbstractList.add(AbstractList.java:108)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
... 40 more
The same code works fine with spark.serializer=org.apache.spark.serializer.JavaSerializer.
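For context, Kryo is typically enabled in a Spark job through SparkConf; a minimal sketch of the assumed setup (the app name is hypothetical):

```scala
import org.apache.spark.SparkConf

// Minimal sketch of enabling Kryo serialization for a Spark job.
// The app name and the rest of the job setup are assumed, not from the question.
val conf = new SparkConf()
  .setAppName("protobuf-rdd-job")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```
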
My environment is CDH QuickStart 5.5 with JDK 1.8.0_60.
Recommended answer
Try registering the Lazy class with Kryo:

Kryo kryo = new Kryo();
kryo.register(com.google.protobuf.LazyStringArrayList.class);
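In a Spark job you normally don't construct the Kryo instance yourself; the same registration can be done through SparkConf, which passes the classes to Spark's internal Kryo instance. A sketch, assuming the OntologyHumanName class generated from the question's .proto file:

```scala
import org.apache.spark.SparkConf

// Sketch: register the protobuf-internal list class (and the generated
// message class) with Spark's Kryo serializer via SparkConf.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[com.google.protobuf.LazyStringArrayList],
    classOf[OntologyHumanName] // generated protobuf class from the question's .proto
  ))
```
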
Also, for custom Protobuf messages, take a look at the solution in this answer for registering custom/nested classes generated by protoc.