Require kryo serialization in Spark (Scala)


Question


I have kryo serialization turned on with this:

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

I want to ensure that a custom class is serialized using kryo when shuffled between nodes. I can register the class with kryo this way:

conf.registerKryoClasses(Array(classOf[Foo]))

As I understand it, this does not actually guarantee that kryo serialization is used; if a serializer is not available, kryo will fall back to Java serialization.

To guarantee that kryo serialization happens, I followed this recommendation from the Spark documentation:

conf.set("spark.kryo.registrationRequired", "true")

But this causes IllegalArgumentException to be thrown ("Class is not registered") for a bunch of different classes which I assume Spark uses internally, for example the following:

org.apache.spark.util.collection.CompactBuffer
scala.Tuple3

Surely I do not have to manually register each of these individual classes with kryo? These serializers are all defined in kryo, so is there a way to automatically register all of them?

Solution

As I understand it, this does not actually guarantee that kryo serialization is used; if a serializer is not available, kryo will fall back to Java serialization.

No. If you set spark.serializer to org.apache.spark.serializer.KryoSerializer then Spark will use Kryo. If Kryo is not available, you will get an error. There is no fallback.

So what is this Kryo registration then?

When Kryo serializes an instance of an unregistered class it has to output the fully qualified class name. That's a lot of characters. Instead, if a class has been pre-registered, Kryo can just output a numeric reference to this class, which is just 1-2 bytes.
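The size difference can be seen with Kryo alone, outside of Spark. The following is a minimal sketch (assuming the com.esotericsoftware kryo artifact is on the classpath; the Point class is purely illustrative):

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output
import java.io.ByteArrayOutputStream

case class Point(x: Int, y: Int)

def serializedSize(kryo: Kryo, obj: AnyRef): Int = {
  val baos = new ByteArrayOutputStream()
  val out = new Output(baos)
  // writeClassAndObject records the class identity alongside the data
  kryo.writeClassAndObject(out, obj)
  out.close()
  baos.size()
}

// Unregistered: the fully qualified class name is written with the object.
val k1 = new Kryo()
k1.setRegistrationRequired(false)
val unregistered = serializedSize(k1, Point(1, 2))

// Registered: only a small numeric class id is written instead.
val k2 = new Kryo()
k2.setRegistrationRequired(false)
k2.register(classOf[Point])
val registered = serializedSize(k2, Point(1, 2))

// registered should come out noticeably smaller than unregistered,
// since the class name string is replaced by a 1-2 byte id.
```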

This is especially crucial when each row of an RDD is serialized with Kryo. You don't want to include the same class name for each of a billion rows. So you pre-register these classes. But it's easy to forget to register a new class and then you're wasting bytes again. The solution is to require every class to be registered:

conf.set("spark.kryo.registrationRequired", "true")

Now Kryo will never output full class names. If it encounters an unregistered class, that's a runtime error.

Unfortunately it's hard to enumerate all the classes that you are going to be serializing in advance. The idea is that Spark registers the Spark-specific classes, and you register everything else. You have an RDD[(X, Y, Z)]? You have to register classOf[scala.Tuple3[_, _, _]].
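Putting the answer together, a minimal SparkConf sketch might look like the following (Foo is the custom class from the question; the tuple registration covers the RDD[(X, Y, Z)] case):

```scala
import org.apache.spark.SparkConf

// Hypothetical custom class carried through shuffles.
case class Foo(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo-registration-example")
  // Use Kryo for all serialization; there is no fallback to Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Fail at runtime if any serialized class was not registered,
  // instead of silently writing full class names.
  .set("spark.kryo.registrationRequired", "true")

// Register your own classes plus the generic containers they travel in.
conf.registerKryoClasses(Array(
  classOf[Foo],
  classOf[scala.Tuple3[_, _, _]]
))
```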

The list of classes that Spark registers actually includes CompactBuffer, so if you see an error for that, you're doing something wrong. You are bypassing the Spark registration procedure. You have to use either spark.kryo.classesToRegister or spark.kryo.registrator to register your classes. (See the config options. If you use GraphX, your registrator should call GraphXUtils.registerKryoClasses.)
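A custom registrator lets Spark's own registrations run first and then adds yours on top, instead of bypassing them. A sketch (MyRegistrator and Foo are illustrative names, not part of any API):

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical application class to register.
case class Foo(id: Long, name: String)

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Foo])
    // Arrays of a class are distinct types and need their own registration.
    kryo.register(classOf[Array[Foo]])
  }
}
```

It is wired up through the config option mentioned above, e.g. `conf.set("spark.kryo.registrator", classOf[MyRegistrator].getName)`.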
