Does Kryo help in SparkSQL?


Problem Description

Kryo helps improve the performance of Spark applications through its efficient serialization approach.
I'm wondering whether Kryo will help in the case of SparkSQL, and how I should use it.
In SparkSQL applications, we do a lot of column-based operations like df.select($"c1", $"c2"), and the schema of a DataFrame Row is not quite static.
I'm not sure how to register one or several serializer classes for this use case.

For example:

case class Info(name: String, address: String)
...
import spark.implicits._ // required for .toDF and the $"col" syntax

val df = spark.sparkContext.textFile(args(0))
  .map(_.split(','))
  .filter(_.length >= 2)
  .map(e => Info(e(0), e(1)))
  .toDF()
df.select($"name") ...    // followed by subsequent analysis
df.select($"address") ... // followed by subsequent analysis

I don't think it's a good idea to define case classes for each select.
Or does it help if I register Info, e.g. registerKryoClasses(Array(classOf[Info]))?
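
For reference, that registration would be wired into the SparkConf before the session is created; a minimal sketch (the app name is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Info(name: String, address: String)

val conf = new SparkConf()
  .setAppName("kryo-example") // illustrative
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Info])) // registers Info with the Kryo serializer
val spark = SparkSession.builder().config(conf).getOrCreate()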

Recommended Answer

According to Spark's documentation, SparkSQL does not use Kryo or Java serialization.

Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use a specialized Encoder to serialize the objects for processing or transmitting over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations like filtering, sorting and hashing without deserializing the bytes back into an object.
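
A minimal, self-contained sketch of what that means in practice (reusing the Info case class from the question; the local master is just for demonstration): the encoder for a case class is derived automatically, with no Kryo registration involved, and it carries the exact schema of the row:

import org.apache.spark.sql.{Encoders, SparkSession}

case class Info(name: String, address: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The product encoder is derived from the case class; nothing to register.
Encoders.product[Info].schema.printTreeString()
// root
//  |-- name: string (nullable = true)
//  |-- address: string (nullable = true)

// Column operations run on the encoded (binary) format without
// deserializing the full object back into an Info instance.
val ds = Seq(Info("a", "b")).toDS()
ds.select($"name").show()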

They are much more lightweight than Java or Kryo serialization, which is to be expected: serializing, say, a Row of three longs and two ints is a far more optimizable job than serializing a class, its version description, its inner variables, and so on, and then having to instantiate it.

That being said, there is a way to use Kryo as an encoder implementation; see for example here: How to store custom objects in Dataset?. But this is meant as a solution for storing custom objects (e.g. non-product classes) in a Dataset, and is not especially targeted at standard DataFrames.
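
A minimal sketch of that approach, using a hypothetical non-product class (Legacy is illustrative):

import org.apache.spark.sql.Encoders

// Hypothetical class with no built-in encoder (not a case class / Product).
class Legacy(val name: String, val address: String)

// Encoders.kryo serializes the whole object into a single binary column.
implicit val legacyEncoder = Encoders.kryo[Legacy]

val ds = spark.createDataset(Seq(new Legacy("a", "b")))
ds.schema.printTreeString()
// root
//  |-- value: binary (nullable = true)

Because the whole object is stored as one opaque binary blob, column-based operations like select($"name") are no longer available on such a Dataset, which is why this is aimed at custom objects rather than at standard DataFrames.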

Without Kryo or Java serializers, creating encoders for custom, non-product classes is somewhat limited (see the discussions on user-defined types), for example, starting here: Does Apache spark 2.2 supports user-defined type (UDT)?
