Does Kryo help in SparkSQL?


Problem description


Kryo helps improve the performance of Spark applications through its efficient serialization approach.
I'm wondering whether Kryo will help in the case of SparkSQL, and how I should use it.
In SparkSQL applications, we do a lot of column-based operations like df.select($"c1", $"c2"), and the schema of a DataFrame Row is not quite static.
I'm not sure how to register one or several serializer classes for this use case.

For example:

case class Info(name: String, address: String)
...
import spark.implicits._  // required for .toDF and the $"column" syntax

val df = spark.sparkContext.textFile(args(0))
  .map(_.split(','))
  .filter(_.length >= 2)       // skip malformed lines
  .map(e => Info(e(0), e(1)))
  .toDF

df.select($"name") ...    // followed by subsequent analysis
df.select($"address") ... // followed by subsequent analysis


I don't think it's a good idea to define a case class for each select.
Or would it help if I registered Info, as in registerKryoClasses(Array(classOf[Info]))?
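
(For context, the following is a minimal sketch of how Kryo registration is normally wired up for RDD workloads, assuming a standard SparkConf setup; the app name is a placeholder, and whether this registration has any effect on DataFrame operations is exactly what the answer below addresses.)

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Info(name: String, address: String)

// Enable the Kryo serializer and register the classes it should handle;
// registration avoids writing the full class name with every object.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Info]))

val spark = SparkSession.builder
  .appName("kryo-registration-sketch") // placeholder app name
  .config(conf)
  .getOrCreate()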

Recommended answer


According to Spark's documentation, SparkSQL does not use Kryo or Java serialization:


Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use a specialized Encoder to serialize the objects for processing or transmitting over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations like filtering, sorting and hashing without deserializing the bytes back into an object.


Encoders are much more lightweight than Java or Kryo serialization, which is to be expected: it is a far more optimizable job to serialize, say, a Row of three longs and two ints than a class with its version description, its inner variables, and so on, and then having to instantiate it.
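
To make the quoted point concrete, here is a minimal sketch (using the standard Encoders.product API; Info is the case class from the question) that materializes such an encoder explicitly and inspects the schema it serializes against:

import org.apache.spark.sql.{Encoder, Encoders}

case class Info(name: String, address: String)

// Encoders for product types (case classes) are derived automatically by
// Spark; Encoders.product materializes one explicitly. The encoder carries
// the exact schema of the type, which is what lets Spark filter, sort and
// hash on the serialized bytes without deserializing them first.
val infoEncoder: Encoder[Info] = Encoders.product[Info]
println(infoEncoder.schema)
// StructType(StructField(name,StringType,true), StructField(address,StringType,true))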


That being said, there is a way to use Kryo as an encoder implementation; see, for example: How to store custom objects in Dataset?. But this is meant as a solution for storing custom objects (e.g. non-product classes) in a Dataset, and it is not especially targeted at standard DataFrames.
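
A minimal sketch of that approach, using the standard Encoders.kryo API (LegacyInfo is a hypothetical non-product class introduced here purely for illustration, and spark is the SparkSession from the question):

import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical class that is not a case class, so the built-in
// product encoders cannot derive an encoder for it.
class LegacyInfo(val name: String, val address: String) extends Serializable

// Encoders.kryo serializes each object into a single opaque binary column;
// Catalyst can no longer see individual fields, so optimizations such as
// column pruning do not apply.
implicit val legacyEncoder: Encoder[LegacyInfo] = Encoders.kryo[LegacyInfo]

val ds = spark.createDataset(Seq(new LegacyInfo("Jane", "Main St.")))
ds.printSchema()
// root
//  |-- value: binary (nullable = true)

That single binary column is why this works as a storage workaround but does not speed up ordinary column-based queries.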


Without Kryo or Java serializers, creating encoders for custom, non-product classes is somewhat limited (see the discussions on user-defined types), for example, starting here: Does Apache spark 2.2 supports user-defined type (UDT)?

