Spark Encoders: when to use beans()
Problem description
I came across a memory management problem while using Spark's caching mechanism. I am currently utilizing Encoders with Kryo and was wondering if switching to beans would help me reduce the size of my cached dataset.
Basically, what are the pros and cons of using beans over Kryo serialization when working with Encoders? Are there any performance improvements? Is there a way to compress a cached Dataset apart from caching with the SER option?
For the record, I have found a similar topic that tackles the comparison between the two. However, it doesn't go into the details of this comparison.
Answer
Whenever you can. Unlike generic binary Encoders, which use general-purpose binary serialization and store whole objects as opaque blobs, Encoders.bean[T] leverages the structure of an object to provide a class-specific storage layout.
This difference becomes obvious when you compare the schemas created using Encoders.bean and Encoders.kryo.
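The contrast can be seen without even starting a SparkSession, since an Encoder exposes its schema directly. A minimal sketch (the Point bean class here is made up for illustration):

```scala
import org.apache.spark.sql.Encoders
import scala.beans.BeanProperty

// Hypothetical Java-style bean: no-arg constructor plus getters/setters,
// which is what Encoders.bean requires.
class Point(@BeanProperty var x: Int, @BeanProperty var y: Int) extends Serializable {
  def this() = this(0, 0)
}

object SchemaComparison {
  // The bean encoder maps each property to its own typed column.
  val beanSchema = Encoders.bean(classOf[Point]).schema

  // The Kryo encoder stores the whole object as a single opaque binary column.
  val kryoSchema = Encoders.kryo(classOf[Point]).schema

  def main(args: Array[String]): Unit = {
    beanSchema.printTreeString() // x and y show up as integer fields
    kryoSchema.printTreeString() // a lone binary "value" field
  }
}
```

Printing the two schemas side by side makes the "opaque blob" point concrete: the bean encoder yields one typed column per property, while the Kryo encoder yields a single binary column.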
Why does it matter?
- You get efficient field access using the SQL API, without any need for deserialization, and full support for all Dataset transformations.
- With transparent field serialization you can fully utilize columnar storage, including built-in compression.
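To illustrate the first point, a sketch assuming a local SparkSession (the Point bean with x and y properties is made up for illustration):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import scala.beans.BeanProperty

// Hypothetical bean with two integer properties.
class Point(@BeanProperty var x: Int, @BeanProperty var y: Int) extends Serializable {
  def this() = this(0, 0)
}

object FieldAccessDemo {
  def xValues(): Array[Int] = {
    val spark = SparkSession.builder().master("local[*]").appName("bean-fields").getOrCreate()
    import spark.implicits._
    try {
      val ds = spark.createDataset(Seq(new Point(1, 2), new Point(3, 4)))(Encoders.bean(classOf[Point]))

      // Because the bean encoder exposes x and y as real columns, the SQL API
      // reads them from the storage layout without deserializing any Point.
      // With Encoders.kryo the Dataset would expose only one binary column,
      // so select("x") would fail to resolve.
      ds.select($"x").as[Int].collect().sorted
    } finally {
      spark.stop()
    }
  }
}
```

The same column-level access is what lets a cached, bean-encoded Dataset benefit from columnar compression, whereas a Kryo-encoded one is just a column of blobs.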
So when should you use the kryo Encoder? In general, when nothing else works. Personally, I would avoid it completely for data serialization. The only really useful application I can think of is serialization of aggregation buffers (check for example How to find mean of grouped Vector columns in Spark SQL?).