Spark Encoders: when to use beans()


Question

I came across a memory management problem while using Spark's caching mechanism. I am currently using Encoders with Kryo and was wondering whether switching to beans would help me reduce the size of my cached Dataset.

Basically, what are the pros and cons of using beans over Kryo serialization when working with Encoders? Are there any performance improvements? Is there a way to compress a cached Dataset, apart from caching with the SER option?

For the record, I have found a similar topic that tackles the comparison between the two. However, it doesn't go into the details of that comparison.

Answer

Whenever you can. Unlike generic binary Encoders, which use general-purpose binary serialization and store whole objects as opaque blobs, Encoders.bean[T] leverages the structure of the object to provide a class-specific storage layout.

This difference becomes obvious when you compare the schemas created with Encoders.bean and Encoders.kryo.
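As a minimal sketch of that comparison (the bean class `Point` is a hypothetical example; only Spark on the classpath is assumed, no running session is needed), you can print both schemas side by side:

```scala
import org.apache.spark.sql.Encoders

// A simple Java-style bean: no-arg constructor plus getters/setters.
class Point extends Serializable {
  private var x: Double = 0.0
  private var y: Double = 0.0
  def getX: Double = x
  def setX(v: Double): Unit = { x = v }
  def getY: Double = y
  def setY(v: Double): Unit = { y = v }
}

object SchemaComparison {
  def main(args: Array[String]): Unit = {
    val beanEncoder = Encoders.bean(classOf[Point])
    val kryoEncoder = Encoders.kryo(classOf[Point])

    // Bean encoder: one real column per bean property,
    // something like struct<x: double, y: double>.
    println(beanEncoder.schema.treeString)

    // Kryo encoder: a single opaque binary column,
    // something like struct<value: binary>.
    println(kryoEncoder.schema.treeString)
  }
}
```

The bean schema exposes each property as its own typed column, while the Kryo schema collapses the whole object into one binary field, which is exactly why the optimizations below are unavailable with Kryo.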

Why does it matter?

  • You get efficient field access using the SQL API, without any need for deserialization, and full support for all Dataset transformations.
  • With transparent field serialization you can fully utilize columnar storage, including built-in compression.
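A small sketch of the first point, assuming a local SparkSession and a hypothetical bean class `Measurement`: with a bean encoder the SQL API can project a single field, while with Kryo the only column is the serialized blob.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical bean used for illustration.
class Measurement extends Serializable {
  private var id: Long = 0L
  private var reading: Double = 0.0
  def getId: Long = id
  def setId(v: Long): Unit = { id = v }
  def getReading: Double = reading
  def setReading(v: Double): Unit = { reading = v }
}

object FieldAccessDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("bean-vs-kryo")
      .getOrCreate()

    val m = new Measurement
    m.setId(1L)
    m.setReading(42.0)

    // Bean encoder: each field is a real column, so the SQL API can
    // project one field without deserializing the whole object.
    val beanDs = spark.createDataset(Seq(m))(Encoders.bean(classOf[Measurement]))
    beanDs.select("reading").show()

    // Kryo encoder: the only column is an opaque binary blob, so
    // select("reading") would fail with an AnalysisException here.
    val kryoDs = spark.createDataset(Seq(m))(Encoders.kryo(classOf[Measurement]))
    kryoDs.printSchema()

    spark.stop()
  }
}
```

Because the bean-encoded Dataset has real typed columns, caching it also benefits from Spark's columnar in-memory format and its built-in compression, which the single binary column cannot exploit.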

So when should you use the kryo Encoder? In general, when nothing else works. Personally, I would avoid it completely for data serialization. The only really useful application I can think of is serializing aggregation buffers (check, for example, How to find mean of grouped Vector columns in Spark SQL?).
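To illustrate that one legitimate use, here is a hypothetical running-mean Aggregator whose intermediate buffer is encoded with Kryo (the `MeanBuffer` class and `MeanAgg` name are illustrative, not from the original answer):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Mutable intermediate state; it never surfaces as queryable columns,
// so an opaque Kryo encoding is acceptable here.
case class MeanBuffer(var sum: Double, var count: Long)

object MeanAgg extends Aggregator[Double, MeanBuffer, Double] {
  def zero: MeanBuffer = MeanBuffer(0.0, 0L)

  def reduce(b: MeanBuffer, a: Double): MeanBuffer = {
    b.sum += a; b.count += 1; b
  }

  def merge(b1: MeanBuffer, b2: MeanBuffer): MeanBuffer = {
    b1.sum += b2.sum; b1.count += b2.count; b1
  }

  def finish(b: MeanBuffer): Double =
    if (b.count == 0) 0.0 else b.sum / b.count

  // The buffer is internal state, so Kryo's opaque blob is fine here;
  // the output is a plain Double with a structured encoder.
  def bufferEncoder: Encoder[MeanBuffer] = Encoders.kryo[MeanBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```

In Spark 3 such an Aggregator can be registered as a UDAF via org.apache.spark.sql.functions.udaf and used in groupBy aggregations; the key point is that only the transient buffer, never the stored data, pays the cost of the opaque Kryo layout.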

