Spark Encoders: when to use beans()


Question

I came across a memory management problem while using Spark's caching mechanism. I am currently using Encoders with Kryo and was wondering whether switching to beans would help me reduce the size of my cached Dataset.

Basically, what are the pros and cons of using beans over Kryo serialization when working with Encoders? Are there any performance improvements? Is there a way to compress a cached Dataset, apart from caching with the SER option?

For the record, I have found a similar topic that tackles the comparison between the two. However, it doesn't go into the details of that comparison.

Answer

Whenever you can. Unlike generic binary Encoders, which use general-purpose binary serialization and store whole objects as opaque blobs, Encoders.bean[T] leverages the structure of the object to provide a class-specific storage layout.

This difference becomes obvious when you compare the schemas created with Encoders.bean and Encoders.kryo.
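As a minimal sketch of that comparison (the bean class `Point` is a hypothetical example; only Spark on the classpath is assumed, no running session is needed), you can print both schemas side by side:

```scala
import org.apache.spark.sql.Encoders

// A simple Java-style bean: no-arg constructor plus getters/setters.
class Point extends Serializable {
  private var x: Double = 0.0
  private var y: Double = 0.0
  def getX: Double = x
  def setX(v: Double): Unit = { x = v }
  def getY: Double = y
  def setY(v: Double): Unit = { y = v }
}

object SchemaComparison {
  def main(args: Array[String]): Unit = {
    val beanEncoder = Encoders.bean(classOf[Point])
    val kryoEncoder = Encoders.kryo(classOf[Point])

    // Bean encoder: one real column per bean property,
    // something like struct<x: double, y: double>.
    println(beanEncoder.schema.treeString)

    // Kryo encoder: a single opaque binary column,
    // something like struct<value: binary>.
    println(kryoEncoder.schema.treeString)
  }
}
```

The bean schema exposes each property as its own typed column, while the Kryo schema collapses the whole object into one binary field, which is exactly why the optimizations below are unavailable with Kryo.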

Why does it matter?

  • You get efficient field access using the SQL API, without any need for deserialization, and full support for all Dataset transformations.
  • With transparent field serialization you can fully utilize columnar storage, including built-in compression.
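A small sketch of the first point, assuming a local SparkSession and a hypothetical bean class `Measurement`: with a bean encoder the SQL API can project a single field, while with Kryo the only column is the serialized blob.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical bean used for illustration.
class Measurement extends Serializable {
  private var id: Long = 0L
  private var reading: Double = 0.0
  def getId: Long = id
  def setId(v: Long): Unit = { id = v }
  def getReading: Double = reading
  def setReading(v: Double): Unit = { reading = v }
}

object FieldAccessDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("bean-vs-kryo")
      .getOrCreate()

    val m = new Measurement
    m.setId(1L)
    m.setReading(42.0)

    // Bean encoder: each field is a real column, so the SQL API can
    // project one field without deserializing the whole object.
    val beanDs = spark.createDataset(Seq(m))(Encoders.bean(classOf[Measurement]))
    beanDs.select("reading").show()

    // Kryo encoder: the only column is an opaque binary blob, so
    // select("reading") would fail with an AnalysisException here.
    val kryoDs = spark.createDataset(Seq(m))(Encoders.kryo(classOf[Measurement]))
    kryoDs.printSchema()

    spark.stop()
  }
}
```

Because the bean-encoded Dataset has real typed columns, caching it also benefits from Spark's columnar in-memory format and its built-in compression, which the single binary column cannot exploit.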

So when should you use the kryo Encoder? In general, when nothing else works. Personally, I would avoid it completely for data serialization. The only really useful application I can think of is serializing aggregation buffers (check, for example, How to find mean of grouped Vector columns in Spark SQL?).
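To illustrate that one legitimate use, here is a hypothetical running-mean Aggregator whose intermediate buffer is encoded with Kryo (the `MeanBuffer` class and `MeanAgg` name are illustrative, not from the original answer):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Mutable intermediate state; it never surfaces as queryable columns,
// so an opaque Kryo encoding is acceptable here.
case class MeanBuffer(var sum: Double, var count: Long)

object MeanAgg extends Aggregator[Double, MeanBuffer, Double] {
  def zero: MeanBuffer = MeanBuffer(0.0, 0L)

  def reduce(b: MeanBuffer, a: Double): MeanBuffer = {
    b.sum += a; b.count += 1; b
  }

  def merge(b1: MeanBuffer, b2: MeanBuffer): MeanBuffer = {
    b1.sum += b2.sum; b1.count += b2.count; b1
  }

  def finish(b: MeanBuffer): Double =
    if (b.count == 0) 0.0 else b.sum / b.count

  // The buffer is internal state, so Kryo's opaque blob is fine here;
  // the output is a plain Double with a structured encoder.
  def bufferEncoder: Encoder[MeanBuffer] = Encoders.kryo[MeanBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```

In Spark 3 such an Aggregator can be registered as a UDAF via org.apache.spark.sql.functions.udaf and used in groupBy aggregations; the key point is that only the transient buffer, never the stored data, pays the cost of the opaque Kryo layout.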

