When to use Kryo serialization in Spark?


Question

I am already compressing RDDs using conf.set("spark.rdd.compress","true") and persist(MEMORY_AND_DISK_SER). Will using Kryo serialization make the program even more efficient, or is it not useful in this case? I know that Kryo is for sending the data between the nodes in a more efficient way. But if the communicated data is already compressed, is it even needed?

Recommended answer

Both of the RDD states you described (compressed and persisted) use serialization. When you persist an RDD, you are serializing it and saving it to disk (in your case, compressing the serialized output as well). You are right that serialization is also used for shuffles (sending data between nodes): any time data needs to leave a JVM, whether it's going to local disk or through the network, it needs to be serialized.
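
To make the distinction concrete, here is a minimal Java sketch that uses only the standard library (no Spark, no Kryo). It treats serialization and compression as two separate stages: a serializer turns objects into bytes, and compression then runs on those bytes. The class and method names are hypothetical, the hand-written encoding is only a stand-in for a more compact serializer, and GZIP stands in for whatever compression codec Spark is configured to use.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

final class SerializeThenCompress {

    // Stage 1a: standard Java serialization of the whole list.
    static byte[] javaSerialize(List<Integer> values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(values));
        }
        return bytes.toByteArray();
    }

    // Stage 1b: a hand-written compact encoding (element count, then raw 4-byte ints).
    // This is only a stand-in for a more compact serializer; it is not Kryo.
    static byte[] compactSerialize(List<Integer> values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.size());
            for (int v : values) {
                out.writeInt(v);
            }
        }
        return bytes.toByteArray();
    }

    // Stage 2: compression works on bytes that a serializer has already produced.
    static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(input);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            values.add(i);
        }
        byte[] javaBytes = javaSerialize(values);
        byte[] compactBytes = compactSerialize(values);

        // Print sizes before and after compression for both encodings.
        System.out.println("java serialized:    " + javaBytes.length + " bytes");
        System.out.println("compact serialized: " + compactBytes.length + " bytes");
        System.out.println("java + gzip:        " + gzip(javaBytes).length + " bytes");
        System.out.println("compact + gzip:     " + gzip(compactBytes).length + " bytes");
    }
}
```

Running it on your own record types gives a rough feel for how much the choice of serializer matters before compression is applied, and how much of the gap compression closes afterwards. The numbers say nothing about Kryo itself, so treat them as an illustration of the two-stage pipeline only.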

Kryo is a heavily optimized serializer and outperforms the standard Java serializer in almost every case. In your case, you may already be using Kryo. You can check the Spark configuration parameter:

"spark.serializer" should be "org.apache.spark.serializer.KryoSerializer".

If it's not, you can set it in your application code with:

conf.set( "spark.serializer", "org.apache.spark.serializer.KryoSerializer" )
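
As a slightly fuller sketch in Java, assuming Spark is on the classpath, the check and the setting can be combined as follows. The class name KryoConfSketch and the application name are made up for illustration. The fallback passed to conf.get is org.apache.spark.serializer.JavaSerializer, which the Spark configuration documentation lists as the default when spark.serializer is unset.

```java
import org.apache.spark.SparkConf;

final class KryoConfSketch {

    // Keep RDD compression on and switch the serializer to Kryo.
    static SparkConf withKryo(SparkConf conf) {
        return conf
            .set("spark.rdd.compress", "true")
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("kryo-check");

        // Which serializer is configured right now? Falls back to the Java serializer name if unset.
        String before = conf.get("spark.serializer", "org.apache.spark.serializer.JavaSerializer");
        System.out.println("serializer before: " + before);

        String after = withKryo(conf).get("spark.serializer");
        System.out.println("serializer after:  " + after);
    }
}
```

The setting has to be in place before the SparkContext is created; changing the SparkConf afterwards does not affect a context that is already running.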

Regarding your last question ("is it even needed?"), it's hard to make a general claim about that. Kryo optimizes one of the slow steps in communicating data, but it's entirely possible that in your use case, others are holding you back. But there's no downside to trying Kryo and benchmarking the difference!
