Do you benefit from the Kryo serializer when you use Pyspark?
Problem description
I read that the Kryo serializer can provide faster serialization when used in Apache Spark. However, I'm using Spark through Python.
Do I still get notable benefits from switching to the Kryo serializer?
Recommended answer
Kryo won't make a major impact on PySpark because it just stores data as byte[] objects, which are fast to serialize even with Java.
But it may be worth a try: you would just set the spark.serializer configuration and not try to register any classes.
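If you want to try it, a minimal sketch of switching the JVM-side serializer from PySpark might look like the following (the app name is just a placeholder; no class registration is needed):

```python
# Sketch: enable Kryo for the JVM side of a PySpark job via SparkConf.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("kryo-test")  # hypothetical app name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
)
sc = SparkContext(conf=conf)
```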
What might make more of an impact is storing your data as MEMORY_ONLY_SER and enabling spark.rdd.compress, which will compress your cached data.
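A sketch of that setup in PySpark, assuming a fresh SparkContext (app name is a placeholder; note that PySpark records are already pickled to bytes before they reach the JVM, so depending on your Spark version the Python StorageLevel may only expose MEMORY_ONLY rather than a separate MEMORY_ONLY_SER, which is why MEMORY_ONLY is used here):

```python
# Sketch: compress cached RDD partitions and persist data in serialized form.
from pyspark import SparkConf, SparkContext, StorageLevel

conf = (
    SparkConf()
    .setAppName("compressed-cache")     # hypothetical app name
    .set("spark.rdd.compress", "true")  # compress serialized cached partitions
)
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(1_000_000))
# PySpark data is stored as pickled bytes on the JVM side, so MEMORY_ONLY
# already caches serialized records; spark.rdd.compress then compresses them.
rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count()  # materialize the cache
```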
In Java this can add some CPU overhead, but Python runs quite a bit slower, so it might not matter. It might also speed up computation by reducing GC or letting you cache more data.
Reference: Matei Zaharia's answer on the mailing list.