Apache Spark performance tuning
Question
I am working on a project in which I have to tune Spark's performance. I have found the four parameters that seem most important for tuning Spark's performance. They are as follows:
- spark.memory.fraction
- spark.memory.offHeap.size
- spark.storage.memoryFraction
- spark.shuffle.memoryFraction
I want to know whether I am going in the right direction. Please also let me know if I have missed any other parameters.
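For context, settings like these are usually supplied through `spark-defaults.conf` or `--conf` flags on `spark-submit`. A minimal sketch of the four parameters from the question; the values are illustrative assumptions, not tuning recommendations:

```properties
# spark-defaults.conf -- illustrative values only
spark.memory.fraction          0.6
# Off-heap storage must be explicitly enabled for the size to take effect:
spark.memory.offHeap.enabled   true
spark.memory.offHeap.size      2g
spark.storage.memoryFraction   0.5
spark.shuffle.memoryFraction   0.2
```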
Thanks.
Answer
Honestly, this is quite broad to answer. The right path to optimizing performance is described mainly in the official documentation, in the section on Tuning Spark.
Generally speaking, there are many factors involved in optimizing Spark jobs:
- Data serialization
- Memory tuning
- Level of parallelism
- Memory usage of reduce tasks
- Broadcasting large variables
- Data locality
It is mainly centered on data serialization, memory tuning, and the trade-off between precision and approximation techniques to get the job done fast.
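As a concrete example of the first two factors, a common first step is to switch from Java serialization to Kryo and to raise the default parallelism. A hedged sketch; the buffer size and parallelism values below are assumptions that depend on the workload and cluster:

```properties
# Illustrative spark-defaults.conf entries for serialization and parallelism:
spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max   128m
spark.default.parallelism         200
```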
Courtesy of @zero323:
I'd point out that all but one of the options mentioned in the question are deprecated and are used only in legacy mode.
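To make the deprecation concrete: since Spark 1.6 the unified memory manager governs memory with `spark.memory.fraction` (and `spark.memory.storageFraction`), while the two `*.memoryFraction` settings from the question only apply when legacy mode is switched on. A sketch contrasting the two, with the default values as stated in the Spark configuration documentation:

```properties
# Current model (unified memory manager, Spark 1.6+):
spark.memory.fraction          0.6   # execution + storage share of usable heap
spark.memory.storageFraction   0.5   # storage portion protected from eviction

# Legacy settings -- ignored unless legacy mode is enabled:
spark.memory.useLegacyMode     true
spark.storage.memoryFraction   0.6
spark.shuffle.memoryFraction   0.2
```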