Spark job execution time exponentially increases with very wide dataset and number of columns
Problem description
I have created a fixed-width file import parser in Spark and performed a few execution tests on various datasets. It works fine up to 1000 columns, but as the number of columns and the fixed-width record length increase, Spark job performance degrades rapidly. Execution takes a very long time at 20k columns with a fixed-width record length of more than 100 thousand characters.
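The question does not include the parser code, but fixed-width parsing typically means slicing each record at cumulative column offsets, so the per-record work grows with the number of columns. A minimal plain-Python sketch of that slicing logic (the function name and sample widths are illustrative, not from the original post):

```python
def parse_fixed_width(line, widths):
    """Slice one fixed-width record into fields.

    One substring operation per column, so a 20k-column record
    costs 20k slices per row -- the work scales with column count.
    """
    fields = []
    pos = 0
    for w in widths:
        fields.append(line[pos:pos + w].strip())
        pos += w
    return fields

# Example record: a 10-char name, 2-char age, 5-char city.
record = "Alice     30NYC  "
print(parse_fixed_width(record, [10, 2, 5]))  # ['Alice', '30', 'NYC']
```

In Spark, building such columns one at a time (for example, with repeated `withColumn` calls) also grows the logical plan with every column, which is a common source of slowdowns on very wide schemas; projecting all substring expressions in a single `select` keeps the plan flat.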
What are the possible reasons for this? How can I improve the performance?
One of the similar issues I found: