Total size of serialized results of tasks is bigger than spark.driver.maxResultSize


Question

Good day.

I am running development code for parsing some log files. My code runs smoothly if I parse a small number of files, but as I increase the number of log files to parse, it returns different errors, such as too many open files and Total size of serialized results of tasks is bigger than spark.driver.maxResultSize.

I tried to increase spark.driver.maxResultSize, but the error still persists.

Can you give me any ideas on how to resolve this issue?

Thanks.

Answer

Total size of serialized results of tasks is bigger than spark.driver.maxResultSize means that when an executor tries to send its result to the driver, the result exceeds spark.driver.maxResultSize. One possible solution, as @mayank agrawal mentioned above, is to keep increasing that limit until it works (not a recommended solution if an executor is trying to send too much data).
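If you do choose to raise the limit, it can be set when building the session. A minimal sketch (the app name and the "4g" value are illustrative assumptions, not recommendations; "0" removes the limit entirely, which risks exhausting driver memory):

```python
from pyspark.sql import SparkSession

# Sketch: raise the cap on total serialized task results the driver
# will accept. "4g" is only an example value -- size it to your data
# and to the driver's available memory.
spark = (
    SparkSession.builder
    .appName("log-parser")  # hypothetical application name
    .config("spark.driver.maxResultSize", "4g")
    .getOrCreate()
)
```

The same setting can be passed on the command line via --conf spark.driver.maxResultSize=4g when you do spark-submit.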

I would suggest looking into your code to see whether the data is skewed, which would make one executor do most of the work and move a lot of data in and out. If the data is skewed, you could try repartitioning it.
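One common way to break up a skewed key before repartitioning is to "salt" it, i.e. append a random suffix so one hot key becomes several sub-keys that can land in different partitions. A minimal plain-Python sketch of the idea (the dataset, key names, and NUM_SALTS value are all hypothetical):

```python
import random
from collections import Counter

# Hypothetical skewed dataset: one "hot" key dominates.
rows = [("host-a", i) for i in range(1000)] + [("host-b", i) for i in range(10)]

NUM_SALTS = 4  # how many sub-keys each original key may be split into

def salt_key(key):
    """Append a random salt so a single hot key spreads over NUM_SALTS buckets."""
    return f"{key}#{random.randrange(NUM_SALTS)}"

salted = [(salt_key(k), v) for k, v in rows]

# Without salting, all 1000 "host-a" rows hash to the same partition;
# with salting they spread across up to NUM_SALTS partitions.
buckets = Counter(k for k, _ in salted)
```

In Spark you would apply the same transformation to the key column and then repartition on the salted key; any aggregation must first be done per salted key and then combined per original key.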

For the too many open files issue, a possible cause is that Spark may create a large number of intermediate files before a shuffle. This can happen if too many cores are used per executor, parallelism is high, or there are many unique keys (in your case, the likely cause is the huge number of input files). One solution to look into is consolidating those intermediate files with this flag when you do spark-submit: --conf spark.shuffle.consolidateFiles=true

One more thing to check is this thread (if it is similar to your use case): https://issues.apache.org/jira/browse/SPARK-12837

