Total size of serialized results of tasks is bigger than spark.driver.maxResultSize


Problem Description

Good day.

I am running development code for parsing some log files. My code runs smoothly when I parse only a few files, but as I increase the number of log files to parse, it returns different errors such as "too many open files" and "Total size of serialized results of tasks is bigger than spark.driver.maxResultSize".

I tried to increase spark.driver.maxResultSize, but the error still persists.
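For reference, this setting is typically raised either with --conf spark.driver.maxResultSize=4g on spark-submit or when building the SparkSession; the sketch below shows the latter, and the 4g value and app name are only placeholders, not what I actually used:

from pyspark.sql import SparkSession

# Minimal sketch: raise the driver result-size cap ("4g" is a placeholder value).
spark = (SparkSession.builder
         .appName("log-parser")                       # hypothetical app name
         .config("spark.driver.maxResultSize", "4g")
         .getOrCreate())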

Can you give me any ideas on how to resolve this issue?

Thank you.

Recommended Answer

"Total size of serialized results of tasks is bigger than spark.driver.maxResultSize" means that when an executor tries to send its result to the driver, the result exceeds spark.driver.maxResultSize. A possible solution, as mentioned above by @mayank agrawal, is to keep increasing it until you get it to work (not a recommended solution if an executor is trying to send too much data).
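For illustration, here is a minimal PySpark sketch of the pattern that typically triggers this error and a pattern that avoids it; the paths and the plain-text read are assumptions, not code from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Anti-pattern: collect() pulls every parsed row back to the driver and can
# exceed spark.driver.maxResultSize on large inputs.
# rows = spark.read.text("/logs/*.log").collect()

# Safer pattern: keep the data distributed and let the executors write it out.
parsed = spark.read.text("/logs/*.log")                        # hypothetical input path
parsed.write.mode("overwrite").parquet("/output/parsed_logs")  # hypothetical output path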

I would suggest looking into your code to see whether the data is skewed, making one of the executors do most of the work and resulting in a lot of data in/out. If the data is skewed, you could try repartitioning it.
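A minimal repartitioning sketch, continuing the hypothetical parsed DataFrame from above; the partition count of 200 is a placeholder, and the key column in the commented-out line is purely illustrative:

# Redistribute the rows evenly across 200 partitions (placeholder number).
balanced = parsed.repartition(200)

# If a single key dominates, repartitioning by that column can also help
# (hypothetical column name, so left commented out here).
# balanced = parsed.repartition(200, "host")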

For the "too many open files" issue, a possible cause is that Spark might be creating a large number of intermediate files before the shuffle. This could happen if too many cores are being used per executor, with high parallelism, or with many unique keys (a possible cause in your case: the huge number of input files). One solution to look into is consolidating the huge number of intermediate files through this flag when you do spark-submit: --conf spark.shuffle.consolidateFiles=true
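If you prefer to set the flag in code rather than on the spark-submit command line, the same option can be passed when building the session; this is only a sketch, and the option has an effect only on Spark versions whose shuffle manager still supports it:

from pyspark.sql import SparkSession

# Same flag set programmatically instead of via spark-submit.
spark = (SparkSession.builder
         .config("spark.shuffle.consolidateFiles", "true")
         .getOrCreate())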

One more thing to check is this thread, if it looks similar to your use case: https://issues.apache.org/jira/browse/SPARK-12837
