Exceeding `spark.driver.maxResultSize` without bringing any data to the driver


Problem description

I have a Spark application that performs a large join:

val joined = uniqueDates.join(df, $"start_date" <= $"date" && $"date" <= $"end_date")

and then aggregates the resulting DataFrame down to one with maybe 13k rows. In the course of the join, the job fails with the following error message:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 78021 tasks is bigger than spark.driver.maxResultSize (2.0 GB)

This was happening before without setting spark.driver.maxResultSize, so I set spark.driver.maxResultSize=2G. Then I made a slight change to the join condition, and the error resurfaced.

While resizing the cluster, I also doubled the number of partitions the DataFrame is coalesced to, from .coalesce(256) to .coalesce(512), so I can't be sure the error isn't caused by that.

My question is: since I am not collecting anything to the driver, why should spark.driver.maxResultSize matter here at all? Is the driver's memory being used for something in the join that I'm not aware of?

Recommended answer

Just because you don't collect anything explicitly, it doesn't mean that nothing is collected. Since the problem occurs during a join, the most likely explanation is that the execution plan uses a broadcast join. In that case Spark first collects the data to the driver and then broadcasts it.
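
For illustration, here is a hypothetical variant of the question's join with an explicit broadcast hint (the broadcast() call is an assumption for the sake of the example; the original code does not show one) that makes this collection step visible:

import org.apache.spark.sql.functions.broadcast
// assumes the usual `import spark.implicits._` for the $ column syntax

// Forcing df to be broadcast: Spark first collects df to the driver, where the
// serialized result counts against spark.driver.maxResultSize, and then ships
// a copy to every executor.
val joined = uniqueDates.join(broadcast(df), $"start_date" <= $"date" && $"date" <= $"end_date")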

Depending on the configuration and the pipeline:

  • Make sure that spark.sql.autoBroadcastJoinThreshold is smaller than spark.driver.maxResultSize (see the sketch after this list).
  • Make sure you don't force a broadcast join on data of unknown size.
  • While nothing indicates it is the problem here, be careful when using Spark ML utilities. Some of these (most notably indexers) can bring significant amounts of data to the driver.
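
As a minimal sketch of the first bullet (assuming a SparkSession named spark; the 256 MB figure is illustrative, not a recommendation): spark.driver.maxResultSize is a driver property and has to be set when the application is launched, while the broadcast threshold can be adjusted at runtime.

// spark.driver.maxResultSize must be set before the driver starts, e.g.:
//   spark-submit --conf spark.driver.maxResultSize=2g ...

// The broadcast threshold (in bytes) can be changed at runtime; keep it well
// below the driver's result-size limit. 256 MB here is illustrative only:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 256L * 1024 * 1024)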

To determine whether broadcasting is indeed the problem, check the execution plan (see the explain() sketch below) and, if needed, remove broadcast hints and disable automatic broadcasts:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
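
For example (assuming the joined DataFrame from the question is still named joined), printing the physical plan shows whether a broadcast is involved. Because this join has no equality condition, a broadcast would typically appear as a BroadcastNestedLoopJoin node; exact plan formatting varies across Spark versions:

joined.explain()

// A broadcasted plan would contain a node along the lines of:
//   BroadcastNestedLoopJoin BuildRight, Inner, ((start_date <= date) AND (date <= end_date))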
