What is the maximum size for a broadcast object in Spark?

Question

When using the DataFrame broadcast function or the SparkContext broadcast function, what is the maximum object size that can be dispatched to all executors?

Recommended answer

The broadcast function:

The default is 10 MB, but we have used it with up to 300 MB; the limit is controlled by spark.sql.autoBroadcastJoinThreshold.
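As a minimal sketch of how that threshold could be raised, assuming a local SparkSession (the object name, app name, and the 300 MB value are illustrative, not a recommendation):

```scala
import org.apache.spark.sql.SparkSession

object BroadcastThresholdSketch {
  // 300 MB expressed in bytes; the default threshold is 10 MB (10485760 bytes).
  val ThresholdBytes: Long = 300L * 1024 * 1024

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only for illustration.
    val spark = SparkSession.builder()
      .appName("broadcast-threshold-sketch")
      .master("local[*]")
      .getOrCreate()

    // Tables whose estimated size is below this value become candidates
    // for automatic broadcast joins. Setting -1 disables auto-broadcast.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", ThresholdBytes)

    println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
    spark.stop()
  }
}
```

The same property can also be passed at submit time with --conf spark.sql.autoBroadcastJoinThreshold=314572800.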

AFAIK, it all depends on the memory available, so there is no definite answer. What I would say is that the broadcast side should be smaller than the large DataFrame, and you can estimate the size of a large or small DataFrame like below:

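import org.apache.spark.util.SizeEstimator

// Estimates the in-memory size, in bytes, of the object graph reachable from the argument.
// Note: for a DataFrame this measures the driver-side object, which may differ from the size of the data itself.
println(SizeEstimator.estimate(yourlargeorsmalldataframehere))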

Based on this estimate, you can pass a broadcast hint to the framework.
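A broadcast hint might look like the sketch below. The DataFrames, column names, and object name are made up for illustration; only org.apache.spark.sql.functions.broadcast and the join call come from the Spark API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastHintSketch {
  // Joins a large DataFrame with a small one, hinting that the small side
  // should be broadcast to every executor instead of shuffled.
  def joinWithHint(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical data: one million ids on the large side, two rows on the small side.
    val largeDf = spark.range(0, 1000000).toDF("id")
    val smallDf = Seq((1L, "a"), (2L, "b")).toDF("id", "label")

    largeDf.join(broadcast(smallDf), "id")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-hint-sketch")
      .master("local[*]")
      .getOrCreate()

    val joined = joinWithHint(spark)
    joined.explain() // the physical plan is expected to mention a broadcast join
    joined.show()
    spark.stop()
  }
}
```

The hint applies regardless of spark.sql.autoBroadcastJoinThreshold, so it is up to the caller to make sure the hinted side really fits in memory.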

Also have a look at the Scaladoc in sql/execution/SparkStrategies.scala, which says:

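  • Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable
    [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold, or if that side has an explicit broadcast hint (e.g. the user applied the
    [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcast and the
    other side will be streamed, with no shuffling performed. If both sides are below the threshold, broadcast the smaller side.
    If neither is smaller, BHJ is not used.
  • Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
  • Sort merge: if the matching join keys are sortable.
  • If there are no joining keys, join implementations are chosen with the following precedence:
    • BroadcastNestedLoopJoin: if one side of the join could be broadcast
    • CartesianProduct: for inner join
    • BroadcastNestedLoopJoin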
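The precedence described in the list above can be sketched as a small decision function. This is a simplified model in plain Scala, not Spark's actual planner code: the types Side and JoinInfo, their fields, and the fall-through from the keyed strategies to the keyless ones are assumptions made for illustration.

```scala
object JoinSelectionSketch {
  sealed trait JoinStrategy
  case object BroadcastHashJoin extends JoinStrategy
  case object ShuffleHashJoin extends JoinStrategy
  case object SortMergeJoin extends JoinStrategy
  case object BroadcastNestedLoopJoin extends JoinStrategy
  case object CartesianProduct extends JoinStrategy

  // Hypothetical description of one side of a join.
  final case class Side(estimatedBytes: Long, hasBroadcastHint: Boolean = false)

  // Hypothetical summary of the facts the quoted Scaladoc refers to.
  final case class JoinInfo(
      left: Side,
      right: Side,
      hasJoinKeys: Boolean,
      keysSortable: Boolean,
      partitionFitsHashTable: Boolean,
      isInnerJoin: Boolean)

  // A side is broadcastable if it is hinted or smaller than the threshold
  // (following the quoted wording, "smaller than").
  def canBroadcast(side: Side, thresholdBytes: Long): Boolean =
    side.hasBroadcastHint || side.estimatedBytes < thresholdBytes

  def choose(j: JoinInfo, thresholdBytes: Long): JoinStrategy = {
    val broadcastable =
      canBroadcast(j.left, thresholdBytes) || canBroadcast(j.right, thresholdBytes)

    if (j.hasJoinKeys && broadcastable) BroadcastHashJoin
    else if (j.hasJoinKeys && j.partitionFitsHashTable) ShuffleHashJoin
    else if (j.hasJoinKeys && j.keysSortable) SortMergeJoin
    // No usable joining keys: fall back to the second precedence list.
    else if (broadcastable) BroadcastNestedLoopJoin
    else if (j.isInnerJoin) CartesianProduct
    else BroadcastNestedLoopJoin
  }

  def main(args: Array[String]): Unit = {
    val tenMb = 10L * 1024 * 1024
    val big = Side(estimatedBytes = 5L * 1024 * 1024 * 1024)
    val small = Side(estimatedBytes = 1L * 1024 * 1024)
    val info = JoinInfo(big, small, hasJoinKeys = true, keysSortable = true,
      partitionFitsHashTable = false, isInnerJoin = true)
    println(choose(info, tenMb))
  }
}
```

The point of the sketch is only the ordering: a broadcastable side wins first, and the hint bypasses the size check entirely.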

Also have a look at other-configuration-options.

The broadcast shared variable also has a property, spark.broadcast.blockSize=4M. AFAIK, I have not seen a hard limit for this either.
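For the shared-variable case, a minimal sketch could look like this, assuming a local SparkContext (the lookup map, object name, and app name are invented for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastVariableSketch {
  // Broadcasts a small lookup table once and reads it inside a map over an RDD.
  def lookupAll(sc: SparkContext): Array[String] = {
    val lookup = Map(1 -> "one", 2 -> "two", 3 -> "three") // hypothetical data
    val lookupBc = sc.broadcast(lookup)

    sc.parallelize(Seq(1, 2, 3, 4))
      .map(k => lookupBc.value.getOrElse(k, "unknown"))
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("broadcast-variable-sketch")
      .setMaster("local[*]")
      // Size of the blocks a broadcast value is split into; 4m is the default.
      .set("spark.broadcast.blockSize", "4m")

    val sc = new SparkContext(conf)
    println(lookupAll(sc).mkString(", "))
    sc.stop()
  }
}
```

Note that spark.broadcast.blockSize controls how the value is chunked for transfer, not how large the value may be.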

For further information, please see TorrentBroadcast.scala.

However, you can have a look at the 2 GB issue, even though it is not officially stated in the docs (I was not able to find anything of this kind there). Please look at SPARK-6235, which is in the "IN PROGRESS" state, and SPARK-6235_Design_V0.02.pdf.
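To put the block size and the 2 GB figure in numbers, here is a small arithmetic sketch in plain Scala. It assumes the 4 MB block size mentioned above and takes 2 GB to mean the largest JVM byte array (Int.MaxValue bytes). The helper names are hypothetical, and whether a given Spark version actually hits this limit for broadcasts is not something the sketch checks.

```scala
object BroadcastSizeSketch {
  // Default spark.broadcast.blockSize of 4 MB, in bytes.
  val BlockSizeBytes: Long = 4L * 1024 * 1024

  // Largest length of a single JVM byte array, just under 2 GB.
  val ByteArrayLimit: Long = Int.MaxValue.toLong

  // Number of blocks needed for a serialized value (ceiling division).
  def blockCount(serializedBytes: Long, blockSize: Long = BlockSizeBytes): Long = {
    require(serializedBytes >= 0 && blockSize > 0, "sizes must be non-negative and block size positive")
    (serializedBytes + blockSize - 1) / blockSize
  }

  // True if the value would not fit into one JVM byte array.
  def exceedsByteArrayLimit(serializedBytes: Long): Boolean =
    serializedBytes > ByteArrayLimit

  def main(args: Array[String]): Unit = {
    val threeHundredMb = 300L * 1024 * 1024
    println(blockCount(threeHundredMb))            // prints 75
    println(exceedsByteArrayLimit(threeHundredMb)) // prints false
    println(exceedsByteArrayLimit(3L * 1024 * 1024 * 1024)) // prints true
  }
}
```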
